Summary: Parsers and Taggers

Ray Liere (lierer@mail.CS.ORST.EDU)
Mon, 8 Jul 1996 10:35:07 -0700

Regarding my recent inquiry about parsers and taggers, below I have
repeated my original posting for reference and then summarized
the responses that I received.

The "summarization" has been very conservative ... basically
removing/condensing the ">" references back to my original posting
and reformatting some long lines (which I have trouble reading
using my mailer ... so I fixed them since maybe others have the
same problem, too).

I use the term "original posting" for a posting that seeks information.
The other items in the summary are responses to the various original
postings.

I have at this point simply listed the responses in *chronological*
order, as (1) this is the most expeditious way of getting the information
out to everyone quickly, which I feel is the main consideration,
(2) it makes threads easier to follow, and (3) in all honesty, I did not
see an obviously better way to organize the information at this point.

For completeness, I have included all responses (even those that
were previously posted to the Corpora Mailing List).

Also, recall that
> a list of parsers was posted with respect to investigations
> on the use of them on PCs under windows in April 1996 [... most]
> responses were emailed to the original poster, but no copy was sent
> to corpora and no summary was posted.
The original poster emailed to me the responses he had received
to his posting. To keep the summary chronological, I have therefore
started with his April 1996 posting.

And I have also included an original posting by John Kirk (and responses
to it).

[Life should not be so complicated ... :-) ]

I plan to add to the summary as I come upon other parser/tagger-related
information, so continuing the chronological approach is probably not
the best ... if anyone has a suggestion that does not require massive
rewriting efforts, I would certainly appreciate an email. One possibility
is to have a section for each system, but that would mean replicating
responses that are applicable to 2+ systems. I do not have time to do
massive rewrites, but I do want to make the information readily available
to those interested.

My thanks to everyone who sent information -- it has been very helpful,
and it is gratifying to receive so much help! I have tried to be careful in
the summarization process, but I apologize if I have introduced any errors.

Ray Liere
lierer@mail.cs.orst.edu

===================================
++Original Posting {this one by: RS-FORSYTH@wpg.uwe.ac.uk (Richard Forsyth)}
++Subject: taggers
++Date: Thu, 18 Apr 1996 17:10:18 +0000

i am looking for a cheap robust part-of-speech tagger
for English that runs on PCs & handles plain ASCII text.
structural analysis is optional as long as the word
tagging is consistent & fairly accurate.
(Latin also wd be nice, but i'm not hopeful about that!)

i'm aware of some, such as micro-EYEBALL, CLAWS, ENGCG,
AUTASYS, & the `xerox tagger' but as far as i know
none of them quite fulfills my spec. E.g.
EYEBALL -- low-cost, but not fully automatic,
requires human interaction;
CLAWS -- only VAX Pascal source in public domain (?i think);
ENGCG -- good but expensive (c. $1500);
AUTASYS -- good but not cheap (c. £500);
Xerox tagger -- only runs under unix (?i think).

[latter feature is a disadvantage from my point of view as i'm
not very unix-literate & want to do most of the work on
my home PC.]

corrections &/or additions to the above imperfect
`knowledge base' wd be appreciated.

thanks,
richard forsyth (UWE Bristol).

===================================
++From: nick@comp.lancs.ac.uk

> CLAWS -- only VAX Pascal source in public domain (?i think);

This has moved on to C from Pascal, but unfortunately for you is Unix-
rather than PC-based.

There's a program called PC-KIMMO available somewhere; it may be shareware,
I'm not sure.

Nick.

===================================
++From: brill@crabcake.cs.jhu.edu (Eric Brill)

I have a tagger available from my web site:

http://www.cs.jhu.edu/~brill

It was developed for Unix, but has been used by many people on PCs.
(I think you need at least 16 meg of RAM, preferably 32 meg.) You
could see if it runs "out of the box" on your PC. If not, only minor
changes should be needed.

-Eric

===================================
++From: mosborne@csd.abdn.ac.uk (Miles Osborne)

Well, I suggest porting Brill's tagger (available from the CMU archive).
It's in C, so it should be fairly straightforward. A word of warning, though:
it needs lots of memory to run (i.e., I can't get it to run on my Linux box
using 8 megs of real memory).

Miles Osborne

===================================
++From: bro@grove.ufl.EDU (John Bro)

If you can compile C and run Perl, Eric Brill's rule-based tagger
will come close to filling the bill for you. I'm running it under Unix,
but the source code is freely available. It runs without human intervention,
once it's set up and has learned the rules relative to your corpus/tagset.

It's available in a number of archives, but also directly from Prof. Brill
brill@blaze.cs.jhu.edu
at:
ftp.cs.jhu.edu/pub/brill

Good luck!
John Bro
Linguistics
UF Gainesville

===================================
++From: johnca@cogs.susx.ac.uk (John Carroll)

WORKSHOP ON ROBUST PARSING - CALL FOR PAPERS
August 12 - 16, 1996

at ESSLLI'96
European Summer School in Logic, Language and Information
Prague, Czech Republic

BACKGROUND:
Parsing systems able to analyse natural language text robustly and
accurately at an appropriate level of detail would be of great value in
computer applications ranging from speech synthesis and document style
checking to message understanding and automatic translation. A number of
research groups worldwide are currently developing such systems, varying
in the depth of analysis from lexical parsing or tagging (identifying
syntactic features just of individual words), through shallow or phrasal
parsing (forming hierarchical syntactic structure but not exploiting
subcategorisation), to full parsers (which deal with unbounded
dependencies etc., and are able to recover predicate-argument structure).

To bring researchers in this area together to present and compare
state-of-the-art systems for robust parsing, a workshop will be held
August 12-16, 1996, during the first week of ESSLLI'96, the European
Summer School in Logic, Language and Information.

We invite the submission of papers describing implemented robust parsing
systems; also evaluations, comparisons, and critiques of different parsing
systems or technologies. The main aim of the workshop is to identify the
strengths and weaknesses in the diverse set of approaches currently being
investigated, and to discuss areas that require further work.

To facilitate comparison between systems, authors of accepted papers will
be supplied with a small corpus of 30 sentences and encouraged to run
these through their systems, using simple (supplied) criteria to evaluate
the results.

WORKSHOP STRUCTURE:
The workshop will consist of five 90-minute sessions, with two papers in each
session. Please note that speakers will be expected to register for ESSLLI
(thus being eligible to attend all other workshops, as well as the many
courses and symposia). There is a small amount of money available to go
towards the expenses of those who have no other source of funding.

ORGANISERS:
John Carroll, University of Sussex
and Ted Briscoe, University of Cambridge

SUBMISSION DETAILS:
Authors should submit an extended abstract (2000-3000 words) either
electronically or as hard copy. Electronic submissions must be either
plain ASCII text or a single LaTeX file. Your e-mail address should appear
on your paper, and unless requested otherwise, all further correspondence
will be conducted via e-mail.

SCHEDULE:
Submission Deadline: May 31
Notification of Acceptance: June 21
Final Papers for Inclusion in Proceedings: July 19
Workshop Dates: August 12-16, 1996

WORKSHOP SUBMISSIONS TO:
John Carroll
Cognitive and Computing Sciences,
University of Sussex,
Falmer, Brighton BN1 9QH, UK
E-mail: john.carroll@cogs.susx.ac.uk

SUMMER SCHOOL CONTACT:
ESSLLI'96,
UFAL MFF UK,
Malostranske' na'm. 25,
118 00 Praha 1,
Czech Republic
Fax: +42-2-2191-4-309
Phone: +42-2-2191-4-255
E-mail: esslli@ufal.mff.cuni.cz
WWW: http://ufal.ms.mff.cuni.cz

===================================
++From: richard.sutcliffe@ul.ie (Richard Sutcliffe)

> [regarding Brill tagger]
Yes you need a lot of memory.

r

===================================
++From: gelman@xsoft.xerox.com (Andrew Gelman)

Ken Beesley at Rank Xerox Research Centre in Grenoble passed on your note
about cheap, robust taggers. The Xerox tagger has been productized by us
at XSoft (in Palo Alto). It is available for SunOS 4.1.x, Solaris,
& Windows 16- or 32-bit. Language availability is English, French
and German right now, but the RXRC folks are building several other
European-language taggers.

A commercial license is too costly (I infer from your note) but if the
tagger is to be used for academic research only, a license can be
arranged for a nominal fee. Interested? Let me know.

Cheers,
Andrew Gelman

Manager, Lexical Technologies
XSoft
3400 Hillview Avenue
Palo Alto, CA 94304 USA
gelman@xsoft.xerox.com
Tel. 415 813-7194

===================================
++From: gelman@xsoft.xerox.com (Andrew Gelman)

We do have a PC version, but you've got to have Windows to run it.
There is a demo program that loads up the tagger and lets you type in
input or specify text file input.

I will send you a software evaluation form, which basically says you
won't use it in any untoward ways. I can include a copy of the PARC tagger
paper with it. Please give me a mailing address and we'll get going.

Andy

===================================
++Original Posting {this one by: jkirk@clio.arts.qub.ac.uk (John Kirk)}
++Subject: Commercial and internet taggers
++Date: 05/07/1996 03:51 pm (Tuesday)

Despite all that we hear about grammatical taggers, and for all the
proliferation of tagsets, am I right in thinking that for some ordinary
soul, not part of a major project developing taggers, but eager to tag some
texts, there is still only one commercial option and two internet options?

The commercial option is to BUY the original Nijmegen TOSCA tagger
(i.e. not the ICE tagger, which is only available to participants in that
project).

The INTERNET options are to send smallish amounts of material
EITHER to the ENGCG tagger in Helsinki (ENGCG@ling.helsinki.fi) OR to the
on-line Birmingham tagger (tagger@clg.bham.ac.uk). These appear to be free.

Can anyone add to my list? There must be a huge but still
unsatisfied need here, not least among students eager to work on their own
material. My query is simply about availability for tagging, particularly
(if possible) large amounts of text, preferably for free -- not about the
quality of the tagging.

I'll be glad to post a summary of replies.

With thanks again,

John Kirk

===================================
++From: U279206%HNYKUN11.bitnet@HEARN.nic.SURFnet.nl (Henk Barkema)

John Kirk wrote:
> The commercial option is to BUY the original Nijmegen TOSCA tagger
> (i.e. not the ICE tagger, which is only available to participants in that
> project).

The TOSCA ICE tagger can, in fact, be bought by anyone who wants a copy.
Its price is 500 Dutch Guilders (approx. 185 pounds Sterling).
In April we released a new, extended and improved
version of the ICE tagger, which is available for the same price.

By the end of August another tagger will be operational, the TOSCA
tagger-lemmatizer, which will make use of a more comprehensive
and more consistent tagset, and which will be accompanied by
a new tagging manual. Its price will be announced at the
time of its release.

For more information, please contact me, or send an e-mail message to:

tosca@let.kun.nl

Henk Barkema - University of Nijmegen.

.................................................
........ drs. H.L. Barkema (barkema@let.kun.nl) .
........ TOSCA Research Group ..................
....... Department of Language and Speech ....
...... Faculty of Arts .....................
..... University of Nijmegen .............
.... P.O. Box 9103 .....................
... 6500 HD - Nijmegen ...............
.. The Netherlands .................

===================================
++Original Posting (Ray Liere <lierer@mail.CS.ORST.EDU>):
++Subject: Parsers/taggers -- free and good?
++Date: Wed, 26 Jun 1996 11:43:29 -0700

I am interested in your opinions and *especially* your experiences
using any parser/tagger for English that has these characteristics:
- handles free text (by which I mean not in any way cleaned up -- such
as from newswires, technical reports, manuals, etc.)
- free
- source code is available, preferably in C or C++. I need source so that
I can port it to Linux (a version of UNIX that runs on PCs).
- reasonably accurate and easy to use

The ability to handle (ignore?) structure tags, such as title, body of
document, etc., is not a big deal, as I think that they would be easy
to strip out in a preprocessing step -- for instance, along the lines of
the sketch below.
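
For illustration only: assuming simple SGML-style angle-bracket markup
(an assumption on my part -- real markup can be messier), the stripping
step might be as small as this Python sketch:

    # Minimal sketch: remove SGML-style structure tags before tagging.
    # Assumes tags never contain ">" inside attribute values.
    import re
    import sys

    for line in sys.stdin:
        sys.stdout.write(re.sub(r"<[^>]*>", "", line))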

I realize that there have been a few relevant postings to this mailing list
of late -- a list of parsers was posted with respect to investigations
on the use of them on PCs under windows in April 1996, and Miles Osborne
posted a response suggesting the Brill parser. Unfortunately, other
responses were emailed to the original poster, but no copy was sent
to corpora and no summary was posted.

It seems that the topic of "what is a good cheap parser" comes up
periodically, so I would like to volunteer to gather people's
experiences -- good and bad -- and then post them.

Email your thoughts to me if you prefer (to save bandwidth) -- I will
post a summary of responses that I receive via email. If you prefer to
have your comments summarized anonymously, please indicate this
in your email.

Thanks.

Ray Liere
Department of Computer Science
Oregon State University, Corvallis, Oregon, USA
lierer@mail.cs.orst.edu

===================================
++From: "Evan L. Antworth" <Evan.Antworth@SIL.ORG>

Our PC-PATR program meets some of your criteria (free, C sources), but it's
a generalized parser for which you must supply the language-specific
grammar. A toy English grammar comes with it, intended for demonstration
purposes. I don't know of any wide-coverage PATR grammar for English that
is publicly available. Here's the address for more information:

http://www.sil.org/pcpatr/

--Evan

Evan Antworth | e-mail: evan.antworth@sil.org
Academic Computing Department | phone: 214-709-3346
Summer Institute of Linguistics | fax: 214-709-3363
7500 W. Camp Wisdom Road
Dallas, TX 75236
--------
World Wide Web: http://www.sil.org
Gopher: gopher.sil.org
FTP: ftp.sil.org [198.213.4.1]
Mailserver: mailserv@sil.org (send "help" message)

===================================
++From: Chris Brew <chrisbr@cogsci.ed.ac.uk>

The Language Technology Group (where I work, so this is advocacy,
right?) has an HMM-based part of speech tagger written in C++ called
LT-POS. Although this isn't released yet, I expect it will be released
under the same terms as our other software: academics can have a free
copy for research purposes, industrial research groups can pay a small
fee to use it for evaluation and research, while more substantial
commercial use needs separate negotiation. [That's just my summary; see
http://www.ltg.hcrc.ed.ac.uk/ for more detail.] LT-POS is developed
and maintained by Andrei.Mikheev@edinburgh.ac.uk, and commercial
inquiries should go to Marc.Moens@edinburgh.ac.uk.
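
For readers unfamiliar with the approach: an HMM tagger chooses, for
each sentence, the tag sequence maximising the product of transition
probabilities P(tag | previous tag) and emission probabilities
P(word | tag), computed efficiently by Viterbi decoding. A toy Python
sketch of the decoding step (every probability below is invented purely
for illustration and has nothing to do with LT-POS's actual model):

    # Toy sketch of what an HMM tagger computes: Viterbi decoding over
    # hand-specified transition and emission probabilities.
    import math

    tags = ["DET", "NOUN", "VERB"]

    # P(tag | previous tag); "START" is the initial state.
    trans = {
        ("START", "DET"): 0.6, ("START", "NOUN"): 0.3, ("START", "VERB"): 0.1,
        ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.05, ("DET", "VERB"): 0.05,
        ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
        ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
    }

    # P(word | tag); unseen pairs get a tiny floor probability below.
    emit = {
        ("the", "DET"): 0.7,
        ("dog", "NOUN"): 0.4,
        ("barks", "VERB"): 0.5, ("barks", "NOUN"): 0.1,
    }

    def viterbi(words):
        # best[i][t] = (log-prob of best path ending in tag t, backpointer)
        best = [{} for _ in words]
        for t in tags:
            p = trans.get(("START", t), 1e-6) * emit.get((words[0], t), 1e-6)
            best[0][t] = (math.log(p), None)
        for i in range(1, len(words)):
            for t in tags:
                e = emit.get((words[i], t), 1e-6)
                score, prev = max(
                    (best[i - 1][pt][0]
                     + math.log(trans.get((pt, t), 1e-6) * e), pt)
                    for pt in tags)
                best[i][t] = (score, prev)
        # Trace back from the best final tag.
        last = max(tags, key=lambda t: best[-1][t][0])
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(best[i][path[-1]][1])
        return list(reversed(path))

    print(viterbi("the dog barks".split()))  # -> ['DET', 'NOUN', 'VERB']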

LT-POS has a plain text mode and a mode which accepts SGML marked-up
documents. In the latter case avoiding structure tags is easy; in the
former, contrary to your claim, it is in general difficult, because
there are so many orthographic conventions for signalling titles,
section headings and so on. If you have documents with strict
conventions about these things the pre-processing will indeed be easy;
otherwise your mileage will vary. I always recommend working from
SGML marked-up documents when possible, since there are good tools,
including ours, for manipulating and querying such documents, and
because it is not hard to massage HTML documents into valid SGML,
giving access to much of the World-Wide Web.
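
To see why the marked-up route is easier: with markup, skipping
structure is a matter of selecting elements rather than guessing at
orthographic conventions. A rough Python sketch (the element names in
SKIP are just examples, and the parser handles HTML-ish input, not
full SGML):

    # Sketch: drop text inside structure elements, keep the rest.
    from html.parser import HTMLParser

    SKIP = {"title", "h1", "h2", "script", "style"}

    class TextOnly(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = 0       # nesting depth inside skipped elements
            self.chunks = []
        def handle_starttag(self, tag, attrs):
            if tag in SKIP:
                self.depth += 1
        def handle_endtag(self, tag):
            if tag in SKIP and self.depth:
                self.depth -= 1
        def handle_data(self, data):
            if self.depth == 0:
                self.chunks.append(data)

    p = TextOnly()
    p.feed("<html><title>Manual</title><p>Tag this text.</p></html>")
    print("".join(p.chunks).strip())  # -> Tag this text.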

People who take on the task of processing arbitrary text always find
that the work of pre-processing is much more demanding than they
anticipated. The Xerox part-of-speech tagger
(ftp://ftp.parc.xerox.com/pub/tagger) has an excellent tokeniser, and
can be applied to unprocessed files. However, you need a good enough
Common Lisp implementation (CMU Lisp is free and works on Suns, as
well as, I believe, NetBSD, but not Linux). LT-POS is designed to
provide a superset of these facilities, but obviously hasn't shaken
down to the same extent yet.

> [...] Miles Osborne
> posted a response suggesting the Brill parser.

Brill's system, like ours, is just a part-of-speech tagger. It assigns
categories to words, not syntax trees to sentences. The latter is a far
harder task, to the point where I am not really happy recommending any
free system as a tool for general text processing.

Brill's tagger wants sentences one per line; other free taggers,
including Helmut Schmid's tree tagger (from IMS, Stuttgart), want
words one per line (see the sketch below). Most of them will do something
moderately sensible with unknown words, so modulo tokenisation it is
probably fair to say that they can all handle free text to some extent.
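
To make the two input conventions concrete, a rough Python sketch (the
sentence splitter is deliberately naive -- a real tokeniser must handle
abbreviations, initials, numbers and so on):

    # Sketch: turn raw text into the two common tagger input formats.
    # The end-of-sentence rule is naive: it splits after . ! ? followed
    # by whitespace, so "Dr. Smith" would be split wrongly.
    import re

    text = "The dog barks. It barks loudly!"
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # One sentence per line (Brill-style input).
    print("\n".join(sentences))

    # One word per line, blank line between sentences (tree-tagger-style).
    for s in sentences:
        print("\n".join(s.split()))
        print()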

> It seems that the topic of "what is a good cheap parser" comes up
> periodically, so [...]

"Good" is a task-dependent term (actually, so is "cheap"). My strategy
would be:
- think hard about the task you have in mind, and look at some
tagger output to determine whether correct part-of-speech tagging,
if available, would meet your needs. (If you want to be sure about
this, you would need a pilot study, but introspection is usually
a useful guide.)

- look again at the tagger output, focussing on the 2-10% of errors
which you will probably encounter. Do these errors cause a problem
for your application?

- if by this stage you are still happy with POS tagging as a
solution, do a study to evaluate some taggers (I would suggest trying
LT-POS, Brill and Xerox, in that order, but I'm not claiming that
these are necessarily the best, since my choice is largely dictated
by convenience of use and installation in my current working
environment); a rough sketch of such an evaluation follows this list.

- if you really need a full parser, life gets harder. There is lots of
nice parsing technology, but nothing which is drop-dead useful. One of
the obvious candidates is Sleator and Temperley's Link Grammar
(<http://bobo.link.cs.cmu.edu/cgi-bin/grammar/build-intro-page.cgi>,
ftpable, free, C source code available, unconventional syntax annotations),
and another is the ENGCG system developed in Helsinki (not free, doesn't
produce much more detail than a POS tagger, but good). Again, you need
to evaluate the output against the needs of your task.
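
As an illustration of the evaluation step mentioned above, per-token
accuracy against a hand-corrected reference might be computed as below.
This assumes a hypothetical word/TAG format, one token per line, with
aligned files; the file names are placeholders:

    # Sketch: per-token tagging accuracy between two aligned files in a
    # hypothetical word/TAG format, one token per line.
    def read_tags(path):
        with open(path) as f:
            return [line.rsplit("/", 1)[1].strip()
                    for line in f if line.strip()]

    gold = read_tags("gold.tagged")      # hand-corrected reference
    system = read_tags("system.tagged")  # tagger output
    assert len(gold) == len(system), "token streams must align"

    correct = sum(g == s for g, s in zip(gold, system))
    print("accuracy: %.2f%%" % (100.0 * correct / len(gold)))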

Hope this helps a little

Chris

===================================
++From: Miles Osborne <mosborne@csd.abdn.ac.uk>

Hi there. A slight correction: I suggested the Brill *tagger* (but
I'm sure you knew that already). As for parsers, I'm not sure if there
are any. However, the link parser (at CMU I believe) may be in C.
It claims to handle unrestricted text, so it might be useful. I've no
idea how accurate it is though.

Good to see that others are using Linux too! As a by-the-way, you'll
need more than 8 megs to run Brill's tagger (unless of course you don't
mind listening to your disk thrash).

Miles

===================================
++From: E S Atwell <eric@scs.leeds.ac.uk>

Ray,
Before choosing a parser to re-use, you should check out

Richard Sutcliffe, Heinz-Detlev Koch, and Anne McElligott (eds),
"Industrial Parsing of Technical Manuals", Amsterdam: Rodopi, 1996.

This is a (to-be-)published version of
Richard Sutcliffe and Heinz-Detlev Koch (eds),
"International Workshop on Industrial Parsing of Software Manuals 1995",
University of Limerick, Ireland, 1995.

This workshop evaluated 9 different parsers which meet most of your
criteria:
- robust -- tested on a set of `real' sentences from software manuals
(preprocessing was allowed to `clean up' the input, though this was
documented)
- some of these at least are free
- code available, again at least for some; a couple of parsers were
`represented at 2nd hand', i.e. evaluated by researchers who had copied
them from the originators by ftp etc.
- as for accuracy and ease of use, see the book - each parser has a
chapter with tables of evaluation metrics.

One criterion you don't mention is:
- parse-tree output is in a suitable format and parsing scheme for my application

The parses output by the rival parsers looked very different, showing
different aspects of English grammar -- you should have a clear
idea of what the `ideal output' of a parser would be for you. E.g., do you
want phrase bracketing, phrase labelling, dependency links,
verb valency and subject/object functional relations, subcategory
features, deep/logical relations (e.g. raising), syntactic rank
information, special marking of abnormal syntax (e.g. headlines, speech),
etc.? It's not as easy as you might think to `map' a given
parser's output to your `target' preferred analysis scheme -- see
my chapter in the above book, or our WWW page for project AMALGAM
(Automatic Mapping Among Lexico-Grammatical Annotation Models):
http://agora.leeds.ac.uk/ccalas/amalgam.html

hope this helps,

regards,
Eric
___________

Eric Steven Atwell, Director,
Centre for Computer Analysis of Language And Speech (CCALAS)
Artificial Intelligence Division, School of Computer Studies
The University of Leeds, LEEDS LS2 9JT, Yorkshire, England
TEL:0113-2335761 FAX:0113-2335468 EMAIL:eric@scs.leeds.ac.uk
WWW: http://agora.leeds.ac.uk/scs/public/staff/eric.html
http://agora.leeds.ac.uk/ccalas/

===================================
++From: "J.L. Sancho, INSTITUTO DE LEXICOGRAFIA" <sancho@crea.rae.es>

Please find below a summary on Spanish (or language-independent)
taggers and beyond. Just in case.

Best,

Jose Luis Sancho

------------
Dear all:

A while back my colleague Maria Paula Santalla and I (Jose Luis
Sancho) posted an enquiry about corpus analysis resources for Spanish.
The following is a summary of what we have been referred to. We would
like to thank the following for their kind responses (order irrelevant):
Max Louwerse, Mike Scott, Carlos Subirats, Ken Litkowski, Jean V'eronis,
Yorick Wilks, Sandro Pedrazzini, John Aberdeen, Ana Mart'inez, Nuno Miguel
Cavalheiro Marques and Ken Beesley. This list exhausts our 'inbox'; we
therefore beg anyone else who responded and is not mentioned above to
forgive us (or our server) -- in that case, please retry. Note that the
enquiry was posted on various lists, hence information not necessarily
coming from this list may be quoted below. We apologize for any
duplications.

- Max Louwerse (<M.M.Louwerse@stud.let.ruu.nl>) told us about the Qualrs
list, on which a lot of tagging software has been discussed. As for
software, he mentioned NUDIST (Sage Publishers) and Notabene, whose
homepages are

http://sls-www.lcs.mit.edu/~flammia/Nb.html and
ftp://sls-www.lcs.mit.edu/pub/flammia/Nb

You can also email Giovanni Flammia (flammia@mit.edu).

- Mike Scott (<ms2928@ac.uk>) suggested

http://www.liv.ac.uk/~ms2928/wordsmit.html

This accesses WordSmith Tools (Oxford Univ. Press 1996).

- Carlos Subirats (<lali1@uab.es>) pointed to an 'Etiquetador y
desambiguizador del espanol' (a tagger and disambiguator for Spanish),
developed by the Laboratorio de Linguistica Informatica of the
Universidad Autonoma de Barcelona. The address provided is

Carlos Subirats Ruggeberg
Universidad Autonoma de Barcelona
Laboratorio de Linguistica Informatica
Edificio B
08193 Bellaterra, Spain

e-mail: c.subirats@oasis.uab.es
e-mail: c.subirats@cc.uab.es
Fax: (343)-581-16-86
Tel: (343)-581-22-29

- Ken Litkowski <71520.307@CompuServe.COM> directed us to some dictionary
utilities for creating and maintaining lexica. A description of this
software is available at

http://www.clres.com

- Jean V'eronis (<veronis@univ-aix.fr>) suggested a look at

http://www.lpl.univ-aix.fr/projects/multext/

and contacting Nuria Bel (nuria@gilcub.es).

- Yorick Wilks (<yorick@dcs.shef.ac.uk>) pointed to david@crl.nmsu.edu

- Sandro Pedrazzini (<sandro@idsia.ch>) pointed to a system with which you
can not only create and maintain lexica, but also generate different
kinds of taggers and lemmatizers. A description of it can be found
at

http://www.ifi.unibas.ch/grudo/grudo.html
http://www.idsia.ch/wordmanager.html

- John Aberdeen (<aberdeen@mitre.org>) mentioned a fast part-of-speech
tagger, based on Eric Brill's notion of transformation-based error-driven
learning.

- Ana Mart'inez (<sysnet@bitmailer.net>) mentioned MABLe, a 'multilingual
letter authoring tool'.

- Nuno Miguel Cavalheiro Marques (<nmm@di.fct.unl.pt>) brought to our
attention two POS taggers, one using Viterbi decoding with an HMM
and the other using neural networks. You can find a short review of
this work at

http://www-ia.di.fct.unl.pt/~nmm
http://www-ia.di.fct.unl.pt/~glint/Glint

There you can also access an article about POLARIS: a morphological
lexical acquisition and retrieval database system. Contacting Gabriel Lopes
(gpl@fct.unl.pt) was also suggested.

- Ken Beesley (<Ken.Beesley@Grenoble.RXRC.Xerox.com>) noted that the Rank
Xerox Research Centre in Grenoble, France has developed systems for
Spanish: tokenization (word/term division); morphological analysis (for
syntax or, in less detail, for tagging); a part-of-speech "guesser" (for
words not covered by the morphological analysis); and tagging (based on
an HMM tagger, trained on a corpus). You can experiment with the
morphological analysis and tagger at

http://www.xerox.fr/grenoble/mltt/home.html

Thank you very much again. See you on the net.

Jose Luis Sancho (sancho@crea.rae.es)
Maria Paula Santalla (santalla@crea.rae.es)

===================================
(end)