Corpora: Summary of POS tagger evaluation

Yen Ketty (yen_ketty@bah.com)
Mon, 08 Feb 1999 22:05:52 -0500

Dear netters,

The following is the summary of POS tagger evaluation.
Thank you to those who replied. Most of the replies
referred me to papers. I have also included recommendations and
a past summary related to my query.

Ketty Gann

ORIGINAL QUERY
________________

Dear netters,

I'm now evaluating several POS taggers for English and some other
European languages.

I'd like to know if there are any guidelines or approaches for evaluating
the performance of a POS tagger, and for comparisons among POS taggers,
such as accuracy.

Any recommendations are welcome. I'll post a summary later.

Thank you in advance,
Ketty Gann
Language Technology Manager
Booz.Allen & Hamilton Inc.
Linthicum, MD

PAPERS AND BOOKS:
_________________

"Syntactic Wordclass Tagging" Hans van Halteren (ed.) Kluwer Academic
Publishers (forthcoming in 1999).

Hans van Halteren, Jakub Zavrel and Walter Daelemans in Procs. 1998
COLING/ACL.

LREC conference proceedings in Granada Spain (1998).

Gilles Adda, Joseph Mariani, Josette Lecomte, Patrick Paroubek and
Martin Rajman, 1998, "The GRACE French Part-Of-Speech Tagging Evaluation
Task", First International Conference on Language Resources and
Evaluation (LREC'98), Granada, pp. 433-441.

Jan Hajič and Barbora Hladká, 1998, "Czech Language Processing / POS
Tagging", First International Conference on Language Resources and
Evaluation (LREC'98), (eds.) Antonio Rubio, Natividad Gallardo,
Rosa Castro and Antonio Tejada, Granada, 1998, ELRA.

Padro & Marquez, 1998, "On the evaluation and comparison of taggers: the
effect of noise in test corpora", in Procs. of COLING/ACL 98, Montreal,
Canada.

Christer Samuelsson & Atro Voutilainen, 1997, "Comparing a Linguistic
and a Stochastic Tagger", in Proc. of ACL97
(http://www.conexor.fi/e-acl97/e-acl97.html).
The tagger itself is at www.conexor.fi/analysers.html#testing

Leidner, Jochen, 1997, "Evaluation Tagger for English: Some Evidence",
Technical Report-971101, Friedrich-Alexander Universitaet
Erlangen-Nuernberg

Jean-Pierre Chanod and Pasi Tapanainen, 1995, "Tagging French -
comparing a statistical and a constraint-based method". In the
proceedings of the Seventh Conference of the European Chapter of the
Association for Computational Linguistics (EACL'95), pp. 149-156.
(http://xxx.lanl.gov/abs/cmp-lg/9503003)

Atro Voutilainen & Timo Jarvinen, 1995, "Specifying a shallow grammatical
representation for parsing purposes", in Proc. of EACL95
(http://xxx.lanl.gov/abs/cmp-lg/9502011)

D. Elworthy, 1995, "Tagset Design and Inflected Languages", Proceedings of
the ACL SIGDAT workshop "From Text to Tags: Issues in Multilingual
Language Analysis", Dublin, 1995

Pasi Tapanainen and Atro Voutilainen. "Tagging accurately - Don't guess
if you know." In the proceedings of the 4th Conference on Applied
Natural Language Processing (ANLP'94). pages 47-52. Association for
Computational Linguistics, Stuttgart, 1994.
(http://xxx.lanl.gov/abs/cmp-lg/9408009)

EVALUATIONS
____________
Evaluation of French Taggers
"GRACE 1" ( http://m17.limsi.fr/TLP/grace/ POC: Patrick Paroubek
<pap@ciril.fr>)

GUIDELINES
____________
EAGLES (http://www.ilc.cnt.it/EAGLES96/home.html)
Anne Schiller (Anne.Schiller@xrce.xerox.com) and Simone Teufel
(simone@cogsci.ed.ac.uk) have been working on EAGLES guidelines.

DOWNLOADABLE TAGGERS & BENCHMARK
____________________
Statistical Natural Language Processing and Corpus-based Computational
Linguistics: An Annotated List of Resources
(http://www.sultry.arts.su.edu.au/links/statnlp.html)

German tagger
www.ims.uni-stuttgart.de/~lezius/mosetup.exe
POC: Wolfgang Lezius at IMS, University of Stuttgart
(lezius@ims.uni-stuttgart.de or wolfgang@lezius.de)
Tel.: +49 +711 121-1374
Fax: +49 +711 121-1366

Benchmark
Alembic Workbench annotation tool distribution
(http://www.mitre.org/technology/alembic-workbench/).
Contact John Aberdeen at Mitre Corporation (aberdeen@mitre.org)
http://www.mitre.org/technology/nlp/
voice +1.781.271.2840
fax +1.781.271.2352

OTHER REPLIES
_______________
Fanny MEUNIER <meunier@mait.ucl.ac.be> wrote:
This will probably sound like a very trivial remark to you (but I've
seen so many studies which did not even mention the issue that I think my
reply is worth sending anyway). The degree of 'delicacy' of the tagset
should always be taken into account: the number of main category tags,
of subcategories and of features should always be clearly stated
because it influences further research.
Refined tagsets allow refined searches.

Jakub Zavrel (zavrel@kub.nl) wrote:
I saw your call for criteria for the evaluation of POS taggers on the
corpora list. The few I can think of are all informal, although I know
there has been a formal evaluation project in France under the
direction of someone called Paroubek. A few important features in my
opinion would be:
* accuracy
* trainability
* speed
* text normalization (=tokenization)
* size/granularity of tagset
* possibility to increase the lexicon
I'm sure this is not much new. I'll be conducting a survey on taggers
for Dutch next month, so if you have found any known evaluation
guidelines, I would be glad if you could let me know.
------------------------------------------------------------------------------
Jakub Zavrel, B 330, Tilburg University, POBox 90153, 5000 LE Tilburg,
NL
http://ilk.kub.nl/~zavrel/ tel/fax: +31-13-4663163/3110
------------------------------------------------------------------------------

Ji Donghong wrote:

The evaluation of POS taggers may concern the well-formedness of the
concept of POS itself. If a POS system for a specific language is
well-formed, i.e., based on some definite and objective criteria, its
evaluation will also be well-formed, and easier to perform. Otherwise,
it is somewhat difficult to justify the tagger. The following is a
summary of two queries about POS; it may be helpful.

--------------------------------------------
Kent Ridge Digital Labs
21 Heng Mui Keng Terrace
Singapore, 119613
Email: dhji@krdl.org.sg
Tel: 65-8746380
Fax: 65-7744998

Dear colleagues,

Some time ago, I posed two queries (section 1 in the following summary)
about part-of-speech based on syntactic distribution. I am very thankful
to the researchers listed in section 2, who replied to the queries. The
typical answers are listed in section 3. Some references they mentioned
are listed in section 4. In addition, I present my personal conclusion
about the problem in section 5, just for your information. To help
researchers who are not familiar with Chinese understand more clearly
why I posed the queries, I list one open question, i.e., the first
question in section 6. The other question in section 6 may also be
interesting.

Thank you very much.

With best regards,

Ji Donghong


SUM: WHAT'S BEHIND PART-OF-SPEECH?

1. QUERIES

Query A:

In Chinese, there are fewer affixes for us to classify words into
categories, e.g., nouns, verbs or adjectives, etc., so even up to now,
there has been no information about POS for Chinese words in the most
famous Chinese dictionary, i.e., Modern Chinese Dictionary.
Some linguists proposed that Chinese words be classified as nouns, verbs
and adjectives, etc. completely based on their grammatical distribution,
which they referred to as their ability to combine with other words.

My questions are:

1) Can such grammatical distribution be solely used as a means to
determine POS of words?

2) Are there any similar problems in other languages? How is the
problem solved there?

Query B:

Several days ago, I posed the query "what's behind part-of-speech?". Up
to now, more than 10 researchers have replied to me. I would now like to
pose another query on the topic before presenting a summary:

Q: Is the part-of-speech based on syntactic distribution a
WELL-FORMED concept?

Any comments or information will be highly appreciated.

2. ACKNOWLEDGEMENTS

Adam Kilgarriff
Geoffrey Sampson
Marcia Haag
Philip Resnik
Sun Honglin
Joseph Davis
Christopher Hogan
Frantisek Cermak
Waruno Mahdi
Atro Voutilainen
Rob Freeman
Víctor Vázquez Martínez
Bingfu Lu
Alex Murzaku
Alexis Manaster Ramer
Lua Kim Teng
Earl Herrick
Xu Jie
Guo Jin
Dan Maxwell
Elaine Jones
Anne-Line Graedler
Steven Schaufele
Robin Sackmann

3. ANSWERS

1) Some doubted whether categories such as N, V, ADJ etc. are good
analytic categories for the Chinese language, suggesting that they may
be inappropriate imports from the West.

2) Some pointed out that grammatical distribution or function is the
standard, or primary, way to classify POS. The reasons mentioned include
that the definition is clear and useful, or at least more so than the
alternatives. Some others proposed that, among all syntactic means,
syntactic valency be used to define POS.

3) Some argued that grammatical distribution should not be used to
determine lexical categories. The reasons mentioned include that there
are predicate nouns, attributive verbs, sentential subjects, etc.

4) Some pointed out that it is hardly surprising that grammarians have
had trouble classifying Chinese words into parts of speech. The reason
is that the notion of "part-of-speech" is fraught with difficulties in
linguistics, to the extent that many western linguists since 1900 have
abandoned it altogether (though Chomsky did explicitly reintroduce the
ancient notion in 1957 in his generative grammar).

5) Some replied to the queries indirectly, pointing out that the fact
that POS disambiguation can be done on the basis of linguistically
motivated contextual rules suggests that parts of speech are
syntactically motivated or syntactically definable.

6) Some pointed out that POS is not a particularly well-formed concept,
at least not in the sense that one can define universally accepted,
unambiguous classes: no labelling will be objective and absolute, and
even the classical interpretations are uncertain. The reasons mentioned
include that when you assign POS, you are partitioning a continuum of
association behaviour. Further, they held that for language processing
systems, POS is a misleading concept, and that we are better off
thinking about the continuous reality of syntactic associativity, rather
than trying to label it and pretend it is discrete.

7) Some pointed out that the ultimate criterion for POS should be
meaning. The reasons mentioned include that although syntactic features
are very limited, the number of combinations of these features is, if
not infinite, huge.

8) Some pointed out that, outside of phonetics perhaps, there seems to
be no concept in linguistics which is well-defined enough that, given a
language, we can mechanically identify instances of that concept. They also
pointed of languages, producing a term which refers to fairly (though
not always precisely) well-defined set of entities in that (those)
lg(s), and then the same person or more likely others trying to use the
same term for entities in some other language(s) which SEEM to have
something in common with those in the original language(s).

9) Some pointed out that POS may be taken somewhat for granted by the
linguistics community; linguists come to the task of defining POS with a
en et en siamois", _Bulletin de la Societe de Linguistique de Paris_
46:183--196.

Trnka, Bohumil, 1966, "On the Basic Categories of Syntagmatic
Morphology", _Travaux Linguistiques de Prague_ 2:165-169.

Mahdi, Waruno, 1993, "Distinguishing Homonymic Word Forms in
Indonesian", pp. 181-218 in Ger P. Reesink (ed.) _Topics in Descriptive
Austronesian Linguistics_, Semaian 11. Leiden: Vakgroep Talen en Culturen
van ZO Asien en Oceanie.

Rygaloff, A., 1958, "La classe nominale en chinois:
determine/indetermine", _Bulletin de la Societe de Linguistique de Paris_
53:306-315.

Hinrich Schütze, "Dimensions of Meaning"

Chu, Fa-Kao; "Word classes in classical Chinese"; in Proceedings of the
IXth Congress of linguistics; The Hague 196, p. 594.

Hagege, Claude; "Le probleme linguistique de prepositions et la solution
chinoise"; Louvain, Peeters, 1975.

Sasse, Hans-Jurgen; "Syntactic categories and sub-categories" in J.
Jacobs et al.; "Syntax. Ein internationales Handbuch zeitgenossischer
Forschung", Walter de Gruyter, Berlin, 1994.

1995 On the subject of Malagasy imperatives. Oceanic Linguistics 34:
203-210.

1994 On the origin of the term 'ergative'. Sprachtypologie und
Universalienforschung 47(3): 207-210.

1993 Malagasy and the subject/topic issue. Oceanic Linguistics 31:
267-279.

1992 On intensional vs. extensional grammatical categories. Papers from
the Annual Meeting of the Southeast Asian Linguistics Society (ed. Karen
L. Adams and Thomas John Hudak), 201-212. Tempe, AZ: Arizona State
University Program for Southeast Asian Studies.

What's a topic in the Philippines? Papers from the First Annual Meeting
of the Southeast Asian Linguistics Society (ed. Martha Ratliff and Eric
Schiller), 271-291. Arizona State University Program for Southeast Asian
Studies Monograph Series.

1988 What about Lisu? Languages of the Tibeto-Burman Area 11(2):
133-143.

Karen L. Adams and AMR. Some questions of topic/focus choice in Tagalog.
Oceanic Linguistics 27: 79-101.

James D. McCawley's 1992 paper "Justifying Part-of-Speech Assignments in
Mandarin Chinese", _Journal of Chinese Linguistics_ vol 20, no. 2, pp.
211-245.

Sadock (1990) "Parts of speech in Autolexical Syntax", in McCawley
(1988) The Syntactic Phenomena of English.

Vonen, Arnfinn Muruvik. 1997. Parts of Speech and Linguistic Typology.
Open Classes and Conversion in Russian and Tokelau. (Acta Humaniora No.
22). Oslo: Universitetsforlaget. (ISBN 82-00-12685-4)

Sackmann, Robin, 1996, The problem of "adjectives" in Mandarin Chinese,
in Sackmann, Robin (ed.) Theoretical linguistics and grammatical
description. Amsterdam etc.: John Benjamins Publishing Co. p.257-275.

5. PERSONAL CONCLUSION

My personal conclusion is that POS based on syntactic distribution is
not a well-formed concept. The reasons are that:

1) Non-operable.

For a word of a given language, what is its syntactic distribution? It
seems that there is no clear definition. The most natural model of
the syntactic distribution of a word may be the set of contexts in which
the word can occur; however, we cannot enumerate all of them in any sense.

2) Non-deterministic:

Even if we can select, for whatever reason, a definite set of
distributional evidence, e.g., contexts, functions or co-occurrences,
as criteria to define the POS system for a language, there would still
be a great many possible classes, and a great many possible
classifications of the whole word set. It seems that we have no
principled reason to choose one particular classification among them as
the POS system for the language in question.

3) Non-provable or non-justifiable:

Even if we can select a particular classification as the POS system
for whatever reason, it seems that there is no sense in which we
can say that the selected POS system is correct or incorrect. The deeper
reason for this problem may be that distributional theories about POS
don't care about WHAT (is the part of speech, e.g., nouns, verbs, etc.
of a language?) and only care about HOW (to construct a POS system for a
language?), or at least they equate WHAT and HOW and don't care about
the distinction between them. Thus it may be difficult for us to justify
a POS system for a language, or to compare different POS systems for a
language in a meaningful sense.

6. OPEN QUESTIONS

1) Suppose that we are given a language which is just like English but
without any affixes, e.g., -ment, -ing, -ed, -tion, -sion, etc., so that
the following are all possible phrases in the language: make develop;
develop country; develop product, etc. Now the problem is: how do we
determine the distribution-based POS system for the language? (The case
is roughly like that in Chinese.)

2) If POS based on distribution is not well-formed, what possible
influences can the non-well-formedness have on the syntactic theories
built based on POS?
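
As a rough illustration of open question 1 above, the following Python sketch collects, for each word, the set of (left neighbour, right neighbour) contexts it occurs in. This is only one possible reading of "distribution"; the function name `context_profiles` and the boundary markers `<s>` and `</s>` are assumptions made for the example, not part of the original discussion.

```python
from collections import defaultdict


def context_profiles(phrases):
    """Map each word to the set of (left, right) neighbour pairs it occurs in.

    Phrase boundaries are marked with the hypothetical symbols <s> and </s>.
    """
    profiles = defaultdict(set)
    for phrase in phrases:
        words = ["<s>"] + phrase.split() + ["</s>"]
        for i in range(1, len(words) - 1):
            profiles[words[i]].add((words[i - 1], words[i + 1]))
    return dict(profiles)


# The three affix-free phrases from open question 1.
phrases = ["make develop", "develop country", "develop product"]
profiles = context_profiles(phrases)

for word, contexts in sorted(profiles.items()):
    print(word, sorted(contexts))
```

On this tiny sample, "country" and "product" have identical context sets, while "develop" appears in three different contexts, so a purely distributional grouping has to decide whether "develop" belongs to one class or several.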

PAST SUMMARIES
________________
Thanks to all who supplied information on the evaluation of taggers.
Here is a summary of the replies, and some comments, from Andrew Harley:

Last year, we carried out a test on 4 taggers: the Prospero "Parser"
(telephone Mike Oakes on 0181-741-8531 for details), one from John
Carroll at Brighton University, an old ACQUILEX tagger written by David
Elworthy at Cambridge University, and our internal sense tagger. No
ambiguous or unknown tags were permitted, punctuation tags were
certainly not counted (unlike some other scores given in the
literature!), and we had strict rules about coding participles as
attributive adjectives if that was the function they were performing in
the sentence. This is rather unfair on the taggers but reflected the
results that we wanted for our corpora. The accuracy rates on a 4000
word sample were low, ranging from 87% to 90% (for approximately 50
tags), Prospero coming out top.
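
The scoring convention described above (one tag per token, punctuation left out of the count) can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used in the test; the function name `tagging_accuracy`, the punctuation check and the toy tags are assumptions.

```python
import string


def tagging_accuracy(gold, predicted, skip_punctuation=True):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    gold and predicted are lists of (token, tag) pairs over the same tokens.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    correct = total = 0
    for (g_tok, g_tag), (p_tok, p_tag) in zip(gold, predicted):
        if g_tok != p_tok:
            raise ValueError(f"token mismatch: {g_tok!r} vs {p_tok!r}")
        # Optionally leave punctuation out of the score: it is usually
        # tagged correctly and so inflates the result.
        if skip_punctuation and all(ch in string.punctuation for ch in g_tok):
            continue
        total += 1
        correct += (g_tag == p_tag)
    return correct / total if total else 0.0


# Toy data with a made-up tagset.
gold = [("The", "DET"), ("dog", "N"), ("barks", "V"), (".", "PUNC")]
pred = [("The", "DET"), ("dog", "V"), ("barks", "V"), (".", "PUNC")]

print(tagging_accuracy(gold, pred))                          # 2 of 3 words correct
print(tagging_accuracy(gold, pred, skip_punctuation=False))  # 3 of 4 tokens correct
```

The same tagger output scores lower once punctuation is excluded, which is why published figures are hard to compare unless the counting convention is stated.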

Jochen Leidner <leidner@linguistik.uni-erlangen.de> considered this a
serious issue, and provided lots of helpful information. Being unaware
of systematic studies in the field, he himself set out to undertake just
such an analysis. The technical report that contains the tagged data is
available from
<URL:ftp://ftp.linguistik.uni-erlangen.de/pub/reports/CLUE-TR-971101.ps.gz>
and the data files from
<URL:ftp://ftp.linguistik.uni-erlangen.de/pub/reports/CLUE-TR-971101-data.tar.gz>

Philip Bralich <bralich@hawaii.edu> agreed there were very few studies,
suggesting only the MUC conferences at
<http://cs.nyu.edu/cs/faculty/grishman/muc6.html>

Eric Atwell <eric@scs.leeds.ac.uk> has *nearly* finished a paper
comparing accuracy rates etc, for submission to Computer Speech and
Language (special issue on evaluation). His gut feeling is that there's
little difference in accuracy, with most taggers scoring about 90-95%
depending on tagset, language genre, and other application-dependent
factors. He recommends not his tagger but (i) the English Constraint
Grammar tagger/semiparser at Helsinki, which in addition to PoS
categories marks subject, object, and some dependency relations; and
(ii) Alex Fang's AUTASYS tagger and ICE parser, which adds PoS tags and
full parse-trees according to the ICE markup scheme. However, this isn't
really based on "official tests", just personal assessments...

Klas Prytz <klas.prytz@ling.uu.se> has done some evaluation of the
ENGlish Constraint Grammar (ENGCG) and the recall seems quite high but
precision is much lower. No official paper yet.
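
For a tagger that may leave several candidate tags on a token, recall and precision are commonly computed as in the Python sketch below: recall is the share of tokens whose correct tag survives among the returned tags, and precision is the share of returned tags that are correct. This is an illustration under those assumed definitions, not a description of the evaluation mentioned above; the function name and toy data are hypothetical.

```python
def recall_precision(gold_tags, predicted_tag_sets):
    """Score ambiguous tagger output.

    gold_tags: one correct tag per token.
    predicted_tag_sets: a set of one or more candidate tags per token.
    """
    if len(gold_tags) != len(predicted_tag_sets):
        raise ValueError("inputs must be aligned token by token")
    # A token counts as a hit if its correct tag is among the returned tags.
    hits = sum(1 for g, p in zip(gold_tags, predicted_tag_sets) if g in p)
    returned = sum(len(p) for p in predicted_tag_sets)
    recall = hits / len(gold_tags) if gold_tags else 0.0
    precision = hits / returned if returned else 0.0
    return recall, precision


# Toy data: two of the four tokens are left ambiguous.
gold = ["DET", "N", "V", "ADV"]
pred = [{"DET"}, {"N", "V"}, {"V"}, {"ADJ", "ADV"}]

recall, precision = recall_precision(gold, pred)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Leaving ambiguity unresolved keeps recall high at the cost of precision, so a single accuracy figure from a fully disambiguating tagger is not directly comparable with either number.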

Djoerd Hiemstra <hiemstra@cs.utwente.nl> reported that Martin Rajman
<rajman@lia.di.epfl.ch> of EPFL (Swiss Federal Institute of Technology
in Lausanne, Switzerland) is working on a large scale comparison of
taggers and parsers for POS-tagging, which he thinks is to be published
next January.

Leidner also comments that the topic of evaluating the accuracy of
taggers and parsers is very difficult, because there is a lot of
diversity wrt tagset size (some tagsets are rather crude, others include
subcategorization information or even semantic subclasses), so n%
correctness using tagset A is perhaps still worse than (n-1)%
correctness using a more detailed tagset B. The AMALGAM project at
ULeeds is concerned with mapping different annotation models
<http://www.scs.leeds.ac.uk/amalgam/amalgam/amalgsoft.html>.
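
One way to make scores over different tagsets more comparable is to map the fine-grained tags onto a shared coarse tagset before scoring. The Python sketch below illustrates the idea; the mapping table `FINE_TO_COARSE` and the tag names are invented for the example and are not taken from AMALGAM or from any particular tagset.

```python
# Hypothetical mapping from a fine-grained tagset to a coarse one.
FINE_TO_COARSE = {
    "NN": "N", "NNS": "N",
    "VB": "V", "VBD": "V", "VBZ": "V",
    "JJ": "ADJ",
}


def accuracy(gold, predicted):
    """Fraction of positions where the two tag sequences agree."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def coarsen(tags, mapping=FINE_TO_COARSE):
    """Map each tag to its coarse class; unknown tags are kept unchanged."""
    return [mapping.get(tag, tag) for tag in tags]


# Toy data: the tagger confuses singular and plural nouns once.
gold_fine = ["NNS", "VBD", "JJ", "NN"]
pred_fine = ["NN", "VBD", "JJ", "NN"]

fine_acc = accuracy(gold_fine, pred_fine)
coarse_acc = accuracy(coarsen(gold_fine), coarsen(pred_fine))
print(f"fine-grained: {fine_acc:.2f}, coarse: {coarse_acc:.2f}")
```

The same output scores 0.75 on the fine tagset and 1.00 after coarsening, which is the effect described above: a lower figure on a detailed tagset may reflect a more informative tagger.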

The question of speed is usually not properly addressed in the
literature because in most cases no detailed information about the
hardware is given (specINT95, memory size, user mode, ...). Dimitrios
Kokkinakis <svedk@svenska.gu.se> reported that Cooke's semanTag on
Swedish is 9 times faster than the Brill tagger.
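
A speed figure is easier to interpret when it is reported together with a description of the machine it was measured on. The Python sketch below times a tagging function and records basic platform details using only the standard library; `benchmark` and `dummy_tagger` are hypothetical names, and the dummy tagger merely stands in for a real one.

```python
import platform
import time


def benchmark(tag_function, tokens, repeats=3):
    """Time tag_function on tokens and report throughput with platform details."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tag_function(tokens)
        best = min(best, time.perf_counter() - start)
    return {
        "tokens": len(tokens),
        "seconds": best,
        "tokens_per_second": len(tokens) / best if best > 0 else float("inf"),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "system": platform.platform(),
        "python": platform.python_version(),
    }


def dummy_tagger(tokens):
    """Stand-in tagger that labels every token as a noun."""
    return [(token, "N") for token in tokens]


report = benchmark(dummy_tagger, ["word"] * 100_000)
for key, value in report.items():
    print(f"{key}: {value}")
```

Taking the best of several runs reduces the influence of other processes on the machine; memory size and system load are not captured here and would need to be noted separately.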

SOME WEB POINTERS (Jochen Leidner)
=================
At http://www.ling.helsinki.fi/~avoutila/cg/index.html you can test
EngCG-2, IMHO a high-quality, rule-based parser (by Lingsoft).
The BRILL-TAGGER is available via FTP at
blaze.cs.jhu.edu/pub/brill/Programs
The XEROX-TAGGER is available via anonymous FTP at
parcftp.xerox.com:/pub/tagger/
For morphological analysis, you can download either PC-KIMMO 2
from ftp://ftp.sil.org/software/unix/ or Malaga from
http://www.linguistik.uni-erlangen.de/Malaga.en.html (both without
ling. descriptions).
For info on the "AD ENGLISH LEMMATIZER" contact Bruno Maximilian
Schulze (IMS Stuttgart) <schulze@ims.uni-stuttgart.de>
The ENGTWOL Tagger and lemmatizer can also be bought from Lingsoft,
Helsinki.

EVALUATION OF MORPHOLOGICAL ANALYZERS, TAGGERS AND PARSERS (Jochen Leidner)
==========================================================
Tapanainen, Pasi and Atro Voutilainen, "Tagging accurately - Don't
guess if you know" In the proceedings of the Fourth Conference on
Applied Natural Language Processing (ANLP'94). pp.47-52.
Stuttgart, Germany, 1994.
Samuelsson, Christer and Atro Voutilainen, "Comparing a Linguistic and
a Stochastic Tagger." In Proceedings of the 35th Annual Meeting of
the Association for Computational Linguistics, pp. 246-253, ACL,
1997. [Also available online as cmp-lg/9706005.]
E. Black, S. Abney, D. Flickenger, C. Gdaniec, R. Grishman,
P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans,
M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T.
Strzalkowski. A procedure for quantitatively comparing the syntactic
coverage of English grammars. In Defense Advanced Research Projects
Agency: Proceedings of the Fourth DARPA Speech and Natural Language
Workshop, Pacific Grove, California, February 1991. Morgan Kaufmann.
P. Harrison, S. Abney, E. Black, D. Flickenger, C. Gdaniec,
R. Grishman, D. Hindle, R. Ingria, M. Marcus, B. Santorini, and
T. Strzalkowski. Evaluating syntax performance of parser/grammars of
English. In Proceedings of the Workshop On Evaluating Natural
Language Processing Systems. Association For Computational
Linguistics, 1991.
Hausser, Roland (ed.): The coordinator's final report on the first
Morpholympics. LDV-Forum, 11(1):54--64, 1994. available via
<mailto:rrh@linguistik.uni-erlangen.de>
Cole, Ronald A. (ed.): Survey of the State of the Art in Human
Language Technology, Chapter 13, e.g. at
http://www.kgw.tu-berlin.de/~mengel/SpeechTech/ch13node6.html#SECTION134
Karlsson, F. et al (eds) (1995): Constraint Grammar, esp p269-83, pp359

Andrew Harley
Systems Manager - ELT Reference
Cambridge University Press

Direct line: (01223)325880