Corpora: Summary of POS tagger evaluation

Dear netters,

The following is the summary of POS tagger evaluation.
Thank you to those who replied. Most of the replies
referred me to papers. I also include recommendations and
a past summary related to my query.

Dear netters,

I'm now evaluating several POS taggers for English and some other
European languages.

I'd like to know if there are any guidelines or approaches to evaluate
the performance of a POS tagger, and comparisons among POS taggers,
such as accuracy.

Any recommendations are welcome. I'll post a summary later.

Evaluation of French Taggers
"GRACE 1"
POC: Patrick Paroubek

Anne Schiller and Simone Teufel have been working on EAGLES guidelines.

Statistical Natural Language Processing and Corpus-based Computational
Linguistics: An Annotated List of Resources

German tagger
POC: Wolfgang Lezius at IMS, University of Stuttgart
Tel.: +49 +711 121-1374
Fax: +49 +711 121-1366

Alembic Workbench annotation tool distribution
Contact John Aberdeen at Mitre Corporation
voice +1.781.271.2840
fax +1.781.271.2352

Fanny MEUNIER wrote:
This will probably sound like a very trivial remark to you (but I've
seen so many studies which did not even mention the issue that I think my
reply is worth sending anyway). The degree of 'delicacy' of the tagset
should always be taken into account: the number of main category tags,
of subcategories and features should always be clearly referred to
because it influences further research.
Refined tagsets allow refined searches.

Jakub Zavrel wrote:
I saw your call for criteria for the evaluation of POS taggers on the
corpora list. A few I can think of are all informal, although I know
there has been a formal evaluation project in France under the
direction of someone called Paroubek. A few important features in my
opinion would be:
* accuracy
* trainability
* speed
* text normalization (=tokenization)
* size/granularity of tagset
* possibility to increase the lexicon
I'm sure this is not much new. I'll be conducting a survey on taggers
for Dutch next month, so if you have found any known evaluation
guidelines, I would be glad if you could let me know.

Ji Donghong wrote:

The evaluation of POS taggers may concern the well-formedness of the
concept of POS itself. If a POS system for a specific language is
well-formed, i.e., based on some definite and objective criteria, its
evaluation will also be well-formed, and easier to perform. Otherwise, it
is somewhat difficult to justify the tagger. The following is a summary
of two questions about POS; it may be helpful.

Dear colleagues,

Some time ago, I posed two queries (section 1 in the following summary)
about part-of-speech based on syntactic distribution. I am very thankful
to the researchers listed in section 2, who replied to the queries. The
typical answers are listed in section 3. Some references they mentioned
are listed in section 4. In addition, I present my personal conclusion
about the problem in section 5 just for your information. To help
researchers who are not familiar with Chinese understand more clearly why
I posed the queries, I list one open question, i.e., the first question
in section 6. The other question in section 6 may also be

Thank you very much.

With best regards,

Ji Donghong

Query A:

In Chinese, there are fewer affixes for us to classify words into
categories, e.g., nouns, verbs or adjectives, etc., so even up to now,
there has been no information about POS for Chinese words in the most
famous Chinese dictionary, i.e., Modern Chinese Dictionary.
Some linguists proposed that Chinese words be classified as nouns, verbs
and adjectives, etc. completely based on their grammatical distribution,
which they referred to as their ability to combine with other words.

My questions are:

1) Can such grammatical distribution be solely used as a means to
determine POS of words?

2) Are there any similar problems in other languages? How is the
problem solved there?

Query B:

Several days ago, I posed a query, "what's behind part-of-speech?". Up to
now, more than 10 researchers have replied to me. Now I would like to pose
another query on the topic before presenting a summary:

Q: Is the part-of-speech based on syntactic distribution a
WELL-FORMED concept?

Any comments or information will be highly appreciated.


1) Some doubted whether categories such as N, V, ADJ etc. are good
analytic categories for Chinese language, and that they may be
inappropriate imports from the West.

2) Some pointed out that grammatical distribution or functions are the
standard, or primary, way to classify POS. The reasons mentioned include
that the definition is clear and useful, or at least more so than the
alternatives. Some others proposed that, among all syntactic means,
syntactic valency be used to define POS.

3) Some argued that grammatical distribution should not be used to
determine lexical categories. The reasons mentioned include that there
are predicate nouns, attributive verbs, sentential subjects, etc.

4) Some pointed out that it is hardly surprising that grammarians have
had trouble classifying Chinese words into parts of speech. The reason is
that the notion of "part-of-speech" is fraught with difficulties in
linguistics, to the extent that many western linguists since 1900 have
abandoned it altogether (though Chomsky did explicitly reintroduce the
ancient notion in 1957 in his generative grammar).

5) Some replied to the queries indirectly, pointing out that the fact
that POS disambiguation can be done on the basis of linguistically
motivated contextual rules suggests that parts of speech are
syntactically motivated or syntactically definable.

6) Some pointed out that POS is not a particularly well-formed concept,
at least not in the sense that you can define universally accepted,
unambiguous classes: no labelling will be objective and absolute, and
even the classical interpretations are uncertain. The reasons mentioned
include that when
you assign POS, you are partitioning a continuum of association
behaviour. Further, they held that for language processing systems, POS
is a misleading concept, and that we are better off thinking about the
continuous reality of syntactic associativity, rather than trying to
label it and pretend it is discrete.

7) Some pointed out that the ultimate criterion for POS should be meaning. The
reasons mentioned include that although syntactic features are very
limited, the combination of these features is, if not infinite, a huge

8) Some pointed out that, outside of phonetics perhaps, there seems to be
no concept in linguistics which is well-defined enough that, given a
language, we can mechanically identify instances of that concept. They also
pointed of languages, producing a term which refers to fairly (though
not always precisely) well-defined set of entities in that (those)
lg(s), and then the same person or more likely others trying to use the
same term for entities in some other language(s) which SEEM to have
something in common with those in the original language(s).

9) Some pointed out that POS may be taken somewhat for granted by the
linguistics community; linguists come to the task of defining POS with a

My personal conclusion is that POS based on syntactic distribution is
not a well-formed concept. The reasons are that it is:

1) Non-operable:

For a word of a given language, what is its syntactic distribution? It
seems that there is no clear definition. The most natural model of
the syntactic distribution of a word may be the set of contexts in which
the word can occur; however, we cannot list them all in any sense.

2) Non-deterministic:

Even if we can select, for whatever reasons, a definite set of
distributional evidence, e.g., contexts, functions or co-occurrences,
as criteria to define the POS system for a language, there will exist
a great many classes, and a great many classifications of the whole word
set. It seems that we have no principled reason to choose one particular
classification among them as the POS system for the language considered.

3) Non-provable or non-justifiable:

Even if we can select a particular classification as the POS system
based on whatever reasons, it seems that there is no sense in which we
can say that the selected POS system is correct or incorrect. The deeper
reason for this problem may be that distributional theories about POS
don't care about WHAT (is the part of speech, e.g., nouns, verbs, etc.
of a language?), but only about HOW (to construct a POS system for a
language?), or at least they equate WHAT and HOW and don't care about
the distinction between them. Thus it may be difficult for us to justify
a POS system for a language, or compare different POS systems for a
language in a significant sense.


1) Suppose that we are given a language which is just like English,
but without any affixes, e.g., -ment, -ing, -ed, -tion, -sion, etc.
So the following are all possible phrases in the language: make develop;
develop country; develop product, etc. Now the problem is: how do we
determine the distribution-based POS system for the language? (The case
is roughly like that in Chinese.)

2) If POS based on distribution is not well-formed, what possible
influences can the non-well-formedness have on the syntactic theories
built on POS?

Thanks to all who supplied information on the evaluation of taggers.
Here is a summary of the replies, and some comments, from Andrew Harley:

Last year, we carried out a test on 4 taggers: the Prospero "Parser"
(telephone Mike Oakes on 0181-741-8531 for details), one from John
at Brighton University, an old ACQUILEX tagger written by David Elworthy
at Cambridge University, and our internal sense tagger. No ambiguous or
unknown tags were permitted, punctuation tags were certainly not counted
(unlike some other scores given in the literature!), and we had strict
rules about coding participles as attributive adjectives if that was the
function they were performing in the sentence. This is rather unfair on
the taggers but reflected the results that we wanted for our corpora. The
accuracy rates on a 4000 word sample were low, ranging from 87% to 90%
(with approximately 50 tags), Prospero coming out top.
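
As an illustration of how much the scoring convention matters, here is a
minimal sketch of a per-token accuracy score with and without punctuation
tokens. It is not the procedure used in the test above: the function name,
the tag names and the PUNCT_TAGS set are all assumptions made up for the
example.

```python
from typing import FrozenSet, List

# Hypothetical punctuation tags; a real tagset defines its own.
PUNCT_TAGS = frozenset({".", ",", ":", "PUNCT"})


def tagging_accuracy(gold: List[str],
                     predicted: List[str],
                     skip_tags: FrozenSet[str] = frozenset()) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag.

    Tokens whose gold tag is in skip_tags are left out of the score.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    scored = [(g, p) for g, p in zip(gold, predicted) if g not in skip_tags]
    if not scored:
        raise ValueError("no tokens left to score")
    correct = sum(1 for g, p in scored if g == p)
    return correct / len(scored)


# Toy data: one error on a content word, punctuation tagged correctly.
gold = ["DT", "NN", "VBZ", "JJ", "."]
pred = ["DT", "NN", "VBZ", "NN", "."]

with_punct = tagging_accuracy(gold, pred)                 # 4 of 5 tokens
without_punct = tagging_accuracy(gold, pred, PUNCT_TAGS)  # 3 of 4 tokens
print(with_punct, without_punct)
```

Because punctuation is almost always tagged correctly, counting it tends
to raise the reported figure, so two scores are only comparable if they
state which convention was used.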

Jochen Leidner considered this a
serious issue, and provided lots of helpful information. Being unaware of
systematic studies in the field, he himself set out to undertake just
such an analysis. The technical report that contains the tagged data is
available from
and the data files from

Philip Bralich agreed there were very few studies,
suggesting only the MUC conferences at

Eric Atwell has *nearly* finished a paper on
accuracy rates etc., for submission to Computer Speech and Language
(special issue on evaluation). His gut feeling is that there's little
difference in accuracy: most work at about 90-95% depending on tagset,
language genre, and application-dependent factors. He recommends not his
own tagger but (i) the English Constraint Grammar tagger/semiparser at
Helsinki, which in addition to PoS categories marks subject, object, and
some dependency relations; (ii) Alex Fang's AUTASYS tagger and ICE
parser, which adds PoS tags and full parse-trees according to the ICE
markup scheme. However, this isn't really based on "official tests", just
personal assessments...

Klas Prytz has done some evaluation of the English
Constraint Grammar (ENGCG) and the recall seems quite high but precision
much lower. No official paper yet.
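
Recall and precision only differ from plain accuracy when a tagger is
allowed to leave more than one tag on a token. The sketch below uses one
common definition (recall = tokens whose gold tag survives in the output,
precision = surviving gold tags divided by all tags emitted). It is an
assumption that this matches the evaluation mentioned above; the function
name and the toy data are invented for the example.

```python
from typing import List, Set, Tuple


def ambiguity_precision_recall(gold: List[str],
                               predicted: List[Set[str]]) -> Tuple[float, float]:
    """Score a tagger that may output a set of candidate tags per token."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    emitted = sum(len(tags) for tags in predicted)
    if not gold or emitted == 0:
        raise ValueError("nothing to score")
    # A token counts as a hit if the gold tag is among the tags kept.
    hits = sum(1 for g, tags in zip(gold, predicted) if g in tags)
    recall = hits / len(gold)
    precision = hits / emitted
    return precision, recall


# Toy data: the gold tag is always kept, but two tokens stay ambiguous.
gold = ["DT", "NN", "VBZ"]
predicted = [{"DT"}, {"NN", "VB"}, {"VBZ", "NNS"}]

precision, recall = ambiguity_precision_recall(gold, predicted)
print(precision, recall)  # 3 hits over 5 emitted tags, 3 hits over 3 tokens
```

A tagger that rarely discards the correct reading but leaves much
ambiguity behind will show exactly the pattern reported: high recall,
lower precision.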

Djoerd Hiemstra reported that Martin Rajman
of EPFL (Swiss Federal Institute of Technology,
Lausanne, Switzerland) is working on a large-scale comparison of taggers
and parsers for POS tagging, which he thinks is to be published next

Leidner also comments that the topic of evaluating the accuracy of taggers
and parsers is very difficult, because there is a lot of diversity wrt
tagset size (some tagsets are rather crude, others include
subcategorization information or even semantic subclasses), so n%
correctness using tagset A is perhaps still worse than (n-1)% correctness
using a more detailed tagset B. The AMALGAM project at ULeeds is concerned
with mapping different annotation models
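
To make the tagset-size point concrete, here is a minimal sketch that
scores the same output twice: once on a fine-grained tagset and once after
collapsing both gold and predicted tags to coarse classes. The COARSE
mapping is a made-up example, not the AMALGAM mapping or any published
one.

```python
from typing import Dict, List

# Hypothetical mapping from fine-grained tags to coarse word classes.
COARSE: Dict[str, str] = {
    "NN": "N", "NNS": "N",
    "VB": "V", "VBZ": "V", "VBD": "V",
    "JJ": "ADJ",
}


def to_coarse(tags: List[str], mapping: Dict[str, str]) -> List[str]:
    """Map each tag to its coarse class; unknown tags are kept unchanged."""
    return [mapping.get(tag, tag) for tag in tags]


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Plain per-token accuracy over aligned tag sequences."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two aligned, non-empty tag sequences")
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Toy data: both errors are within the same coarse class.
gold_fine = ["NN", "VBZ", "JJ", "NNS"]
pred_fine = ["NN", "VBD", "JJ", "NN"]

fine_acc = accuracy(gold_fine, pred_fine)
coarse_acc = accuracy(to_coarse(gold_fine, COARSE),
                      to_coarse(pred_fine, COARSE))
print(fine_acc, coarse_acc)  # 2 of 4 correct, then 4 of 4 correct
```

The same output scores very differently depending on tagset granularity,
which is why a figure quoted without the tagset size says little.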

The question of speed is usually not properly addressed in the literature,
because in most cases no detailed information about the hardware is given
(specINT95, memory size, user mode, ...). Dimitrios Kokkinakis
reported that Cooke's semanTag on Swedish is 9
faster than the Brill tagger.

SOME WEB POINTERS (Jochen Leidner)
At you can test
EngCG-2, IMHO a high-quality, rule-based parser (by Lingsoft).
The BRILL-TAGGER is available via FTP at
The XEROX-TAGGER is available via anonymous FTP at
For morphological analysis, you can download either PC-KIMMO 2
from or Malaga from (both without
ling. descriptions).
For info on the "AD ENGLISH LEMMATIZER" contact Bruno Maximilian
Schulze (IMS Stuttgart)
The ENGTWOL Tagger and lemmatizer can be also bought from Lingsoft,

Andrew Harley
Systems Manager - ELT Reference
Cambridge University Press

Direct line: (01223)325880