Re: Corpora: Summary of POS tagger evaluation

Thorsten Brants (thorsten@CoLi.Uni-SB.DE)
Tue, 9 Feb 1999 17:44:38 +0100 (MET) wrote:
> This raises an issue which is slightly more complex: if you exclude
> punctuation (presumably on the grounds that a comma is always tagged
> as `comma' and there is no ambiguity), why include other unambiguous
> tokens in the scoring? If `the' always gets assigned `DET', and no
> other tags for it are possible, then why count it and not the comma?

One reason for _not_ excluding unambiguous words is sparse data: how do
you know that a word is unambiguous? The mere fact that it has only one
tag in the lexicon is not sufficient, because the correct tag may not be
listed.
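
To make the sparse-data point concrete, here is a minimal sketch (not
taken from any particular tagger) that builds a lexicon from tagged
training tokens and then lists test tokens that look unambiguous in that
lexicon but carry a different gold tag. The function names, the tag
labels and the toy data are all made up for illustration.

```python
from collections import defaultdict

def build_lexicon(tagged_tokens):
    """Map each word to the set of tags observed for it in training data."""
    lexicon = defaultdict(set)
    for word, tag in tagged_tokens:
        lexicon[word].add(tag)
    return lexicon

def falsely_unambiguous(lexicon, gold_tokens):
    """Return (word, gold_tag) pairs where the lexicon lists exactly one
    tag for the word, but that tag is not the gold tag."""
    misses = []
    for word, gold_tag in gold_tokens:
        tags = lexicon.get(word)  # .get() avoids adding unseen words
        if tags is not None and len(tags) == 1 and gold_tag not in tags:
            misses.append((word, gold_tag))
    return misses

# Toy data, invented for this example (hypothetical tag labels).
train = [("the", "DET"), ("can", "MD"), ("run", "VB"), ("the", "DET")]
test = [("the", "DET"), ("can", "NN"), ("run", "NN")]

lexicon = build_lexicon(train)
misses = falsely_unambiguous(lexicon, test)
print(misses)
```

In this toy case `can' and `run' each have a single tag in the lexicon,
yet the test data uses them with a tag the lexicon never saw, so treating
them as unambiguous and leaving them out of the scoring would hide real
errors.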

If you exclude unambiguous words from scoring, you really would need two
different accuracy results in order to describe the performance of a
tagger: one for ambiguous words, the other one for ``unambiguous''