Re: Corpora: Summary of POS tagger evaluation

Ted E. Dunning (ted@aptex.com)
Tue, 9 Feb 1999 15:43:30 -0800 (PST)

I would like to underscore and amplify some of Phil's comments about
"a perplexity-like metric" for scoring POS taggers.

The basic issue is that most scoring systems cannot give partial
credit. It is, of course, best if a system can correctly say that some
tag T is *definitely* the tag to be applied to word W. In the
abstract, however, it is better for a system to say that one of two
tags T_1 and T_2 applies to word W than for the system to say that one
of ten tags T_1 through T_10 should be applied. In some applications,
such a soft metric is of little relevance since the entire
architecture has been designed around hard choices, but if we are
developing POS taggers in isolation and if we are trying to assess the
progress made by a known imperfect system, then there is much to be
said for being able to assess partial credit as a milestone on the
road from ignorance to perfect knowledge.
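
As a rough illustration of the two-tags-versus-ten-tags point, here is a
minimal Python sketch. It assumes (my assumption, not anything Phil
specified) that a tagger which offers k candidate tags is spreading its
probability evenly over them, and it charges log2(k) bits when the true
tag is among the candidates. The names bits_lost and perplexity are
made up for this example.

```python
import math

def bits_lost(candidate_tags, true_tag):
    """Log-loss in bits when probability is spread evenly over candidate_tags."""
    if true_tag not in candidate_tags:
        # Zero probability was placed on the truth: the penalty is unbounded.
        return math.inf
    return math.log2(len(candidate_tags))

def perplexity(losses):
    """2 ** (mean log-loss in bits) over a sequence of tagging decisions."""
    return 2 ** (sum(losses) / len(losses))

# One definite tag, one of two tags, one of ten tags (tag names are arbitrary).
one = bits_lost({"NN"}, "NN")
two = bits_lost({"NN", "VB"}, "NN")
ten = bits_lost({f"T_{i}" for i in range(1, 11)}, "T_3")
```

Under these assumptions the definite answer loses 0 bits, the two-way
answer loses 1 bit and the ten-way answer loses log2(10) bits, so the
ordering matches the intuition above. The perplexity is just the
effective number of tags the system is still choosing among.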

There is, in fact, much to be said on this topic from a theoretical
point of view as well. If we assign partial credit correctly, then we
can have strong mathematical guarantees that our score will be
maximized if and only if we have extracted all possible deterministic
behavior from the problem under analysis and that we understand the
behavioral residue as well as is possible. The "perplexity-like"
score that Phil refers to is exactly the theoretically optimum method
for assigning partial credit.
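
The guarantee can be checked numerically on a toy case. The sketch below
assumes a hypothetical word whose true tag follows a made-up
distribution p over three tags, and compares the expected log-loss
(cross-entropy) of a tagger that reports p honestly with one that is
overconfident and one that is needlessly vague. The numbers are
invented for illustration only.

```python
import math

def cross_entropy(p, q):
    """Expected log-loss in bits when the truth follows p and the tagger reports q.

    Assumes q is strictly positive wherever p is; otherwise the loss is unbounded
    and math.log2 would raise ValueError.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true distribution over three tags for some ambiguous word.
p = [0.7, 0.2, 0.1]

honest = cross_entropy(p, p)                        # equals the entropy of p
overconfident = cross_entropy(p, [0.98, 0.01, 0.01])
vague = cross_entropy(p, [1 / 3, 1 / 3, 1 / 3])
```

In this example the honest report has the smallest expected loss, and
both claiming too much certainty and claiming too little are penalized.
That is the sense in which the score is optimal only when the
remaining uncertainty is described as well as it can be.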

There are reasonable arguments against using perplexity or a related
measure as a figure of merit for a POS tagger. Here are the
arguments against such a figure of merit that I hear most often
(stated from the point of view of the antagonist):

a) perplexity and related entropy measures depend on abstruse
mathematical arguments which I don't personally understand or trust.
I don't care how well a system guesses and I don't care if the system
always gives the correct answer as its second choice. I want the
tagger to be decisive and give me the right answer.

b) POS tagging is a completely deterministic process and there is no
residue of unpredictability. Thus, a hard and fast scoring system
with no partial credit will also give a maximum score to a perfect
tagger.

c) the system I am putting this tagger into requires hard tagging so
an evaluation which gives no partial credit makes sense in my
application.

My typical reply to (a) is that lack of familiarity with entropic
concepts is generally curable :-). Starting treatment early is
essential, however. I generally follow this argument with an offer to
discuss how entropic measures relate to common sense situations such
as betting on outcomes.

To (b), I simply repeat the Bayesian dogma that probability can either
express uncertainty in a physical sense (i.e. randomness) or
uncertainty in the sense of lack of certitude. Assigning
probabilities to tags may simply reflect the established fact
that we cannot get good agreement between different judges or even
with a single judge over time.

To (c) I reply that having a metric which gives us a continuous and
relatively smooth measure from the present to perfection allows us to
start hill climbing using simple techniques. Having a sharp-edged
metric which places discontinuities between us and our goal
requires that we use the algorithmic equivalent of mountaineering
instead of simple hill climbing.
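
To make the hill-climbing point concrete, here is a small Python sketch
using a toy tagger of my own invention: a single parameter w, with the
probability given to the correct tag taken to be a logistic function of
w. Nothing here is meant to resemble a real tagger; it only shows how
the two kinds of score respond to small improvements.

```python
import math

def p_correct(w):
    """Toy one-parameter tagger: probability assigned to the correct tag."""
    return 1.0 / (1.0 + math.exp(-w))

def hard_score(w):
    """No partial credit: 1 if the correct tag is the top choice, else 0."""
    return 1.0 if p_correct(w) > 0.5 else 0.0

def soft_score(w):
    """Log score in bits (0 is perfect, more negative is worse)."""
    return math.log2(p_correct(w))

# Three successive settings of w, each a little better than the last,
# but all still on the wrong side of the decision boundary at w = 0.
steps = [-3.0, -2.0, -1.0]
hard = [hard_score(w) for w in steps]
soft = [soft_score(w) for w in steps]
```

The hard score is zero at all three settings, so it gives no hint that
the changes are going in the right direction. The soft score rises at
every step, which is what a simple hill climber needs.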

It is my experience that none of these arguments is particularly
effective at first, but that presenting examples of the benefits
taken from real data generally results in people using these
measures.