SUMMARY: Statistic for inter-group comparison on categorization tasks

Philip Resnik - Sun Microsystems Labs BOS (presnik@caesar.East.Sun.COM)
Tue, 18 Jun 1996 10:54:48 -0400

I got a number of responses to my query, and let me take a moment to
thank Jeff Adams, Becky Bruce, Jean Carletta, Ellen Hertz, Christer
Johansson, Stephen Johnson, Ted Pedersen, Henry Thompson, and Rich
Ulrich for taking the time to assist me and in some cases to discuss
the question at length.

Let me also take a moment to mention one of the most interesting
resources I was pointed to: a newsgroup called sci.stat.consult.

Recall, briefly, that my question concerned an experiment that could
be abstracted as having a set of patients, a set of diagnoses, a set
J1 of doctors, and a set J2 of nurses, where the doctors and nurses
all classified each patient into one diagnostic category and the goal
of the experiment was to assess the extent to which nurses' diagnoses
were similar to those of the doctors. (Note that the use of doctors
and nurses in my example bears no resemblance to the actual experiment
in question; a biostatistician on sci.stat.consult missed that point
and sent me a rather indignant message concerning hospitals' attempts
to reduce costs by having nurses do diagnoses...)

Several suggestions involved some variant on computing kappa for J1+J2
and comparing it with kappa for subsets of J1+J2, most notably J1 and
J2 separately; these included doing an ANOVA and starting with a
qualitative analysis according to the method of Krippendorff (cited in
Carletta's paper on the kappa statistic). Others involved some form
of testing for homogeneity, or somehow reducing the classification
tables to one dimension each and computing a correlation.
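
To make the kappa-comparison idea a bit more concrete, here is a small
Python sketch of my own (not code from any respondent), using the
doctors-and-nurses abstraction above. The diagnostic categories and the
individual judgments are invented purely for illustration, and Fleiss'
kappa is just one reasonable choice of group agreement measure:

    from collections import Counter

    def fleiss_kappa(assignments, categories):
        """assignments: one row per patient; each row is the list of
        category labels assigned by the judges (every judge rates every
        patient)."""
        n_patients = len(assignments)
        n_judges = len(assignments[0])
        counts = [Counter(row) for row in assignments]
        # overall proportion of all assignments falling in each category
        p = {c: sum(cnt[c] for cnt in counts) / (n_patients * n_judges)
             for c in categories}
        # per-patient observed agreement among the judges
        P_i = [(sum(cnt[c] ** 2 for c in categories) - n_judges)
               / (n_judges * (n_judges - 1))
               for cnt in counts]
        P_bar = sum(P_i) / n_patients           # mean observed agreement
        P_e = sum(v ** 2 for v in p.values())   # agreement expected by chance
        return (P_bar - P_e) / (1 - P_e)

    # Invented data: rows are patients, columns are judges.
    categories = ["flu", "cold", "allergy"]
    doctors = [["flu", "flu", "cold"],
               ["cold", "cold", "cold"],
               ["allergy", "flu", "allergy"]]   # J1: three doctors
    nurses = [["flu", "flu"],
              ["cold", "allergy"],
              ["allergy", "allergy"]]           # J2: two nurses, same patients
    pooled = [d + n for d, n in zip(doctors, nurses)]

    print("kappa(J1+J2) =", fleiss_kappa(pooled, categories))
    print("kappa(J1)    =", fleiss_kappa(doctors, categories))
    print("kappa(J2)    =", fleiss_kappa(nurses, categories))

The intuition, as I understand it, is that if kappa for J1+J2 is not
appreciably lower than kappa for J1 and J2 taken separately, then pooling
the nurses with the doctors has not introduced much additional
disagreement.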

What I take to be the definitive response (thanks to Steve Johnson)
comes from Hripcsak, Friedman, et al., "Unlocking Clinical Data from
Narrative Reports: a Study of Natural Language Processing", Annals of
Internal Medicine, Vol 122, No 9, 1995. In this paper the authors
compare the performance of a group of doctors, a group of laypersons,
a group of simple keyword-based algorithms, and an NLP system, on the
task of detecting the presence or absence of six clinical conditions
based on the text in chest radiograph reports. The paper goes into
considerable detail, but the basic idea was to quantify intersubject
disagreement by a "distance" measure and then look for statistically
significant distances of individual judges from the average
intersubject distance among physicians. Notably, for the present
discussion, the experimenters also did the same analysis using kappa
rather than distance as the pairwise measure of interrater agreement,
and the results led to identical conclusions, with only the scale
changing. (Let me say as a side note that although this paper appears
in an out-of-the-way place for NLP researchers, it is well worth
reading; an *excellent* example of very thorough, very principled
evaluation methodology for an NLP system on a real-world task.)
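
For readers who want the flavor of that analysis without pulling the
paper, here is a rough sketch of my own (emphatically not the authors'
actual procedure): the judgments are invented, the "distance" is simply
the proportion of reports on which two judges disagree, and the
significance testing the authors perform is omitted entirely:

    from itertools import combinations

    def disagreement(a, b):
        # pairwise "distance": proportion of reports on which two judges differ
        return sum(x != y for x, y in zip(a, b)) / len(a)

    def mean_distance_to(judge, group):
        return sum(disagreement(judge, other) for other in group) / len(group)

    # Invented binary judgments (1 = condition present) on ten reports.
    physicians = [[1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
                  [1, 0, 1, 0, 0, 0, 1, 0, 1, 1],
                  [1, 1, 1, 1, 0, 0, 1, 0, 0, 1]]
    nlp_system = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
    layperson  = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

    # Baseline: average intersubject distance among the physicians themselves.
    pairs = list(combinations(physicians, 2))
    baseline = sum(disagreement(a, b) for a, b in pairs) / len(pairs)

    print("physician baseline:      ", baseline)
    print("NLP system vs physicians:", mean_distance_to(nlp_system, physicians))
    print("layperson vs physicians: ", mean_distance_to(layperson, physicians))

On this view, a judge whose average distance to the physicians is close to
the physicians' own intersubject baseline is performing roughly as well as
a physician, and the paper's contribution is to put a proper significance
test behind that comparison.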

Thanks again to all who responded.

Best,

Philip

----------------------------------------------------------------------------
Philip Resnik                     E-mail: philip.resnik@east.sun.com
Sun Microsystems Laboratories     Work:   (508) 442-0841
Two Elizabeth Drive               Fax:    (508) 250-5067
Chelmsford, MA 01824-4195 USA
----------------------------------------------------------------------------