Re: Corpora: Statistics in genre differences

Ken Litkowski (ken@clres.com)
Sat, 20 Mar 1999 16:25:10 -0500

The techniques used in Minnesota Contextual Content Analysis (MCCA)
provide a slightly different mechanism for identifying differences in
genre, sensitive down to the level of assessing consistency in Likert
scales. A dictionary was built, assigning each word to one of 120
categories. The Brown corpus was then scored using the emphasis of each
text on these categories; the results were analyzed with principal
components analysis to yield four major groups of language function
(analytic, practical, emotional, traditional). Based on the components
(contexts), a fresh batch of texts (sentences, interviews, open-ended
questions in questionnaires, papers, different characters in a play,
books) can be analyzed and scored on context and emphasis. The scoring
leads to 4-dimensional and 120-dimensional vectors characterizing the
text. The vectors are then further analyzed using multidimensional
scaling to identify the lowest-stress fit of the clustering.
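
To make the scoring step concrete, here is a minimal sketch in Python.
It is not the MCCA implementation: the dictionary below is a
hypothetical toy with three invented categories rather than 120, and
"emphasis" is approximated as a category's share of all tokens, which
is an assumption. The principal components and multidimensional
scaling steps are not shown.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (word -> category). The category names and
# word lists are invented for illustration; they are not MCCA's.
TOY_DICTIONARY = {
    "perhaps": "hedge",
    "maybe": "hedge",
    "possibly": "hedge",
    "very": "emphatic",
    "really": "emphatic",
    "certainly": "emphatic",
    "we": "self_other",
    "you": "self_other",
}

# Fixed category order so every text yields a comparable vector.
CATEGORIES = sorted(set(TOY_DICTIONARY.values()))


def emphasis_vector(text):
    """Return each category's share of all tokens, ordered as CATEGORIES.

    Assumption: emphasis is approximated by relative frequency.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(TOY_DICTIONARY[t] for t in tokens if t in TOY_DICTIONARY)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[c] / total for c in CATEGORIES]
```

Each text, whether a sentence or a book, then becomes one vector of
the same length, and those vectors are what the later analysis steps
would take as input.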

Kristen Precht wrote:
[snip]
> But this begs questions on the relationship between statistical significance
> and language data. For example, if a particular text is significantly higher
> in hedges or emphatics, does that mean that the difference would be
> noticeable to a reader? Or conversely, it seems that a non-significant
> feature could still be quite noticeable to the reader. This is especially
>
> Of course I could field run an experiment on texts with different degrees of
> hedges, emphatics and such to rate reader sensitivity ... *sigh* ... but
> that would have to be done feature by feature, and may not adequately take
> into account the role of co-occurrence of features.
>

Our techniques would pick up the differences in emphasis and do take
into account co-occurrences. The difficulty is that one needs to
examine the results (graphs) and then use the other statistics that are
generated to go back and find out precisely what gave rise to the
differences.
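
As a rough illustration of that going-back step, the sketch below
ranks categories by how much their scores differ between two texts.
The function name and the sample scores are hypothetical, and this
simple ranking merely stands in for the other statistics mentioned
above; it is not what MCCA itself reports.

```python
def rank_category_gaps(scores_a, scores_b):
    """Sort categories by absolute score difference, largest first.

    scores_a and scores_b map category name -> score; a category
    missing from one text is treated as 0.0 (an assumption).
    """
    categories = set(scores_a) | set(scores_b)
    gaps = {c: scores_a.get(c, 0.0) - scores_b.get(c, 0.0) for c in categories}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(gaps.items(), key=lambda item: (-abs(item[1]), item[0]))


# Invented scores for two texts, for illustration only.
text_a = {"hedge": 0.10, "emphatic": 0.02}
text_b = {"hedge": 0.03, "emphatic": 0.04, "practical": 0.01}
ranked = rank_category_gaps(text_a, text_b)
```

The categories at the head of the ranked list are the first places to
look when a graph shows two texts lying far apart.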

> Has anyone else come across literature, or had thoughts on the role of
> statistics in making comparisons between genres, or any other corpus
> comparisons? I have often seen assumptions that significant difference can
> be used to 'categorize' genres or corpora, but I'm just not comfortable with
> that yet. I've been struggling with this question for a while and am not
> happy with the options I've come up with.
>

I ran MCCA against Adam Kilgarriff's `gold standard' Known-Similarity
Corpora to good effect and was even able to question the presumed
similarity of some of his textual materials (e.g., even though a set of
texts might have been drawn from the Guardian, they could have been
drawn from different genres such as movie reviews, gardening tips, and
straight news).

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken@clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com