Re: Corpora: Statistics in genre differences

Tony Berber Sardinha (tony4@uol.com.br)
Sun, 21 Mar 1999 00:09:32 -0300

Kristen,

You raise very important issues; I'd just like to make a few general
comments.

I think part of the problem with these frequency counts is that they
generally do not take into account where features occur in the text. In
other words, it is not just genre salience (frequency across texts) that
matters but also textual salience (use in text), and features become more
or less 'text salient' depending, for instance, on whether they occur in
thematic or rhematic position, paragraph-initial or final position, as
markers of section or rhetorical boundaries, etc. For the reader or
listener, then, there is a difference between occurrence and use. Simply
norming the counts to 500 words, 1000 words, etc. won't take account of
that, because these units (500-word or 1000-word stretches) are 'analyst'
units, not 'reader' units; that is, they don't reflect how texts are
processed. What would be needed to account for textual salience would
perhaps be to base the counts on some sort of linguistically valid (?) unit
such as the clause, the sentence, the paragraph, or the rhetorical
division. So, for instance, instead of taking the
frequencies for each individual text or genre (and then normalizing those
counts), you could take the frequency for each feature per clause / T-unit
/ paragraph etc, normalize the counts, and so on.
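
To make the contrast concrete, here is a rough sketch of the two kinds of
norming: per 1000 running words versus per textual unit. It is only an
illustration under loud assumptions: the function names, the regex
tokenizer, and the naive sentence and paragraph splitting are all
hypothetical stand-ins, and clauses or T-units would need a parser or
manual segmentation.

```python
import re

def tokenize(text):
    # Crude word tokenizer (an assumption; a real study would use tagged text).
    return re.findall(r"[a-z']+", text.lower())

def split_units(text, unit):
    # Naive segmentation: blank lines mark paragraphs, end punctuation
    # followed by whitespace marks sentences.
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError("unit must be 'sentence' or 'paragraph'")
    return [p for p in parts if p.strip()]

def rate_per_1000_words(text, feature_words):
    # 'Analyst' unit: occurrences per 1000 running words.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in feature_words)
    return 1000.0 * hits / len(tokens)

def unit_rates(text, feature_words, unit="sentence"):
    # 'Reader' unit: mean occurrences per unit, and the share of units
    # that contain the feature at least once.
    units = split_units(text, unit)
    counts = [sum(1 for t in tokenize(u) if t in feature_words) for u in units]
    if not counts:
        return 0.0, 0.0
    mean_hits = sum(counts) / len(counts)
    share_with_hit = sum(1 for c in counts if c > 0) / len(counts)
    return mean_hits, share_with_hit
```

Calling unit_rates(text, hedges, "paragraph") with a hypothetical set of
hedge words gives figures tied to a unit the reader actually processes,
which can then be compared across texts or genres in the usual way.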

> frequencies, it's hard to assume that the reader would notice the
> difference between 2 per thousand words and 5 per thousand words.

Perhaps they would, if the three extra occurrences all fell within the same
clause, paragraph, communicative / rhetorical segment, etc.
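
As a toy illustration of that point, the sketch below counts occurrences
paragraph by paragraph for two made-up texts that have exactly the same
overall rate (3 hedges in 9 words) but distribute them differently. The
hedge list, the example texts, and the function name are hypothetical, and
paragraphs are assumed to be separated by blank lines.

```python
import re

def hits_per_paragraph(text, feature_words):
    # Count feature occurrences separately within each paragraph.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    counts = []
    for paragraph in paragraphs:
        tokens = re.findall(r"[a-z']+", paragraph.lower())
        counts.append(sum(1 for t in tokens if t in feature_words))
    return counts

hedges = {"perhaps", "possibly", "maybe"}  # hypothetical feature list

# Same length and same number of hedges, different placement.
spread = "Perhaps it rained.\n\nPossibly it snowed.\n\nMaybe it cleared."
clustered = "Perhaps, possibly, maybe it rained.\n\nIt snowed.\n\nIt cleared."

spread_counts = hits_per_paragraph(spread, hedges)
clustered_counts = hits_per_paragraph(clustered, hedges)
```

Both texts norm to the same figure per thousand words, yet the largest
per-paragraph count differs, which is the kind of difference a count based
on word stretches alone would not show.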

cheers

tony.
-------------------------------
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
tony4@uol.com.br
http://sites.uol.com.br/tony4/homepage.html
http://homepages.infoseek.com/~corpuslinguistics/homepage.html
-------------------------------

----------
> From: Kristen Precht <kprecht@ruby.iupui.edu>
> To: CORPORA@uib.no
> Subject: Corpora: Statistics in genre differences
> Date: 19 March 1999 18:11
>
> I have had a similar quandary about interpreting statistics when comparing
> differences in PoS tags across genres. I use Doug Biber's tags, and am
> interested in comparing features (PoS or other identified textual
> features, such as tagged metalanguage markers) across genres or within a
> genre as written by L1 and L2 speakers of English. I have been running
> ANOVAs or t-tests on normed and standardized tag frequencies, and can find
> an array of features which show significant differences, and have run
> principal components analysis to compare which features seem most
> "salient" across genres or L1 groups.
>
> But this raises questions about the relationship between statistical
> significance and language data. For example, if a particular text is
> significantly higher in hedges or emphatics, does that mean that the
> difference would be noticeable to a reader? Or conversely, it seems that a
> non-significant feature could still be quite noticeable to the reader.
> This is especially problematic, it seems, with features that have very low
> frequencies ... it is not difficult to find significant differences, yet
> with such low overall frequencies, it's hard to assume that the reader
> would notice the difference between 2 per thousand words and 5 per
> thousand words.
>
> Of course I could run a field experiment on texts with different degrees
> of hedges, emphatics and such to rate reader sensitivity ... *sigh* ...
> but that would have to be done feature by feature, and may not adequately
> take into account the role of co-occurrence of features.
>
> Has anyone else come across literature, or had thoughts, on the role of
> statistics in making comparisons between genres, or any other corpus
> comparisons? I have often seen assumptions that significant difference can
> be used to 'categorize' genres or corpora, but I'm just not comfortable
> with that yet. I've been struggling with this question for a while and am
> not happy with the options I've come up with.
>
> Kristen Precht
> Northern Arizona University
> kprecht@iupui.edu
>