RE: Corpora: Statistics in genre differences

Kristen Precht (kprecht@iupui.edu)
Mon, 22 Mar 1999 20:54:42 -0500

Next message: Yorick Wilks: "Corpora: EPSRC studentship in NLP/CL available now at Sheffield"
Previous message: Dan I. SLOBIN: "Re: Corpora: Question from the University of Padua"

Actually, what I meant was not exactly the same word, but cases where a
category of words, such as hedges, occurs 2 or 5 times per 1000 words. For
example, you could treat the words nearly, seems, and approximately as hedges,
and I'd like to see whether other such categories would be noticed.
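
To make the "per 1000" figure concrete, here is a minimal sketch of how a category rate might be computed for a single text. The hedge list contains only the three words named above, and the tokenizer is a deliberately crude assumption; the function name rate_per_thousand is hypothetical and not taken from any tool mentioned in this thread.

```python
import re

# Illustrative hedge list only: the three words named in the message.
# A real study would need a fuller, validated list.
HEDGES = {"nearly", "seems", "approximately"}

def rate_per_thousand(text, category=HEDGES):
    """Return occurrences of category words per 1,000 running words."""
    # Crude tokenizer (an assumption): lowercase alphabetic strings,
    # allowing apostrophes so that "I'd" stays one token.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in category)
    return 1000.0 * hits / len(tokens)
```

Under these assumptions, a 1000-word text containing "seems" once and "nearly" once would come out at 2.0 hedges per 1000 words.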

I've tried Ken Litkowski's MCCA analysis, at his suggestion, on the texts,
but am finding that features like hedges, intensifiers, totality markers are
not marked as "content", it seems, and tend to end up in the uncategorized
section. This is a fascinating methodology, though, and it may well be the
method to pursue in getting at some differences. Has there been other work
on such interpretive markings in content analysis? I'm afraid my background
is more in discourse analysis than NLP.

Tony's suggestion is a great one: to look at the differences between the
units that readers focus on more closely.

Thanks so much for all suggestions for further investigation.

Kristen Precht

Northern Arizona University

-----Original Message-----
From: James L. Fidelholtz [mailto:jfidel@siu.buap.mx]
Sent: Monday, March 22, 1999 9:49 AM
To: kprecht@iupui.edu
Cc: CORPORA@uib.no
Subject: Re: Corpora: Statistics in genre differences

On Fri, 19 Mar 1999, Kristen Precht wrote:
[snip]
>..., it's hard to assume that the reader would notice the difference
>between 2 per thousand words and 5 per thousand words.

Not only is it not hard, it is impossible to assume that they WOULDN'T
notice such a gross difference
