Re: Corpora: wordcounts

Arnold J Kreps (A.Kreps@let.kun.nl)
Thu, 15 Apr 1999 14:35:38 +0200

David Carlson wrote:

>I have a question about how various programs count words-
>
>I am aware that different programs will give different word counts depending
>on what the programs consider a word.
>However, when I ran three different programs on the same file, I got rather
>different results even for ten function words: "the," (16,321 vs. 15,852
>vs. 15,872 tokens) "of," "and," etc.

A related question is how the various programs deal with contracted
forms (such as "doesn't"), and whether they can distinguish the
s-genitive from contracted forms that end in "s" (such as "she's").

And does it matter?
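
To make the question concrete, here is a minimal Python sketch. The sample
sentence and the three tokenizing rules are my own illustrations and are not
the behaviour of any particular concordance or word-count program. It only
shows that the count for "the" can shift depending on how punctuation,
apostrophes and hyphens are treated.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen to contain a genitive, two contractions,
# a quoted word and a hyphenated form.
TEXT = """The dog's bone is where she's left it, and the cat doesn't care about "the end" or the-rest."""

def whitespace_tokens(text):
    # Rule 1: split on whitespace only; punctuation stays attached to words.
    return text.lower().split()

def alpha_tokens(text):
    # Rule 2: a word is any run of letters; apostrophes and hyphens split words.
    return re.findall(r"[a-z]+", text.lower())

def apostrophe_tokens(text):
    # Rule 3: letters, with word-internal apostrophes and hyphens kept inside the word.
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

for name, tokenize in [("whitespace", whitespace_tokens),
                       ("letters only", alpha_tokens),
                       ("internal ' and -", apostrophe_tokens)]:
    tokens = tokenize(TEXT)
    counts = Counter(tokens)
    # Report total tokens, tokens of "the", and stray "s" tokens.
    print(f"{name:18} total={len(tokens):2} the={counts['the']} s={counts['s']}")
```

On this one sentence the three rules give 2, 4 and 3 tokens of "the". The
letters-only rule also produces two bare "s" tokens, one from the genitive
"dog's" and one from the contraction "she's", and nothing in the token itself
tells them apart. Separating the two would presumably need some look at the
context, for instance part-of-speech tagging.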

Arnold J Kreps

a.kreps@let.kun.nl

Department of Business Communication Studies
Katholieke Universiteit Nijmegen
Holland
www.kun.nl