RE: Corpora: frequency lists for clusters & MWU

Christopher Tribble (ctribble@serendib.ccom.lk)
Wed, 7 Oct 1998 10:33:23 +0530

I think I missed the beginning of this dialogue, but if you are interested
in digrams, can I strongly recommend using the KeyWords program in
WordSmith Tools. The procedure is:
1. generate a digram wordlist for a reference corpus (LOB will do)
2. generate a digram wordlist for the research corpus
3. generate a keyword list based on 2-word clusters.
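
For anyone without WordSmith to hand, the three steps can be roughly sketched in Python. This is only an illustration: it assumes the two-term log-likelihood statistic often used for corpus comparison, which may not match WordSmith's keyness figure exactly, and the function names are made up for the example.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Count adjacent word pairs (2-word clusters) in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def keyness(a, b, n1, n2):
    """Log-likelihood for an item seen a times among n1 research bigrams
    and b times among n2 reference bigrams (assumed formula, see lead-in)."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in the research corpus
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_bigrams(research_tokens, reference_tokens, top=40):
    """Steps 1-3: two bigram lists, then bigrams ranked by keyness."""
    res = bigrams(research_tokens)    # step 2: research corpus list
    ref = bigrams(reference_tokens)   # step 1: reference corpus list
    n1, n2 = sum(res.values()), sum(ref.values())
    if n1 == 0 or n2 == 0:
        return []
    scored = []
    for bg, a in res.items():
        b = ref.get(bg, 0)
        # keep only bigrams that are relatively more frequent in the research corpus
        if a / n1 > b / n2:
            scored.append((keyness(a, b, n1, n2), bg, a, b))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:top]
```

Each result is a tuple of (keyness, bigram, research frequency, reference frequency), so it can be printed in roughly the same column order as a WordSmith keyword list.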

I give below the top 40 of an example using BNC Written core as the
reference corpus (1 million words) and a research corpus of project
proposals I've been using for my PhD. You really can begin to get somewhere
interesting this way.

The columns are a bit all over the place -- you can get round that by
importing the tab-delimited file into Excel -- but you can get an idea.
Hope this is useful.

Chris Tribble

WordSmith Tools -- 07/10/98 10:30:55
all 499 entries
(tip : convert to table; columns are delimited with tabs)

 N  WORD                     FREQ.  PHD02.LST %  FREQ.  BNC2.LST %  KEYNESS  P
 1  THE PROJECT                337         0.29     40               1334.4  0.000000
 2  DEVELOPMENT OF             243         0.21     83                786.9  0.000000
 3  WILL BE                    513         0.45    858        0.08    767.5  0.000000
 4  IN POLAND                  169         0.15     10                717.7  0.000000
 5  OF REFERENCE               147         0.13      0                689.6  0.000000
 6  BRITISH COUNCIL            135         0.12      0                633.3  0.000000
 7  ENVIRONMENTAL EDUCATION    132         0.12      0                619.2  0.000000
 8  TECHNICAL ASSISTANCE       131         0.11      0                614.5  0.000000
 9  EXPERIENCE OF              148         0.13     15                597.2  0.000000
10  THE TRAINING               120         0.10      6                515.9  0.000000
11  AND TRAINING               114         0.10      3                507.5  0.000000
12  THE DEVELOPMENT            163         0.14     67                500.7  0.000000
13  TEAM LEADER                100         0.09      0                469.1  0.000000
14  THE PMU                     98         0.09      0                459.7  0.000000
15  TERMS OF                   162         0.14     94                442.3  0.000000
16  EXPERIENCE IN               94         0.08      0                440.9  0.000000
17  TRAINING AND               113         0.10     16                436.6  0.000000
18  THE TEAM                   106         0.09     18                398.1  0.000000
19  IMPLEMENTATION OF           79         0.07      0                370.6  0.000000
20  THE PROGRAMME              100         0.09     22                358.4  0.000000
21  OF TRAINING                 85         0.07      8                345.8  0.000000
22  PROJECT MANAGEMENT          73         0.06      0                342.4  0.000000
23  AND MANAGEMENT              72         0.06      0                337.7  0.000000
24  THE CZECH                   72         0.06      0                337.7  0.000000
25  MINISTRY OF                 92         0.08     19                333.8  0.000000
26  ASSISTANCE TO               76         0.07      3                331.6  0.000000
27  AND SLOVAK                  70         0.06      0                328.3  0.000000
28  THE IMPLEMENTATION          68         0.06      0                319.0  0.000000
29  CZECH AND                   68         0.06      0                319.0  0.000000
30  THE TERMS                   90         0.08     22                315.6  0.000000
31  MANAGEMENT TRAINING         66         0.06      0                309.6  0.000000
32  MANAGEMENT OF               75         0.07      7                305.4  0.000000
33  TRAINING PROGRAMME          65         0.06      0                304.9  0.000000
34  TRAINING PROGRAMMES         64         0.06      0                300.2  0.000000
35  WE PROPOSE                  64         0.06      0                300.2  0.000000
36  PHARE PROGRAMME             63         0.06      0                295.5  0.000000
37  PUBLIC ADMINISTRATION       62         0.05      0                290.8  0.000000
38  ORGANISATION AND            62         0.05      0                290.8  0.000000
39  DEVELOPMENT AND             79         0.07     18                281.1  0.000000
40  EMPLOYMENT SERVICE          59         0.05      0                276.7  0.000000

-----Original Message-----
From: Ted E. Dunning [SMTP:ted@aptex.com]
Sent: Wednesday, October 07, 1998 5:55 AM
To: przemka@main.amu.edu.pl
Cc: corpora@hd.uib.no
Subject: Re: Corpora: frequency lists for clusters & MWU

Frequency lists for single words are highly suspect, especially below
roughly the thousandth most common word. The utility of a frequency
list for multi-word units is even more doubtful.

That being said, I would be happy to offer up several of the most
common bigrams from a small corpus (1M words) as an illustration of
how little you are likely to learn from frequency sorting bigrams:

#S The
of the
in the
said #S
to the
AP #S
#S #D
on the
for the
and the
said the
in a
at the
#S He
#S In
by the
to be
#S But
with the
of a

Here #S indicates a sentence boundary and #D a document boundary. The
only items of interest are the bigrams which include the word "said".
Their prevalence is caused by the fact that this text was from the AP
newswire.
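
A list like the one above can be produced with a few lines of Python. This is a minimal sketch, not the code actually used: it assumes that #S is written after every sentence and #D after every document, which is only a guess consistent with pairs such as "said #S" and "#S #D", and the function name is hypothetical.

```python
from collections import Counter


def top_bigrams(documents, n=20):
    """Return the n most frequent bigrams, boundary markers included.

    documents: list of documents, each a list of sentences,
    each sentence a list of word tokens.
    """
    stream = []
    for doc in documents:
        for sentence in doc:
            stream.extend(sentence)
            stream.append("#S")  # assumed: marker follows each sentence
        stream.append("#D")      # assumed: marker follows each document
    counts = Counter(zip(stream, stream[1:]))
    return counts.most_common(n)
```

Sorting by raw count in this way pushes function-word pairs and boundary markers to the top, which is the point being made about frequency sorting.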

There *are* other ways to look at word co-occurrence besides frequency
sorting. I tend to like to plug my Computational Linguistics paper
(CL volume 19, number 1, pages 61-74) where I introduced a useful
statistical measure for finding interesting collocations. There are
many other measures which people use for various purposes.
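
As an illustration, and assuming the measure meant here is the log-likelihood ratio (G2) statistic computed on a 2x2 contingency table for a candidate bigram "A B", a sketch in Python might look like this. The function and argument names are invented for the example.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G2) for a 2x2 table of bigram counts.

    k11: A followed by B          k12: A followed by something else
    k21: something else, then B   k22: neither A first nor B second
    """
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    g2 = 0.0
    for k, row, col in cells:
        if k > 0:
            # observed count over expected count (row * col / total)
            g2 += k * math.log(k * total / (row * col))
    return 2 * g2
```

A score near zero means the two words co-occur about as often as chance predicts; larger scores indicate a stronger association, so bigrams can be ranked by this value instead of by raw frequency.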

>>>>> "pk" == Przemyslaw KASZUBSKI <przemka@main.amu.edu.pl> writes:

pk> Another question: Are there frequency lists of English
pk> (lemmatised/non-lemmatised) 2-3-4-5 word clusters available
pk> anywhere, preferably retrieved from large balanced corpora? Or
pk> frequency lists of multi-word-units?