Corpora: Summary: Size of representative corpus

Tony Berber Sardinha (tony4@uol.com.br)
Wed, 26 Aug 1998 08:16:46 -0300

Thanks to everyone who responded to my query:

Hypothesis:
'A representative corpus should include the majority of the types
in the language as recorded in a comprehensive dictionary.
Thus, assuming that:
(a) a dictionary entry is analogous to a type;
(b) dictionary x is comprehensive;
(c) dictionary x has 100,000 entries;
(d) a majority is 1/2 + 1;
a representative corpus would need to have 50,001 types.'

Questions:
(1) How could we estimate the number of tokens required
to get 50,001 types?

(2) Would this be a proper criterion? What are the possible
flaws in the argument?

Below is a summary in five parts: comparison of estimates, Question 1
(individual estimates), Question 2 (rationale), bibliography, and online
sources.

The full thread can be retrieved (I think) from the Corpora archive and
(soon) at http://users.uol.com.br/tony4/homepage.html

============

**** Comparison of estimates (best viewed in a fixed-width font):

                Min. occ.   Min. total  Estimated tokens required
Unit       Qty    of each  occurrences         (a)            (b)      (c)        (d)
Types   50,001          1       50,001     569,862        858,327  900,000    570,000
Types   50,001          5      250,005                 21,458,164           3,250,000
Types   50,001         20    1,000,020  11,000,000    343,330,618
Lemmata 50,001          1       50,001                  3,505,417
Lemmata 50,001          5      250,005                 87,635,425
Lemmata 50,001         20    1,000,020              1,402,166,808

(a) Kilgarriff, based on Zipf's law
(b) Application of the formulae in Sanchez and Cantos 1997 (see reference
    in this summary)
(c) Downs, based on a corpus
(d) Downs, based on Zipf's law

============

**** Question 1 - Estimates:

Hellfried Sabathy:

I would rather argue:
of these 100,000 types, 20,000 make up 80% of the corpora
from which the dictionary was taken;
therefore, a corpus encompassing "most of" these 20,000
types can be considered to model the original corpus in
a representative way.
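
As a rough sanity check on the 80% figure, a small Python sketch (my own
illustration, assuming a pure 1/rank Zipf distribution over 100,000
types, which is not something Sabathy states) gives about 87% coverage
for the top 20,000 types:

# Share of all tokens accounted for by the top_k most frequent types,
# assuming frequencies proportional to 1/rank (Zipf) over n_types types.
def coverage(top_k, n_types=100000):
    weights = [1.0 / r for r in range(1, n_types + 1)]
    return sum(weights[:top_k]) / sum(weights)

print(round(coverage(20000), 2))   # ~0.87, broadly consistent with 80%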

=================

Adam Kilgarriff:

... words are Zipf-distributed, so you will need about

   sum from i=1 to i=50,001 of 50,001/i  =  569,862 tokens

However, this would be a ludicrously small corpus for looking at word use
for all but the very common words. If you want to be able to say
something about the behaviour of the 50,001 most frequent words, you
need, say, a minimum of 20 instances of each, so the constant number
(rank x freq) is now 50,001 x 20, and the sum is 20 times higher, at
about 11 million words.
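
Kilgarriff's figures can be reproduced with a short Python sketch (a
minimal illustration assuming a pure 1/rank Zipf distribution; the
function name zipf_tokens is mine, not part of the original message):

# Tokens needed so that the rarest of n_types Zipf-distributed types
# occurs at least min_occ times: rank r then occurs min_occ*n_types/r times.
def zipf_tokens(n_types=50001, min_occ=1):
    return sum(min_occ * n_types / rank for rank in range(1, n_types + 1))

print(round(zipf_tokens(min_occ=1)))    # ~570,000 tokens (cf. 569,862 above)
print(round(zipf_tokens(min_occ=20)))   # ~11.4 million (cf. 11 million above)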

================

Iain Downs:

I rather think that the size of a 'representative corpus' depends on what
you want to do with it.

For example, if you start from the assumption that all your 50,001 words
must appear in it, then you must ask, 'how many times?'.

The way in which a word is used does not really pop out until you've seen
a number of examples (personally, I tend to use 5 as a lower limit, but
that is for VERY empirical work).

Because of Zipf's law, going from at least 1 occurrence in the corpus to
at least 5 requires more than 5 times the amount of text (assuming that
you are picking your text randomly rather than to suit your need to get 5
examples!).
(...)
In scanning a 90-million-word corpus taken from commercial news sources,
I found that the number of distinct word forms increased as the square
root of the number of words - roughly half a million separate word forms
for the 90 million words... By this estimate, 50,000 word forms would
require 900,000 tokens.

Another way to get at the number of tokens is to use Zipf's law as a
series (frequency is proportional to 1/rank).

So if the frequency at rank 50,000 is one, then the frequency at rank 1
is 50,000, and the total number of tokens is
50,000 * (1/1 + 1/2 + ... + 1/50,000), which a little program I wrote
(rather than trying the maths!) indicates is about 570,000.

To get the 5 at the 50,000 rank seems to require 250,000 distinct word
forms, requiring some 3.25 million words.

You will notice that theory and experiment at 50,000 word forms are out
by a factor of less than 2 - not bad, eh?

They are worse at the larger numbers - perhaps Zipf didn't have the
benefit of large computers for his word counting, and the 1/n rule is a
poor approximation at large numbers!
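
Both of the Downs estimates can be reproduced with a short Python sketch
(my own illustration under the assumptions stated above: types growing as
the square root of tokens, calibrated on 500,000 word forms per 90
million tokens, and a pure 1/rank Zipf series for the second method; the
function names are illustrative):

import math

# (c) Corpus-based estimate: types grow roughly as the square root of
# tokens, calibrated on ~500,000 word forms in a 90-million-token corpus.
def tokens_sqrt(n_types, ref_types=500000, ref_tokens=90000000):
    k = ref_types / math.sqrt(ref_tokens)   # types = k * sqrt(tokens)
    return (n_types / k) ** 2

# (d) Zipf-based estimate: if rank 50,000 occurs min_occ times, rank 1
# occurs min_occ*50,000 times, and word forms keep appearing down to the
# rank whose expected frequency falls to 1.
def tokens_zipf(n_types=50000, min_occ=1):
    top_freq = min_occ * n_types             # frequency of rank 1
    return sum(top_freq / r for r in range(1, top_freq + 1))

print(round(tokens_sqrt(50000)))          # ~900,000 tokens
print(round(tokens_zipf(min_occ=1)))      # ~570,000 tokens
print(round(tokens_zipf(min_occ=5)))      # ~3.25 million tokens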

============

**** Question 2 - Rationale:

Jon Mills:

A dictionary entry more usually relates to a lexeme and
a lexeme may be realised by a number of types. One also
has to consider how the dictionary that you are using
treats derivatives (as run-ons or as separate entries).
There is also a sort of circularity in the notion of
"comprehensive dictionary". Isn't a "comprehensive
dictionary" one that includes entries for the majority
of lexical items found in the corpus?

============

Markus Schulze:

Furthermore, the notion of "representativeness" of a corpus should
include the frequency of lexemes (or even of free morphemes, in order to
handle derivatives and compounds properly). If frequency is not taken
into account, you might as well just take the comprehensive dictionary.

============

Pascual Cantos Gomez:

It would be useful to start by making a distinction between lemma/lexeme,
type and token. Consider the following word sequence: "plays, playing,
played, play, plays, play, playing, played and played", where we have
nine words (tokens), four word forms (types) and one lemma, namely
"play".

============

Dr Michael Klotz:

It seems to me that the basic type-unit is not the lemma but what
Cruse calls the lexical unit, i.e. "a lexical form with a single
sense". This is all the more important, since different lexical units
that share a lexical form can behave differently, e.g. with regard to
subcategorisation. For example, there is "be friendly to" (i.e.
behave in a friendly way) and "be friendly with" (i.e. be friends
with). In a representative corpus you would want to make sure that
both senses of "friendly" are covered. Once you take meaning into
account, your estimate will be much higher of course.

============

Michael Rundell:

This is absolutely right - the earlier focus just on "number of types"
seems way too simplistic - and anyway, once you get past the first 10-15K
most common words, frequency statistics become unreliable and extremely
variable across different corpora of similar size but different content.
Consider a type like "bond": if you have a corpus made up of the Wall
Street Journal, you will have thousands of instances of bond - but they
will *all* be about government bonds, junk bonds, etc. If your corpus is
chemical abstracts, you will also have thousands of bonds - but this time
"covalent bonds", "molecular bonds", etc.; similarly if your corpus is
legal texts - more bonds, but again of just one specific type.
None of these corpora will have instances of the *other* kinds of bond,
and none will have instances of the more metaphorical uses ("lifetime
bonds of friendship", etc. - you might need a fiction corpus to collect
more of those). This is why lexicographers are suspicious of the type of
large corpus (typically news text) that is cheap and easy to collect in
volume but which cannot give a very balanced picture of the full
semantic/grammatical spectrum. Each of the separate corpora mentioned
above is representative - to a degree - of its own world of discourse,
but not of the language as a whole. Most dictionary people now accept
that representativeness of the whole language isn't a realistic goal, but
achieving a reasonable balance of text types and registers is still worth
aiming for, and for this you have to have some sort of top-down approach.

============

Adam Kilgarriff:

(...) despite Cruse's efforts, the 'lexical unit' is severely lacking in
definition, from both a practical and a theoretical perspective. So we
can't (even in principle) produce a list of them, and we certainly can't
count them.

see e.g. http://www.itri.bton.ac.uk/~Adam.Kilgarriff/beleive.ps.gz

============

**** Bibliography

Pascual Cantos Gomez:

[1]
Biber, D. (1993) "Representativeness in Corpus Design". Literary and
Linguistic Computing 8(4): 243-57.

[2]
There is a transitive relationship between lemmas, types and tokens that
can be mathematically modelled. This holds for the research we carried
out for Spanish and English. Our analytic technique for predicting types
and lemmas is simple and straightforward, and the resulting formulae are
easy to use, flexible, and can be applied quickly to any corpora or
language samples (at least for Spanish and English). You can find the
formulae and the discussion in our recently published article:

Sánchez, A. and P. Cantos (1997) "Predictability of Word Forms (Types) and
Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the
CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish".
International Journal of Corpus Linguistics 2(2): 259-280.
(See abstract http://solaris3.ids-mannheim.de/~ijcl/ijcl-2-2.html).

In addition, there is a forthcoming article (in Spanish) in which we
carry out a comparison between English and Spanish regarding type and
lemma growth and predictability.

============

Ted Dunning:

Efron and Thisted (and earlier, Good and Turing) have analyzed this
problem.

See the following article for a discussion of the problem with further
references.

@article{efron87,
author={Bradley Efron and Ronald Thisted},
year=1987,
title={Did Shakespeare write a newly discovered poem?},
journal={Biometrika},
volume=74,
pages={445-455}
}

============

Henry Kucera:

John B. Carroll's lognormal model makes the predictions for English: "On
Sampling from a lognormal model of word-frequency distribution," in H.
Kucera and W.N. Francis, Computational Analysis of Present-Day American
English, Brown University Press, Providence, RI, 1967, pp. 406-424.

Carroll's analysis is based on the graphic definition of types, i.e.
distinct forms, not on lexemes (or lemmas, as a group of forms is
usually called). The quantitative relation between types in this sense
and lemmas is discussed at length in Francis and Kucera, Frequency
Analysis of English Usage, Houghton Mifflin Co., Boston, 1982.

============

**** Online sources

Chris Hogan:

Granted, Zipf didn't have the benefit of large computers (or even,
presumably, large corpora) when he formulated his laws. Nevertheless, I
do believe that he tested them on a large amount (for his time) of data.

The times that I have tested data against Zipf's laws, the agreement has
been fairly good.

A very interesting Web page on this topic is the following:
http://sun1.bham.ac.uk/G.Landini/evmt/zipf.htm

The page is about applying Zipf's laws to the Voynich manuscript, but it
has a very good description of Zipf's laws, and several references
concerning modifications to the laws to make them more closely model the
data.
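
A minimal way to run such a test oneself is sketched below (my own
illustration; the file path is a placeholder, and whitespace tokenisation
is a crude assumption):

# Check how closely a text follows Zipf's rank-frequency law: if it
# holds, rank * frequency should be roughly constant across ranks.
from collections import Counter

def zipf_check(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        counts = Counter(f.read().lower().split())
    freqs = sorted(counts.values(), reverse=True)
    for rank in (1, 10, 100, 1000, 10000):
        if rank <= len(freqs):
            print(rank, freqs[rank - 1], rank * freqs[rank - 1])

# zipf_check("some_corpus.txt")   # placeholder path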

============

Adam Kilgarriff:

[re: Cruse's 'lexical unit', see comments above]

see e.g. http://www.itri.bton.ac.uk/~Adam.Kilgarriff/beleive.ps.gz

============

Pascual Cantos Gomez:

http://solaris3.ids-mannheim.de/~ijcl/ijcl-2-2.html

for abstract of:

Sánchez, A. and P. Cantos (1997) "Predictability of Word Forms (Types) and
Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the
CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish".
International Journal of Corpus Linguistics 2(2): 259-280.

------------------------------------------------------------------------
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
tony4@uol.com.br
http://sites.uol.com.br/tony4/homepage.html
http://www.liv.ac.uk/~tony1/homepage.html
http://www.liv.ac.uk/~tony1/corpus.html
http://members.wbs.net/homepages/c/o/r/corpuslinguistics.html
------------------------------------------------------------------------