Re: Corpora: MWUs and frequency

Oliver Mason (oliver@clg.bham.ac.uk)
Fri, 9 Oct 1998 09:31:12 +0100

> Eg, some words are
> 'naturally' underrepresented in the data (eg colloquial words like
> 'berserk').
In the Bank of English, `berserk' has the following relative frequencies
(occurrences per million words), by subcorpus:

Sun/News of the World   5.2/million
Today                   3.6/million
British Magazines       2.0/million
Spoken British          1.6/million
Oz-Newspapers           1.5/million
British Books           1.4/million
The Guardian            1.4/million
US Books                1.3/million
The Times               1.2/million
The Independent         1.1/million
NPR (US Radio)          0.8/million
New Scientist           0.5/million
US-Newspapers           0.3/million
BBC                     0.3/million
Economist               0.2/million

So this would then qualify it as `familiar' for readers of the `Sun', but
unfamiliar for readers of the Economist... I would say it obviously depends
heavily on the contents of your corpus. I don't have access to it, but I am
sure the counts in COLT (the Bergen Corpus of London Teenage Language) would
be even higher than in the British tabloids (but intuition doesn't count
anyway).
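
For anyone who wants to produce comparable figures from their own data, here
is a minimal sketch of the normalisation behind a per-million table. The
subcorpus names, hit counts and token totals are invented for illustration
and are not Bank of English figures; `per_million` and `normalised` are
hypothetical helper names.

```python
# Hypothetical raw hit counts and subcorpus sizes (in tokens); these
# numbers are made up for illustration and are NOT Bank of English data.
SUBCORPORA = {
    "tabloid": (52, 10_000_000),
    "weekly": (3, 15_000_000),
}


def per_million(hits, tokens):
    """Relative frequency: occurrences per million running words."""
    if tokens <= 0:
        raise ValueError("subcorpus size must be positive")
    return hits * 1_000_000 / tokens


def normalised(subcorpora):
    """Map each subcorpus name to its per-million rate, highest first."""
    rates = {name: per_million(h, n) for name, (h, n) in subcorpora.items()}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    for name, rate in normalised(SUBCORPORA).items():
        print(f"{name:<10} {rate:.1f}/million")
```

Dividing by subcorpus size is what makes the rows comparable at all; it does
nothing to settle which mix of subcorpora the overall figure should reflect.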

Even something as `simple' as a word frequency list is full of pitfalls and
leads back to one of the fundamental problems of corpus linguistics: how
valid are the statements we make about our findings in general? Do we need
a representative corpus, and what would it look like? What are we looking at
in the first place? The dilemma (in my humble opinion) is that nobody in the
world has exactly the same amount of exposure to newspaper language, spoken
language of different style levels etc., so that it is impossible to define
what `representative' means in any definitive way. Whose language are we
modeling, then? Any answers to this question would be greatly appreciated...
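
To make the exposure problem concrete, here is a rough sketch that treats a
reader's "expected" frequency as an exposure-weighted mean of per-subcorpus
rates. The rates are taken from the table in this message; the two exposure
profiles and the name `expected_rate` are assumptions made up purely for
illustration, not measurements of any real reader.

```python
# Per-million rates for `berserk', copied from the table in this message.
RATES = {
    "Sun/News of the World": 5.2,
    "Spoken British": 1.6,
    "The Guardian": 1.4,
    "Economist": 0.2,
}


def expected_rate(exposure, rates=RATES):
    """Exposure-weighted mean rate; weights are rescaled to sum to 1."""
    total = sum(exposure.values())
    if total <= 0:
        raise ValueError("exposure weights must sum to a positive value")
    return sum(rates[src] * w for src, w in exposure.items()) / total


# Two hypothetical readers (the weights are invented, not measured).
tabloid_reader = {"Sun/News of the World": 0.7, "Spoken British": 0.3}
weekly_reader = {"Economist": 0.7, "Spoken British": 0.3}

if __name__ == "__main__":
    for label, profile in [("tabloid", tabloid_reader), ("weekly", weekly_reader)]:
        print(f"{label}: {expected_rate(profile):.2f}/million")
```

The same word comes out several times more frequent for one profile than for
the other, so any single "representative" figure silently picks one set of
weights.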

Oliver Mason

-- 
//\\ computer officer | corpus research | department of english | school of  -
//\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
\\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
\\// mobile 07050 104504 | http://www-clg.bham.ac.uk | o.mason@bham.ac.uk\/  -