Re: Re: [Corpora-List] "normalizing" frequencies for different-sized corpora

From: Peter K Tan (PeterTan@leonis.nus.edu.sg)
Date: Tue Sep 13 2005 - 04:49:14 MET DST

    Hullo Jenny! Merely to add that depending on the kind of phenomenon you are examining and the frequency, it is possible to normalise to a per ten-thousand figure too.
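
    (Worked with the figures from the thread below: (2646 / 55166) * 10,000 ≈ 479.6 occurrences per ten thousand words, i.e. ten times the per-thousand figure of ≈ 48.)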

    The thing to watch out for if you're working with corpora of different sizes is that the total number of lemmas (lemmata/types) will increase with a bigger corpus, so statements about the top x% of lemmas will not be meaningful across corpora of different sizes (e.g. 'the word "confluence" is in the group of the 40% most frequently occurring words').
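
    To see why, here is a minimal Python sketch (the file names small.txt and big.txt are just placeholders, and split() is only a crude stand-in for proper tokenisation):

        # Count tokens and distinct word types in a plain-text file.
        def count_types(path):
            with open(path, encoding="utf-8") as f:
                tokens = f.read().lower().split()
            return len(tokens), len(set(tokens))

        for name in ("small.txt", "big.txt"):
            n_tokens, n_types = count_types(name)
            print(f"{name}: {n_tokens} tokens, {n_types} types")
        # The larger corpus will normally yield more types, so a word's
        # rank expressed as a percentage of the type list is not
        # comparable across corpora of different sizes.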

    Cheers,
    Peter (who met you at Asialex)

    At 17.04 12/9/2005 +0800, Jenny Eagleton wrote:

    Thanks for the quick response from everybody, I have got the idea now.

    Jenny
    ----- Original Message -----
    Subject: Re: [Corpora-List] "normalizing" frequencies for different-sized corpora
    From: eric@comp.leeds.ac.uk
    To: jenny@asian-emphasis.com
    Date: 12-09-2005 16:59


    Jenny,

    I may be missing something, but I think the way to find a per-thousand
    figure is simply:


    ( (freq of word) / (no of words in text) ) * 1000

    eg (200/4000) * 1000 = 50

    or (2646/55166) * 1000 = 48 (to nearest whole number)

    - of course it's up to you whether to round to the nearest whole number,
    or give the answer to 2 decimal places (47.96), or use some other level
    of accuracy; but since a text is generally only a sample or
    approximation of the language you are studying, it is sensible not to
    claim too much accuracy/significance.
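
    The same arithmetic as a minimal Python sketch (the function name
    and the rounding choice are just illustrative):

        # Normalised frequency per `per` words, rounded to `ndigits` places.
        def normalised_freq(freq, corpus_size, per=1000, ndigits=2):
            return round(freq / corpus_size * per, ndigits)

        print(normalised_freq(200, 4000))     # 50.0
        print(normalised_freq(2646, 55166))   # 47.96
        # The per-ten-thousand variant mentioned earlier in the thread:
        print(normalised_freq(2646, 55166, per=10000))  # 479.64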

    eric atwell


    On Mon, 12 Sep 2005, Jenny Eagleton wrote:

    > Hello Corpora and Statistics Experts,
    >
    > This is a very simple question for all the
    > corpora/statistics experts
    > out there, but this novice is not really
    > mathematically inclined. I
    > understand Biber's principle of "normalization",
    > however I am not sure
    > about how to calculate it. I want frequency counts
    > normalized per
    > 1,000 words of text. I can see how to do it if the
    > figures are even,
    > i.e. if I have a corpus of 4,000 words and a
    > frequency of 200, 
    > I would have a normalized figure of 50.
    >
    > But for mixed numbers, how would I calculate the
    > following: For
    > example if I have 2,646 instances of a certain
    > kind of noun in a
    > corpus of 55,166 how would I calculate the
    > normalized figure per
    > 1,000 words?
    >
    > Regards,
    >
    > Jenny
    > Research Assistant
    > Dept. of English & Communication
    > City University of Hong Kong
    >
    >
    >

    --
    Eric Atwell, Senior Lecturer, Language research group, School of Computing,
    Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-2335430 FAX: +44-113-2335468 http://www.comp.leeds.ac.uk/eric


