Re: Re: [Corpora-List] "normalizing" frequencies for different-sized corpora

From: Jenny Eagleton (jenny@asian-emphasis.com)
Date: Mon Sep 12 2005 - 11:04:05 MET DST


    Thanks to everybody for the quick responses; I
    have got the idea now.

    Jenny
    ----- Original Message -----
    SUBJECT: Re: [Corpora-List] "normalizing" frequencies for different-sized corpora
    FROM: eric@comp.leeds.ac.uk
    TO: jenny@asian-emphasis.com
    DATE: 12-09-2005 16:59
    Jenny,

    I may be missing something, but I think the way to find a
    per-thousand figure is simply:
    ( (freq of word) / (no of words in text) ) * 1000

    eg (200/4000) * 1000 = 50

    or (2646/55166) * 1000 = 48 (to nearest whole number)

      - of course it's up to you whether to round to the nearest whole
        number, or give the answer to 2 decimal places (47.96), or some
        other level of accuracy; but since a text is generally only a
        sample or approximation of the language you are studying, it is
        sensible not to claim too much accuracy/significance.
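
For anyone who would rather script the calculation, the formula above can be sketched in Python roughly as follows. The function name `per_thousand` and its `base` parameter are made up for illustration and are not part of any corpus tool; the two calls simply reproduce the figures from this message.

```python
def per_thousand(freq, corpus_size, base=1000):
    """Normalized frequency: occurrences of an item per `base` words of text.

    `per_thousand` and `base` are hypothetical names used for this sketch.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    # Multiply before dividing so that simple cases such as 200/4000 stay exact.
    return freq * base / corpus_size


even_case = per_thousand(200, 4000)      # the "even figures" example
noun_case = per_thousand(2646, 55166)    # the mixed-number example

print(even_case)             # 50.0
print(round(noun_case, 2))   # 47.96
print(round(noun_case))      # 48
```

Passing a different `base` (say 1,000,000) should give a per-million figure in the same way; how many decimal places to report remains the judgment call described above.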

    eric atwell
    On Mon, 12 Sep 2005, Jenny Eagleton wrote:

    > Hello Corpora and Statistics Experts,
    >
    > This is a very simple question for all the corpora/statistics experts
    > out there, but this novice is not really mathematically inclined. I
    > understand Biber's principle of "normalization"; however, I am not sure
    > about how to calculate it. I want frequency counts normalized per
    > 1,000 words of text. I can see how to do it if the figures are even,
    > i.e. if I have a corpus of 4,000 words and a frequency of 200,
    > I would have a normalized figure of 50.
    >
    > But for mixed numbers, how would I calculate the following: For
    > example, if I have 2,646 instances of a certain kind of noun in a
    > corpus of 55,166, how would I calculate the normalized figure per
    > 1,000 words?
    >
    > Regards,
    >
    > Jenny
    > Research Assistant
    > Dept. of English & Communication
    > City University of Hong Kong

    -- 
    Eric Atwell, Senior Lecturer, Language research group, School of Computing,
    Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-2335430  FAX: +44-113-2335468 
    http://www.comp.leeds.ac.uk/eric
    



    This archive was generated by hypermail 2b29 : Mon Sep 12 2005 - 11:12:23 MET DST