Corpora: Number of distinct words

From: Granger Sylviane (granger@lige.ucl.ac.be)
Date: Thu Oct 25 2001 - 09:17:16 MET DST


    Dear list members,

    Could anyone help me answer the following message which I've just received
    from a colleague of mine in the Computer Science Department?

    Many thanks.

    Have a good day!
    Sylviane Granger

    >For about a year and a half, a colleague and I have been writing a textbook
    >on computer programming. I have kept numerous drafts of the book during
    >this period. Today I was curious to see how these drafts evolved. I
    >graphed the number of distinct 'words' (character sequences delimited
    >by noncharacters) as a function of file size. I found that a good fit
    >is given by the square root function:
    >
    > (number of distinct words) = 6 * sqrt(file size)
    >
    >Is this an example of a general law? Consider the extremes: if the text
    >just repeated the same thing over and over, the exponent would be zero.
    >If the text were a long catalogue of facts, the exponent would be one.
    >The exponent is exactly halfway in between. Is it because of the structure of the
    >book (the effort to make it coherent)? I don't know. Any comments or
    >reactions welcome!
    >
    >I know of 'Zipf's Law': word frequency is (supposedly) inversely
    >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
    >Is the square root a consequence of Zipf's Law? Or is there more going
    >on?
    >
    >Peter Van Roy
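
The quoted message does not say how the graph or the fit was produced. The following is only a minimal sketch of one way to reproduce such a measurement, under several assumptions that are not in the original: a "word" is taken to be a run of ASCII letters (case-folded), file size is measured in characters, and the law is assumed to have the form (number of distinct words) = c * (file size)^k, fitted by least squares on the logarithms. The function names `distinct_word_curve` and `fit_power_law` are hypothetical.

```python
import math
import re


def distinct_word_curve(text, step=1000):
    """Return (sizes, counts): distinct words seen within the first N characters.

    Assumption: a word is a run of ASCII letters, compared case-insensitively.
    A checkpoint is recorded every `step` characters, plus one at the end.
    """
    sizes, counts = [], []
    seen = set()
    next_mark = step
    for match in re.finditer(r"[A-Za-z]+", text):
        # Record every checkpoint that this word would run past.
        while match.end() > next_mark:
            sizes.append(next_mark)
            counts.append(len(seen))
            next_mark += step
        seen.add(match.group().lower())
    sizes.append(len(text))
    counts.append(len(seen))
    return sizes, counts


def fit_power_law(sizes, counts):
    """Fit counts ~ c * sizes**k by least squares in log-log space.

    Returns (c, k). Points with a zero size or count are skipped,
    because their logarithm is undefined.
    """
    points = [(math.log(s), math.log(n))
              for s, n in zip(sizes, counts) if s > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two usable points")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    k = sxy / sxx                       # exponent (slope in log-log space)
    c = math.exp(mean_y - k * mean_x)   # multiplicative constant
    return c, k
```

Applied to a draft, `fit_power_law(*distinct_word_curve(text))` returns the pair (c, k). In the terms of the quoted message, a text that repeats itself would be expected to give k near zero, a long catalogue of facts k near one, and the reported fit corresponds to c = 6 and k = 0.5.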

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Professor Sylviane Granger
    Université Catholique de Louvain
    Centre for English Corpus Linguistics
    Collège Erasme
    Place Blaise Pascal 1
    B-1348 Louvain-la-Neuve
    Belgium
    Fax: + 3210474942
    http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html


