Re: Corpora: Number of distinct words

From: Giorgio Parisi (Giorgio.Parisi@roma1.infn.it)
Date: Sun Oct 28 2001 - 12:40:00 MET


    On Thu, 25 Oct 2001, Granger Sylviane wrote:

    > Dear list members,
    >
    > Could anyone help me answer the following message which I've just received
    > from a colleague of mine in the Computer Science Department?
    >
    > Many thanks.
    >
    > Have a good day!
    > Sylviane Granger
    >
    > >For about 1.5 years, a colleague and I have been writing a textbook
    > >on computer programming. I have kept numerous drafts of the book during
    > >this period. Today I was curious to see how these drafts evolved. I
    > >graphed the number of distinct 'words' (character sequences delimited
    > >by noncharacters) as a function of file size. I found that a good fit
    > >is given by the square root function:
    > >
    > > (number of distinct words) = 6 * sqrt(file size)
    > >
    > >Is this an example of a general law? I.e., if the text just repeated
    > >the same thing over and over, the exponent would be zero. If the text
    > >were a long catalogue of facts, the exponent would be one. The exponent is
    > >exactly half way in between. Is it because of the structure of the
    > >book (the effort to make it coherent)? I don't know. Any comments or
    > >reactions welcome!
    > >
    > >I know of 'Zipf's Law' : word frequency is (supposedly) inversely
    > >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
    > >Is the square root a consequence of Zipf's Law? Or is there more going
    > >on?
    > >
    > >Peter Van Roy
    >
    >
    > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    > Professor Sylviane Granger
    > Université Catholique de Louvain
    > Centre for English Corpus Linguistics
    > Collège Erasme
    > Place Blaise Pascal 1
    > B-1348 Louvain-la-Neuve
    > Belgium
    > Fax: + 3210474942
    > http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
    >
    >
    A strict application of Zipf's Law implies that the number of
    distinct words is proportional to the log of the file size.
    My impression is that this is what happens with novels.
    Technical books may behave differently.
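
    One way to see which behaviour a given text follows is to count the
    distinct words at increasing sizes and fit a power law in log-log
    space. The sketch below is only an illustration: the function names,
    the letters-only definition of a "word", and the lower-casing are
    assumptions, not the procedure used for the textbook drafts. A fitted
    exponent near 0.5 would match the square-root fit quoted above; an
    exponent that keeps shrinking as the text grows would point to
    log-like growth.

```python
import math
import re


def vocabulary_growth(text, step):
    """Return (sizes, counts): distinct words seen after every `step` characters.

    Assumption: a word is a run of ASCII letters, compared case-insensitively.
    """
    seen = set()
    sizes, counts = [], []
    next_mark = step
    for match in re.finditer(r"[A-Za-z]+", text):
        # Record the vocabulary size at every mark passed before this word.
        while match.start() >= next_mark:
            sizes.append(next_mark)
            counts.append(len(seen))
            next_mark += step
        seen.add(match.group().lower())
    # Record any marks that fall after the last word, then the full text.
    while next_mark < len(text):
        sizes.append(next_mark)
        counts.append(len(seen))
        next_mark += step
    sizes.append(len(text))
    counts.append(len(seen))
    return sizes, counts


def fit_power_law(sizes, counts):
    """Least-squares fit of counts = k * sizes**exponent on log-log axes.

    Needs at least two points with distinct positive sizes and positive counts.
    """
    points = [(math.log(s), math.log(c))
              for s, c in zip(sizes, counts) if s > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    exponent = sxy / sxx
    k = math.exp(mean_y - exponent * mean_x)
    return k, exponent
```

    For a draft loaded into a string `text`, calling
    `fit_power_law(*vocabulary_growth(text, 10000))` returns the pair
    (k, exponent); the step of 10000 characters is an arbitrary choice.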
    Best regards

    Giorgio
    -------------------------------------------------------------------------
    Dipartimento di Fisica Fax +39-06-4463158
    Universita' di Roma "La Sapienza" giorgio.parisi@roma1.infn.it
    P.le A. Moro 2 Tel +39-06-49913481
    Roma, Italy, I-00185 http://chimera.roma1.infn.it/GIORGIO/giorgio.html
    ------------------------------------------------------------------------


