RE: [Corpora-List] UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Sat Sep 02 2006 - 23:40:09 MET DST


    Just a comment about this kind of resource: wouldn't it be better to make it
    available as a searchable resource, so that people could run the searches
    they wanted and check up on anomalous frequencies? A distributed frequency
    list will inevitably raise many questions for anyone planning to use it
    seriously, and they won't be able to answer them (at least not without
    coming back to you, and their questions won't be your priority).

    Adam

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Cyrus Shaoul
    Sent: 02 September 2006 08:53
    To: corpora@uib.no
    Subject: [Corpora-List] UPDATE: Corrected Word frequencies for a large
    corpus of recent USENET text, and full list of types.

    Hello Again,

    **
    IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED
    VERSION. SEE THE NOTE BELOW.
    **

      A "thank you" to all the folk who downloaded the first version of our
    USENET word list. Some people made requests for a larger list of types,
    not restricted to my original dictionary. I have now finished the list
    of all types with frequency greater than 3 tokens/million tokens. It is
    large (28 MB, compressed), with 5,609,086 types. Unfortunately, most of
    the types in this list are URLs, e-mail addresses and other cruft that
    are artifacts of my overly simplistic text processing (delete
    punctuation and split on whitespace).
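
    The processing described above (delete punctuation, split on whitespace)
    can be sketched as follows. This is a minimal illustration in Python, not
    the program actually used for the list: the function names, the sample
    lines, and the use of ASCII punctuation from string.punctuation are all
    assumptions. No case folding is applied, since the message does not say
    whether any was done.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII set in string.punctuation.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Delete punctuation characters, then split on whitespace."""
    return text.translate(_DELETE_PUNCT).split()


def count_types(lines):
    """Count how often each type (distinct token) occurs in the lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Hypothetical sample input, for illustration only.
sample = [
    "See http://example.com/a.html now.",
    "Mail me: user@example.com, now!",
]
counts = count_types(sample)

# Print the types by decreasing count.
for word, n in counts.most_common():
    print(word, n)
```

    With this approach "http://example.com/a.html" collapses into the single
    type "httpexamplecomahtml", which is the kind of cruft mentioned above.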

    I know this list is not for everyone, but if you are interested in
    seeing a lot of types, please download the file from here, and please
    send me any feedback you have. I sorted the list by decreasing type
    frequency.
     
    http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

    WARNING: File size is 28 MB, compressed

    **
    NOTE: In doing this run, I noticed that my corpus grew in size from 5.9
    to 7.8 billion words, despite the fact that I was using the same raw
    data. I then discovered my bug: my original program did not count
    non-words toward the corpus size. So if you downloaded the original list of 111,627
    words, the corpus size and freq/million numbers are WRONG! The counts
    were correct, though. Please download the corrected list here (914k,
    compressed):

    http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html

    I also sorted this list by decreasing frequency for ease of use.
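
    To see why the counts stay correct while the freq/million figures change,
    here is a small sketch in Python. The count of 59,000 is a made-up
    example, and the corpus sizes are the rounded figures quoted in the note
    (5.9 and 7.8 billion), so the results are only approximate.

```python
def per_million(count, corpus_size):
    """Occurrences of a type per million tokens of the corpus."""
    return count * 1_000_000 / corpus_size


# Rounded corpus sizes taken from the note above.
OLD_CORPUS_SIZE = 5_900_000_000  # non-words left out of the total
NEW_CORPUS_SIZE = 7_800_000_000  # every token included

# Hypothetical raw count for one word; the raw count itself is unaffected.
count = 59_000

old_rate = per_million(count, OLD_CORPUS_SIZE)
new_rate = per_million(count, NEW_CORPUS_SIZE)

print(f"old freq/million: {old_rate:.3f}")
print(f"new freq/million: {new_rate:.3f}")
print(f"scaling factor:   {OLD_CORPUS_SIZE / NEW_CORPUS_SIZE:.3f}")
```

    Under these assumptions every freq/million value in the original list
    was too high, and the corrected value is roughly three quarters of it.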

    Thanks for your understanding,

    Cyrus

    =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
    Cyrus Shaoul
    http://www.psych.ualberta.ca/~westburylab/
    University of Alberta
    780-492-5843
    =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}


