[Corpora-List] UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.

From: Cyrus Shaoul (cyrus.shaoul@ualberta.ca)
Date: Sat Sep 02 2006 - 09:52:51 MET DST


    Hello Again,

    **
    IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED
    VERSION. SEE THE NOTE BELOW.
    **

      A "thank you" to all the folk who downloaded the first version of our
    USENET word list. Some people asked for a larger list of types,
    not restricted to my original dictionary. I have now finished the list
    of all types with frequency greater than 3 tokens/million tokens. It is
    large (28 MB, compressed), with 5,609,086 types. Unfortunately, most of
    the types in this list are URLs, e-mail addresses and other cruft that
    are artifacts of my overly simplistic text processing (delete
    punctuation and split on whitespace).
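
A minimal sketch of the kind of processing described above (delete punctuation, split on whitespace, count, sort by decreasing frequency). This is not the original program: the function name `count_types` and the use of ASCII `string.punctuation` as the punctuation set are assumptions made for illustration. It also shows why URLs and e-mail addresses end up as cruft types.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII set in string.punctuation.
# The post does not say which characters the original program deleted.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def count_types(lines):
    """Count types after deleting punctuation and splitting on whitespace."""
    counts = Counter()
    for line in lines:
        # Delete punctuation first, then split on any run of whitespace.
        counts.update(line.translate(_DELETE_PUNCT).split())
    return counts


sample = [
    "Visit http://www.example.com/a now, now!",
    "mail me: bob@example.com",
]
counts = count_types(sample)

# most_common() yields (type, count) pairs sorted by decreasing count.
for word, n in counts.most_common():
    print(word, n)
```

With this naive scheme a URL such as `http://www.example.com/a` collapses into the single type `httpwwwexamplecoma`, which is the sort of artifact mentioned above.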

    I know this list is not for everyone, but if you are interested in
    seeing a lot of types, please download the file from here, and please
    send me any feedback you have. I sorted the list by decreasing type
    frequency.
     
    http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

    WARNING: File size is 28 MB, compressed

    **
    NOTE: In doing this run, I noticed that my corpus grew in size from 5.9
    to 7.8 billion words, even though I was using the same raw
    data. I then discovered my bug: my original program did not count
    non-words toward the corpus size. So if you downloaded the original
    list of 111,627 words, the corpus size and freq/million numbers are
    WRONG! (The freq/million values were computed against the too-small
    corpus size, so they are too high.) The counts were correct, though.
    Please download the corrected list here (914 KB, compressed):

    http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html

    I also sorted this list by decreasing frequency for ease of use.
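
For anyone who kept the original file, the following sketch shows the arithmetic behind the correction, assuming freq/million is simply count / corpus size * 1,000,000. The function names are hypothetical, and the two corpus sizes are the rounded figures quoted above (5.9 and 7.8 billion), so rescaled values are only approximate; the corrected download is the authoritative source.

```python
# Rounded corpus sizes taken from the note above (assumption: the exact
# token totals are not given in the post, so results are approximate).
OLD_CORPUS_SIZE = 5.9e9  # non-words were left out of this total
NEW_CORPUS_SIZE = 7.8e9  # every token counted


def per_million(count, corpus_size):
    """Frequency per million tokens for a raw count."""
    return count / corpus_size * 1_000_000


def rescale_per_million(old_value, old_size=OLD_CORPUS_SIZE,
                        new_size=NEW_CORPUS_SIZE):
    """Convert a freq/million value computed against old_size to new_size.

    The raw count is unchanged, so only the denominator differs.
    """
    return old_value * old_size / new_size


raw_count = 1_000_000  # hypothetical count for some word
old_value = per_million(raw_count, OLD_CORPUS_SIZE)
new_value = per_million(raw_count, NEW_CORPUS_SIZE)

print(f"old freq/million: {old_value:.2f}")
print(f"new freq/million: {new_value:.2f}")
print(f"rescaled old value: {rescale_per_million(old_value):.2f}")
```

Because the raw counts were correct, recomputing from the counts and rescaling the old freq/million values give the same result under these assumptions.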

    Thanks for your understanding,

    Cyrus

    =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
    Cyrus Shaoul
    http://www.psych.ualberta.ca/~westburylab/
    University of Alberta
    780-492-5843
    =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}


