[Corpora-List] 8 billion word English USENET corpus available for download. [Beta Version]

From: Cyrus Shaoul (cyrus.shaoul@ualberta.ca)
Date: Thu Jan 25 2007 - 23:29:03 MET

    Fellow list members,

    After getting some feedback from CORPORA-folk, I have been able to
    work out a way to distribute a BETA VERSION of my USENET corpus over
    the Internet to anyone who needs it. It is now available under a
    Creative Commons license at:

        
    http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

    (It is currently around 11 GB in size (compressed), split into smaller
    files for downloading.)

    This corpus should be continuously available, so if there are any
    researchers out there who have been looking for a freely available
    corpus of USENET postings to collaborate on, enjoy.

    The corpus contains a large selection of newsgroups and a very low
    percentage of non-English data. It covers the period from Oct 2005 to
    last month. I will try to keep adding new data to it every month, so
    keep coming back if you would like updates.

    Due to a network usage policy at my institution, I had to restrict the
    download service to people who use computers that are on academic
    networks. I wish I could remove this restriction, but unfortunately it
    is a policy that I cannot do anything about, so if you are denied
    access due to your network type, please don't ask me to make an
    exception. I can't!

    If you are looking for the orthographic frequencies for the most common
    tokens in the corpus, they are still available (to all) at:

        
    http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

    Yours,

    Cyrus

    -- 
    =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
    Cyrus Shaoul
    http://www.ualberta.ca/~cshaoul/
    =[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
    




    This archive was generated by hypermail 2b29 : Fri Jan 26 2007 - 00:01:03 MET