Re: [Corpora-List] Word frequencies for a large corpus of recent USENET text

From: Ramesh Krishnamurthy (r.krishnamurthy@aston.ac.uk)
Date: Thu Aug 31 2006 - 21:14:33 MET DST

    Hi Cyrus

    a) Is the list in any particular order?

    >Number of words: 5894564637
    >WORD COUNT FREQPERMILLION
    >BESTING 712 0.120789242946086
    >PRACTICABLY 98 0.0166254856863995
    >BANTERERS 2 0.00033929562625305
    >RECLOTHE 89 0.0150986553682607
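
    As a quick sanity check, the FREQPERMILLION column in the excerpt
    above looks like COUNT divided by the stated number of words, scaled
    to one million. The sketch below assumes that definition (the
    announcement does not spell it out); the function and variable names
    are only illustrative.

```python
# Total corpus size as given in the list header ("Number of words").
CORPUS_TOKENS = 5_894_564_637

# (count, listed frequency per million) copied from the quoted excerpt.
ROWS = {
    "BESTING": (712, 0.120789242946086),
    "PRACTICABLY": (98, 0.0166254856863995),
    "BANTERERS": (2, 0.00033929562625305),
    "RECLOTHE": (89, 0.0150986553682607),
}


def freq_per_million(count, total=CORPUS_TOKENS):
    """Assumed definition: occurrences per one million corpus tokens."""
    return count / total * 1_000_000


for word, (count, listed) in ROWS.items():
    computed = freq_per_million(count)
    # Report the listed value next to the recomputed one.
    print(f"{word:12s} listed={listed:.12f} computed={computed:.12f}")
```

    If the recomputed values agree with the listed ones, the third
    column adds no information beyond COUNT and the corpus size.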

    b) Why are some items given a score of 0?

    >CYCLIZES 0 0
    >PROCEEDERS 0 0
    >DATEDLY 0 0
    >TUTOYERED 0 0

    c) Does this mean that the list is not a corpus frequency list as
    such, but a pre-existing wordlist with corpus frequencies attached?

    d) If so, where did the original list come from? Is it a list used
    for psycholinguistic recognition of 'real words' and 'pseudo-words',
    or something like that?

    e) You mention 111,627 English words; this is another indication
    that the list is neither the entire corpus frequency list nor the
    'most frequent 111,627 types in the corpus' (as some items have a
    frequency of 0).

    f) If the corpus size is 5,894,564,637 tokens, the entire list
    cannot contain only 111,627 types. The Bank of English corpus
    contained 120,362,928 tokens and 475,633 types in 1993; in 2000, it
    contained 418,449,873 tokens and 938,914 types. So a corpus of
    5,894,564,637 tokens must surely contain a much larger number of
    types?
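
    To put a rough number on point (f): the sketch below fits a simple
    power law (Heaps' law, types = K * tokens^beta) to the two Bank of
    English data points and extrapolates to the USENET corpus size.
    Heaps' law is my assumption here, not something claimed in the
    announcement, and the two corpora differ in genre and tokenisation,
    so the result is at best an order-of-magnitude guide.

```python
import math

# Bank of English figures quoted above: (tokens, types).
BOE_1993 = (120_362_928, 475_633)
BOE_2000 = (418_449_873, 938_914)

# Figures from the USENET announcement.
USENET_TOKENS = 5_894_564_637
LISTED_TYPES = 111_627


def heaps_exponent(point_a, point_b):
    """Exponent beta of types = K * tokens**beta through two points."""
    (n1, v1), (n2, v2) = point_a, point_b
    return math.log(v2 / v1) / math.log(n2 / n1)


def extrapolate_types(tokens, reference, beta):
    """Scale the reference point's type count to a new token count."""
    n_ref, v_ref = reference
    return v_ref * (tokens / n_ref) ** beta


beta = heaps_exponent(BOE_1993, BOE_2000)
estimate = extrapolate_types(USENET_TOKENS, BOE_2000, beta)

print(f"fitted exponent beta: {beta:.3f}")
print(f"extrapolated types:   {estimate:,.0f}")
print(f"listed types:         {LISTED_TYPES:,}")
```

    Under these assumptions the extrapolated type count comes out in
    the millions, well above the 111,627 items in the list.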

    Best
    Ramesh

    At 17:46 31/08/2006, you wrote:
    >Hi All,
    >I thought that this might be of interest to the list. I have also
    >experimented with using a CC Attribution-NonCommercial-NoDerivs
    >license for this word frequency list. Please tell me if you think
    >this is a good or a bad idea.
    >
    >Thanks,
    >Cyrus
    >
    >
    >*******
    >Announcement: Word frequencies for a large corpus of USENET text released.
    >*******
    >The Westbury Lab at the University of Alberta does research on lexical
    >semantics and other areas of psycholinguistics. Recently, as part of a
    >research program investigating high-dimensional models of semantic
    >memory, they collected 5,894,564,637 words from 47,860 English
    >language, non-binary-file newsgroups from the
    >USENET between October 2005 and August 2006. This list of
    >orthographic frequencies for 111,627 English words will be
    >of use to anyone who has used older lists based on corpora from decades
    >past.
    >The list is available for download (3.3 MB file) under a Creative
    >Commons 2.5 license at:
    > http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html
    >
    >
    >=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
    >Cyrus Shaoul
    >http://www.psych.ualberta.ca/~westburylab/
    >University of Alberta
    >780-492-5843
    >=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
    >
    >
    >
    >

    Ramesh Krishnamurthy

    Lecturer in English Studies, School of Languages and Social Sciences,
    Aston University, Birmingham B4 7ET, UK
    [Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ;
    Fax: +44 (0)121-204-3766
    http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

    Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/



    This archive was generated by hypermail 2b29 : Thu Aug 31 2006 - 21:40:09 MET DST