Re: [Corpora-List] Word frequencies for a large corpus of recent USENET text

From: Ramesh Krishnamurthy (r.krishnamurthy@aston.ac.uk)
Date: Thu Aug 31 2006 - 21:14:33 MET DST

Next message: Linguistic Data Consortium: "[Corpora-List] New from the LDC"

Previous message: Cyrus Shaoul: "[Corpora-List] Word frequencies for a large corpus of recent USENET text"
In reply to: Cyrus Shaoul: "[Corpora-List] Word frequencies for a large corpus of recent USENET text"
Next in thread: Cyrus Shaoul: "Re: [Corpora-List] Word frequencies for a large corpus of recent USENET text"
Reply: Cyrus Shaoul: "Re: [Corpora-List] Word frequencies for a large corpus of recent USENET text"

Hi Cyrus

a) Is the list in any particular order?

>Number of words: 5894564637
>WORD COUNT FREQPERMILLION
>BESTING 712 0.120789242946086
>PRACTICABLY 98 0.0166254856863995
>BANTERERS 2 0.00033929562625305
>RECLOTHE 89 0.0150986553682607
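
For anyone wanting to check how the third column relates to the second, here is a minimal sketch. It assumes FREQPERMILLION is simply COUNT divided by the stated number of words, times one million; that reading is an assumption, not something the announcement states. The helper name freq_per_million is hypothetical, and the rows are copied from the sample above.

```python
# Corpus size as given in the quoted header line "Number of words".
TOTAL_TOKENS = 5_894_564_637

# (WORD, COUNT, FREQPERMILLION) rows copied from the quoted sample.
rows = [
    ("BESTING", 712, 0.120789242946086),
    ("PRACTICABLY", 98, 0.0166254856863995),
    ("BANTERERS", 2, 0.00033929562625305),
    ("RECLOTHE", 89, 0.0150986553682607),
]


def freq_per_million(count, total=TOTAL_TOKENS):
    """Assumed formula: occurrences per million running words."""
    return count / total * 1_000_000


for word, count, listed in rows:
    computed = freq_per_million(count)
    # Show the listed and recomputed values side by side.
    print(f"{word:12s} {count:5d} listed={listed:.6g} computed={computed:.6g}")
```

Under that assumption a COUNT of 0 necessarily gives a FREQPERMILLION of 0, which bears on question b) below.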

b) Why are some items given a score of 0?

>CYCLIZES 0 0

>PROCEEDERS 0 0

>DATEDLY 0 0
>TUTOYERED 0 0

c) Does this mean that this is not a corpus frequency list, but a
pre-existing wordlist with corpus frequencies attached?

d) If so, where did the original list come from? Is it a list used
for psycholinguistic recognition of 'real words' and 'pseudo-words'
or something like that?

e) You mention 111,627 English words; another indication that this is
not the entire corpus frequency list, nor the 'most frequent 111,627
types in the corpus' (as some have a frequency of 0).

f) If the corpus size is 5,894,564,637 tokens, the entire list cannot
contain only 111,627 types. The Bank of English corpus in 1993
contained 120,362,928 tokens and 475,633 types; in 2000, it contained
418,449,873 tokens and 938,914 types. So surely a corpus of
5,894,564,637 tokens must contain a much larger number of types?
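
To put a rough number on point f), here is a sketch that extrapolates from the two Bank of English figures. It assumes vocabulary growth follows Heaps' law, V = K * N^beta; that model is not mentioned anywhere in this thread and is brought in purely for illustration. The function names are hypothetical. With only two data points the fit is exact by construction, and USENET text is tokenised and distributed differently from the Bank of English, so the result is an order-of-magnitude guess at best.

```python
import math

# (tokens, types) pairs quoted above for the Bank of English.
boe_1993 = (120_362_928, 475_633)
boe_2000 = (418_449_873, 938_914)

USENET_TOKENS = 5_894_564_637  # corpus size from the announcement
LISTED_TYPES = 111_627         # number of words in the released list


def fit_heaps(p1, p2):
    """Solve V = K * N**beta exactly through two (tokens, types) points."""
    (n1, v1), (n2, v2) = p1, p2
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta


def predict_types(tokens, k, beta):
    """Extrapolated number of types for a corpus of the given size."""
    return k * tokens ** beta


k, beta = fit_heaps(boe_1993, boe_2000)
estimate = predict_types(USENET_TOKENS, k, beta)

print(f"beta = {beta:.3f}")
print(f"extrapolated types = {estimate:,.0f}")
print(f"ratio to listed types = {estimate / LISTED_TYPES:.1f}")
```

Under these assumptions the extrapolated figure comes out in the millions, far above 111,627, which is consistent with the list being a pre-selected wordlist rather than the full set of types in the corpus.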

Best
Ramesh

At 17:46 31/08/2006, you wrote:
>Hi All,
>I thought that this might be of interest to the list. I have also
>experimented with using a CC Attribution-NonCommercial-NoDerivs
>license for this word frequency list. Please tell me if you think
>this is a good or a bad idea.
>
>Thanks,
>Cyrus
>
>
>*******
>Announcement: Word frequencies for a large corpus of USENET text released.
>*******
>The Westbury Lab at the University of Alberta does research on lexical
>semantics and other areas of psycholinguistics. Recently, as part of a
>research program investigating high-dimensional models of semantic
>memory, they collected 5,894,564,637 words from 47,860 English
>language, non-binary-file newsgroups from the
>USENET between October 2005 and August 2006. This list of
>orthographic frequencies for 111,627 English words will be
>of use to anyone who has used older lists based on corpora from decades
>past.
>The list is available for download (3.3 MB file) under a Creative
>Commons 2.5 license at:
> http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html
>
>
>=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
>Cyrus Shaoul
>http://www.psych.ualberta.ca/~westburylab/
>University of Alberta
>780-492-5843
>=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}

Ramesh Krishnamurthy

Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building]
Tel: +44 (0)121-204-3812; Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/


This archive was generated by hypermail 2b29 : Thu Aug 31 2006 - 21:40:09 MET DST