Fellow list members,
After getting some feedback from CORPORA-folk, I have been able to work
out a way to distribute
a BETA VERSION of my USENET corpus to anyone who needs it over the
Internet. It is
now available under a Creative Commons license at:
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
(It is currently around 11 GB compressed, split into smaller
files for downloading.)
This corpus should be continuously available, so if there are any
researchers out there who have been looking
for a freely available corpus of USENET postings to collaborate on, enjoy.
The corpus contains a large selection of newsgroups, and a very low
percentage of non-English data. It covers the
period from Oct 2005 to last month. I will try to keep on adding new
data to it every month,
so keep coming back if you would like updates.
Due to a network usage policy at my institution, I had to restrict the
download service to people
who use computers that are on academic networks.
I wish I could remove this restriction, but unfortunately it is a policy
that I cannot do anything about, so if
you are denied access due to your network type, please don't ask me to
make an exception. I can't!
If you are looking for the orthographic frequencies for the most common
tokens in the corpus, they are still available (to all) at:
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html
Yours,
Cyrus
--
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.ualberta.ca/~cshaoul/
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
This archive was generated by hypermail 2b29 : Fri Jan 26 2007 - 00:01:03 MET