corpora & the web

John Milton (lcjohn@uxmail.ust.hk)
Tue, 16 Jul 1996 21:43:07 +0800 (HKT)

A couple of people requested that I post any responses to my query
about web searching. Ken Church sent me this and has given permission to
repost:

I think there is quite a bit of stuff out there that you might find
useful. I'd start with the ECI CD-ROM, if you don't already have it.
Here is something I'm putting together to help.

Text is available like never before. Data collection efforts such as:

the Association for Computational Linguistics' Data Collection Initiative
(ACL/DCI),
the British National Corpus (BNC),
the Consortium for Lexical Research (CLR),
the European Corpus Initiative (ECI),
Electronic Dictionary Research (EDR),
ICAME,
the Linguistic Data Consortium (LDC),

and many others have done a wonderful job in acquiring and distributing
dictionaries and corpora.

For more information on the ACL/DCI and the LDC, see
http://www.cis.upenn.edu/~ldc.
The CLR's web page is: http://clr.nmsu.edu/clr/CLR.html,
and EDR's web page is: http://www.iijnet.or.jp/edr.
Information on the ECI can be found at
http://www.cogsci.ed.ac.uk/elsnet/eci_summary.html,
or by sending email to eucorp@cogsci.edinburgh.ac.uk.
Information on the BNC can be found at http://info.ox.ac.uk/bnc,
or by sending email to smbowie@vax.oxford.ac.uk.
Information on the London-Lund Corpus and other corpora available through
ICAME can be found in the ICAME Journal, edited by Stig Johansson,
Department of English, University of Oslo, Norway.

In addition, there are vast quantities of so-called Information Super
Highway Roadkill: email, bboards, faxes. We now have access to billions
and billions of words, and even more pixels.

Actually, the web isn't as large as you might think. I believe that
Lexis-Nexis still has quite a bit more than the entire web, though
obviously they are very different kinds of texts. My guess is that
Lexis-Nexis has more than a terabyte and the web is still only a few
hundred gigabytes, though the web might be growing faster than
Lexis-Nexis. Lexis-Nexis adds "only" a Brown Corpus an hour -- a
million words per hour. I have no idea how fast the web is growing,
but I could easily believe that it is much faster than that.
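
As a rough check on those figures, the sketch below turns "a million
words per hour" into a yearly total. The 6 bytes per word figure (about
five letters plus a space) is an assumption, not a number from
Lexis-Nexis, and the function names are only for illustration.

```python
# Back-of-envelope growth estimate from the figures quoted above.
WORDS_PER_HOUR = 1_000_000   # "a Brown Corpus an hour" (from the post)
BYTES_PER_WORD = 6           # ASSUMPTION: ~5 letters plus a space
HOURS_PER_YEAR = 24 * 365    # ignores leap years


def words_per_year(words_per_hour=WORDS_PER_HOUR):
    """Words added in one year at a constant hourly rate."""
    return words_per_hour * HOURS_PER_YEAR


def gigabytes_per_year(words_per_hour=WORDS_PER_HOUR,
                       bytes_per_word=BYTES_PER_WORD):
    """Approximate yearly growth in gigabytes (1 GB = 10**9 bytes)."""
    return words_per_year(words_per_hour) * bytes_per_word / 1e9


if __name__ == "__main__":
    print(words_per_year())       # words added per year
    print(gigabytes_per_year())   # approximate gigabytes added per year
```

Under that assumption, a Brown Corpus an hour comes to about 8.76
billion words a year, or roughly 50 gigabytes, which is small next to
an archive of more than a terabyte.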