Corpora: Questions re Databases

Steve-Henser (Steve-Henser@email.msn.com)
Mon, 22 Nov 1999 13:02:31 -0000

Dear Colleagues,

I am a University of London external (=working largely on my own) Ph.D.
candidate working in the field of psycholinguistics. I have three questions
I would like to pose to members of this newsgroup. All three concern
the use of databases. Questions 1 & 2 pertain to the use of the MRCPD
Getentry database; question 3 is a request for information re Japanese
databases.

QUESTIONS:

1. I have successfully downloaded all the files in
ftp://ota.ox.ac.uk/pub/ota/public/dicts/ on to my hard drive. I have two
questions related to interpreting the output of this database. First of all
(I know that this is a really basic question, but please bear with me), I am
using the Getentry files to perform item-driven searches for word
frequencies, and have run into a few problems interpreting the database's
search output (your explanatory notes notwithstanding). I am hoping you can
set me back on track. I understand that the output for a given item is in
the format: 040320021615167000000093057530228435500000 JJ
SABLE|eI/bl|eIbl|20. Comparing this sample with the column number key in
Table 1, I note that columns 6-10 give the Kucera and Francis written
frequency for the search item; hence, in the example given here, "sable" has
a K-F frequency of 216. Is this the correct way of interpreting the output?
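
The column reading described in question 1 can be checked mechanically. The
following is only a minimal sketch in Python: it assumes, as stated above,
that columns 6-10 (counting from 1) hold the Kucera and Francis written
frequency. The function name kf_frequency is hypothetical and is not part of
Getentry.

```python
def kf_frequency(record: str) -> int:
    """Return the number found in columns 6-10 of a fixed-width record.

    Assumption (taken from the post, not verified against the manual):
    columns 6-10, counted from 1, hold the K-F written frequency.
    """
    field = record[5:10]  # 1-indexed columns 6..10 -> 0-indexed slice 5:10
    if len(field) != 5 or not field.isdigit():
        raise ValueError(f"columns 6-10 are not a 5-digit number: {field!r}")
    return int(field)


# Sample record quoted in the post, with its two lines joined by a space.
sable = "040320021615167000000093057530228435500000 JJ SABLE|eI/bl|eIbl|20"

print(kf_frequency(sable))  # 216
```

On this reading the sample record does yield 216, which matches the
interpretation given in the question.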

2. I fed the following items into the Getentry search:

people with consciences suffer for their mistakes jumping from an aeroplane
with no parachute is safe a shopping spree gives many people an emotional
lift it's hard to relax with a deadline hanging over your head cynic is
always looking on the bright side of things freedom of speech is encouraged
in a dictatorship little forethought can prevent many accidents Hitler was
totally without racial prejudice think more rationally when they are afraid
those with difficulty sleeping sometimes count sheep

The search ignored all the words beginning with the letters "A" through to
"H," starting its output with "IN." A comparatively common word such as
"jumping" returned a K-F frequency of only 9, if I am reading the output
(0700200009040080000000000000000000000000000QV JUMPING||'dZVmpIN Q|)
correctly. Can you explain why not all items were returned, and whether or
not I am interpreting the data correctly?
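
One way to pin down exactly which items were dropped is to compare the list
of query words against the headwords in the returned records. The sketch
below (Python) is a diagnostic aid only and does not explain the cause. It
assumes, judging from the two sample records quoted in this message, that
the headword is the last whitespace-separated token before the first "|".
The function names are hypothetical.

```python
def headword(record: str) -> str:
    """Extract the headword from one output record.

    Assumption (inferred from the two samples in the post): the headword
    is the last whitespace-separated token before the first '|'.
    """
    return record.split("|", 1)[0].split()[-1].upper()


def missing_items(query_words, records):
    """Return the query words (upper-cased, sorted) that have no record."""
    found = {headword(r) for r in records if "|" in r}
    return sorted({w.upper() for w in query_words} - found)


# The one record quoted in the post; a real run would read the whole output.
records = ["0700200009040080000000000000000000000000000QV JUMPING||'dZVmpIN Q|"]
query = ["aeroplane", "from", "jumping"]

print(missing_items(query, records))  # ['AEROPLANE', 'FROM']
```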

3. I am doing psycholinguistic research using Japanese and English. I am
trying to locate a suitable database for Japanese word frequencies. NTT
Communication Science Laboratories have just put out what appears to be the
kind of thing that I'm looking for: their "Nihongo no Goi Tokusei" database
(Authors: Shigeaki Amano and Tadahisa Kondo), but it is very expensive - it
consists of 11 manuals + 9 CD-ROMs at a total cost of 230,000 Yen, a price
way beyond my pocket as a self-funded researcher.

Does anyone know of a database that I can use cheaply (preferably for
free!), perhaps an interactive database, or one that can be easily
downloaded from the Internet? I am looking for a database that I can perform
ITEM-driven (as opposed to feature/parameter-driven) searches with. That is
to say, a database that will allow me to perform word-frequency searches for
specific words - i.e. where I can feed a word in and ask the programme what
its frequency is as a single lexical item, what its frequency is when
computed along with its related forms (e.g. beautiful, beauty, beauteous) etc.
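
To make the distinction concrete, here is a minimal Python sketch of the two
item-driven queries meant here: the frequency of a single form, and the
combined frequency of a set of related forms. Everything in it is an
assumption made for illustration. The function names are hypothetical, the
related forms must be supplied by hand, and the counts in the toy table are
invented placeholders, not figures from any real corpus.

```python
from typing import Dict, Iterable


def item_frequency(freqs: Dict[str, int], word: str) -> int:
    """Frequency of one lexical item; 0 if the word is not in the table."""
    return freqs.get(word.lower(), 0)


def family_frequency(freqs: Dict[str, int], forms: Iterable[str]) -> int:
    """Summed frequency of a word and its related forms (each counted once)."""
    unique_forms = {form.lower() for form in forms}
    return sum(item_frequency(freqs, form) for form in unique_forms)


# Invented placeholder counts, for illustration only.
toy_table = {"beautiful": 10, "beauty": 5, "beauteous": 1}

print(item_frequency(toy_table, "beauty"))  # 5
print(family_frequency(toy_table, ["beautiful", "beauty", "beauteous"]))  # 16
```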

I tried writing to the people at CELEX (the psycholinguistic database
project at the Max Planck Institute in Nijmegen in the Netherlands). They
suggested having a look at the EDR corpus project, based on the Mainichi
and Nikkei newspaper corpora or the Real World Computing project, which has
resulted in an annotated Mainichi Shimbun paper corpus
(http://cl.aist-nara.ac.jp/lab/resource/resource.html and
http://cl.aist-nara.ac.jp/lab/resource/cdrom/Nikkei/NKS.html as well as
http://www.rwcp.or.jp/ on the RWCP), but all I was able to find was an
advertisement for a CD-ROM set that was almost, but not quite, as expensive
as the NTT set. Can anyone come up with any other suggestions?

Yours With Thanks,
Steve Henser

P.S. Is there anywhere that I can download a version of Kucera, H. and
Francis, W.N. (1967). Computational Analysis of Present-Day American English.
Providence: Brown University Press?

Steve Henser,
174, Pennant Road,
Llanelli,
Carms. SA14 8HN
United Kingdom
Tel: 44-1554-753428
e-mail: Steve-Henser@msn.com