Re: Corpora: representativeness

James L. Fidelholtz (jfidel@siu.buap.mx)
Mon, 1 Mar 1999 09:14:53 -0600 (CST)

On Sun, 28 Feb 1999, Shlomo Izre'el wrote:

>..., we are now looking for information on how people
>who have already worked on corpus compilation solved the issue of
>representativeness in register recording.
[snip]
>Any suggestions where we can go and look (electronic sites or
>papers/books)?
>We'd appreciate any suggestions, references and advice in this respect.

Shlomo:
I am starting to compile a corpus of Spanish. I have not
published much on it so far, but I have thought about your problem and
have a couple of suggestions:
UNESCO publishes world data on the types of publications issued
in different languages (as I recall, books are even broken down by
type). You can then make some assumptions about the relative number of
people who read, say, each newspaper compared with the number who read
each book, and weight the statistics accordingly. After that you will
need either research or assumptions about the relative proportion of
conversation one is exposed to versus printed material (everything on
average, of course). As you are probably aware, though, transcribing
speech is at least a couple of orders of magnitude more time-consuming
than obtaining electronic print, so the spoken part of the corpus will
likely be smaller than strict proportionality would call for. Hope this
helps.
Jim
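
To make the weighting above concrete, here is a minimal sketch in
Python. Every number in it is an invented placeholder, not a UNESCO
figure, and the names (written_shares, corpus_targets, spoken_share)
are hypothetical. It assumes that exposure to a text type is roughly
the number of titles times the readers per title, which is only one
of several assumptions one could make.

```python
def written_shares(titles, readers_per_title):
    """Share of written exposure per text type.

    Assumption: exposure ~ number of titles * readers per title.
    """
    exposure = {k: titles[k] * readers_per_title[k] for k in titles}
    total = sum(exposure.values())
    return {k: v / total for k, v in exposure.items()}


def corpus_targets(titles, readers_per_title, spoken_share, corpus_words):
    """Split a word budget between conversation and written text types.

    spoken_share is the assumed fraction of language exposure that is
    conversation; the remainder is divided among the written types.
    """
    shares = written_shares(titles, readers_per_title)
    written_words = corpus_words * (1 - spoken_share)
    targets = {k: round(written_words * s) for k, s in shares.items()}
    targets["conversation"] = round(corpus_words * spoken_share)
    return targets


# Placeholder inputs, invented purely for illustration.
titles = {"newspapers": 100, "books": 5000}
readers_per_title = {"newspapers": 30000, "books": 200}

targets = corpus_targets(
    titles, readers_per_title, spoken_share=0.5, corpus_words=1_000_000
)
for text_type, words in targets.items():
    print(f"{text_type:>12}: {words:>9,} words")
```

In practice the conversation target is the one least likely to be met,
given the cost of transcription, so it may be more realistic to fix the
amount of speech one can afford and scale the written types around it.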

James L. Fidelholtz e-mail: jfidel@siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO