Corpora: representativeness and the corpus of spoken Israeli Hebrew

Shlomo Izre'el (sizreel@emory.edu)
Sat, 18 Dec 1999 08:44:38 +0200

Dear list members,

At the end of February 1999 I issued - on behalf of the team planning
the compilation of a corpus of spoken Hebrew - a request with regard to
corpus representativeness. We have received a few responses, which I
will cite below. We have also been asked to publish a summary of the
responses.

I apologize for the long delay in publishing this note, but, as you will
understand from what is said here, it took us some time to make
progress on this matter.

We have learned a lot during this time. We surveyed much of the
literature, many existing web sites and other sources. We hope that we
are now better prepared to address the (extremely difficult) issue of
representativeness in corpus compilation. There is now a web site for
our corpus in planning, and I would invite anyone interested to take
a look at it at
<http://spinoza.tau.ac.il/hci/dep/semitic/csih.html>
The reason we have put up this site so early is our wish to get
feedback even before we start our pilot. All comments are welcome. We
would very much appreciate your review of our work.

Let me also cite the responses we received to our query:

Stefan Thomas Gries <StThGries@t-online.de> referred us to the three
basic books on corpus linguistics:
"You will find a lot of comments on the issue of
representativeness and compiling a corpus as well as lists
for the most well-known corpora in these books, most notably
the second one by Kennedy:
Biber, Douglas, Susan Conrad and Randi Reppen. Corpus
Linguistics: [*]. Cambridge: Cambridge University Press,
1998.
Kennedy, Graeme. An Introduction to Corpus Linguistics.
London and New York: Longman, 1998 (!)
McEnery, Tony and Andrew Wilson. Corpus Linguistics.
Edinburgh: Edinburgh University Press, 1996."

These are very useful, and I should perhaps add that we have learned a
lot from Douglas Biber's work, of which the most recent is his 1995
_Dimensions of Linguistic Variation: A Cross-Linguistic Comparison._
Cambridge: Cambridge University Press.

James L. Fidelholtz <jfidel@siu.buap.mx> is himself starting to compile
a corpus of Spanish. He says:
"Not much published on it so far, but I have thought about your problem,
and have a couple of suggestions:
UNESCO publishes world data on types of publications in
different languages (as I recall, even by type w.r.t. books). You can
then make some assumptions about the relative number of people that
read, say, each newspaper, compared to the number that read each book,
and jigger the statistics accordingly. Then you'll need either research
or assumptions about the relative proportion of conversation one is
exposed to versus printed information (everything on average, of
course), although, as you are probably aware, doing transcripts of
speech is a couple of orders of magnitude [at least] more difficult (i.e.
time-consuming) than getting electronic print, so getting transcripts of
speech will likely not be as proportional as you would like."

Khalid CHOUKRI <choukri@elda.fr> referred us to the work done within a
set of European Union-funded projects: SpeechDat at the University of
Munich and Babel at the University of Reading.

Thanks to all those who have shown interest, and especially to those who
have tried to help.

With best wishes for the holiday season and a happy new year to all,

Shlomo Izre'el, on behalf of the team of The Corpus of Spoken Israeli
Hebrew:
Benjamin Hary (Emory University)
John Du Bois (UC Santa Barbara)
Mira Ariel (Tel Aviv University)
Eliezer Ben-Rafael (Tel Aviv University)
Yaakov Bentolila (Ben Gurion University)
Giora Rahav (Tel Aviv University)
Otto Jastrow (Universität Erlangen-Nürnberg)
Shmuel Bolozky (UMass, Amherst)
Geoffrey Khan (Cambridge University)

______________________________________________________________________
Shlomo Izre'el, Ph.D.
Professor of Semitic Linguistics
Department of Hebrew and Semitic Languages
Tel Aviv University            Home address:
IL-69978 Tel Aviv              Simtat Neve-Tsedek 7
Israel                         IL-65154 Tel Aviv
Tel. +972-3-640 5017           Israel
Fax. +972-3-640 7031           Tel. +972-3-517 5341
     +972-3-640 9457           Fax. +972-3-510 1867
izreel@post.tau.ac.il
http://spinoza.tau.ac.il/hci/dep/semitic/izreel.html
The Corpus of Spoken Israeli Hebrew:
http://spinoza.tau.ac.il/hci/dep/semitic/csih.html
