Re: Corpora: Relative text length

From: Yorick Wilks (yorick@dcs.shef.ac.uk)
Date: Thu Apr 25 2002 - 17:56:15 MET DST


    Isn't there some (minor) confusion here? If the question really is relative TEXT
    length, then nothing to do with word counts will settle it; what matters is
    character counts, since word length varies considerably between languages. The
    table showed 1984 in Estonian as having far fewer word tokens than the English
    original, but I'd bet they're much longer ones. How about the texts then?
    I have no parallel texts with English and E. European languages, but I do with
    the four major W. European ones, and the English pages are shorter in every case.
    Yorick Wilks
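
    The distinction between word-token counts and character counts can be shown with a minimal sketch. The function names and the toy sentence pair (English and Finnish, neither taken from the corpora discussed in this thread) are illustrative assumptions only; real comparisons would use full parallel texts.

```python
def length_stats(text):
    """Return (word_tokens, characters) for a text, splitting on whitespace."""
    words = text.split()
    # Count non-whitespace characters so line wrapping does not affect the total.
    chars = sum(len(w) for w in words)
    return len(words), chars


def relative_length(text, reference):
    """Express the length of `text` relative to `reference` (= 100), two ways."""
    w, c = length_stats(text)
    ref_w, ref_c = length_stats(reference)
    return {"words": 100.0 * w / ref_w, "chars": 100.0 * c / ref_c}


# Toy pair (assumption, for illustration): same meaning, half the word tokens,
# yet the same number of characters.
english = "the cat sat on the mat"
finnish = "kissa istui matolla"
print(relative_length(finnish, english))
```

    On this toy pair the word-based figure is half the character-based one, which is the kind of divergence the message above warns about.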

    "James L. Fidelholtz" wrote:

    > Andrew and Spela:
    > Just a word of caution: studies like Spela's provide interesting
    > and suggestive data, but figures will surely vary, depending on the
    > translator, topic, etc. [all the usual sociolinguistic caveats apply
    > here] (and note Jean's contribution, with varying rates). I was
    > coauthor of a study comparing English and Spanish, which basically tried
    > to get Spanish to fit into the standard readability curves in a fairly
    > simple way. We were only partially successful (the counts were
    > hand-done by yours truly, featuring a variety of types of text,
    > pseudo-randomly sampled, and especially translations from one
    > language to the other, as well as translations from 3rd languages
    > [French & German] into each). To the best of my recollection (I could
    > look up the exact figures if anyone is hot for them), our results for
    > Spanish-English were rather close to Jean's for French (I assume his
    > were on large amounts of text done by computer--if this holds up [not
    > surprising, given the close relationship of French and Spanish], it may
    > indicate that, for this kind of data, not such a huge amount of text is
    > really necessary).
    >
    > On Wed, 24 Apr 2002, spela vintar wrote:
    >
    > >
    > >Hi Andrew,
    > >
    > >for Eastern-European languages you can compare the lengths of Orwell's 1984
    > >and its translations that were collected within the Multext-East project.
    > >The original Multext project (http://www.lpl.univ-aix.fr/projects/multext/)
    > >should provide the same for English, German, French, Spanish etc., however I
    > >wasn't able to find it on their homepage at first glance...
    > >
    > >Best,
    > >Spela
    > >
    > >http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
    > >//////////////
    > >...
    > >Below we give an estimate for the number of words, by language. The
    > >wordcounts were produced by removing the SGML tags from the texts and then
    > >using a 'wc'-like procedure.
    > >
    > > Language    Words
    > > English     104.302
    > > Romanian    101.460
    > > Slovene      91.619
    > > Bulgarian    87.235
    > > Czech        80.366
    > > Hungarian    81.147
    > > Estonian     79.334
    > >
    > >
    > >Andrew Bredenkamp wrote:
    > >
    > >> Hello everyone,
    > >>
    > >> Does anyone know where I can find a list of relative text length?
    > >>
    > >> Taking one language as an index (100), I would like a list of the (other)
    > >> main European languages - e.g. (made up):
    > >>
    > >> Spanish: 100
    > >> English: 105
    > >> French: 110
    > >> German: 85
    > >>
    > >> ... etc.
    > >>
    > >> Thanks a lot in advance for any help you can give me.
    > >>
    > >> Cheers,
    > >> Andrew
    > >> =========================================
    > >> Andrew Bredenkamp
    > >> acrolinx GmbH
    > >> URL: www.acrolinx.com
    > >>
    > >> =========================================
    > >
    > >
    > >
    >
    > --
    > James L. Fidelholtz e-mail: jfidel@siu.buap.mx
    > Posgrado en Ciencias del Lenguaje tel.: +(52-2)229-5500 x5705
    > Instituto de Ciencias Sociales y Humanidades fax: +(01-2) 229-5681
    > Benemérita Universidad Autónoma de Puebla, MÉXICO
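
    For readers who want the quoted word counts in the indexed form originally asked for (one language = 100), here is a rough sketch. It assumes that figures such as "104.302" use the dot as a thousands separator (i.e. 104,302 words), and the tag-stripping regex is only a crude stand-in for the 'wc'-like procedure mentioned in the quoted documentation, not the Multext-East project's actual tooling. All function names are hypothetical.

```python
import re

# Word counts as quoted in the thread; "104.302" is read as 104,302
# (assumption: the dot is a thousands separator).
WORD_COUNTS = {
    "English": 104302,
    "Romanian": 101460,
    "Slovene": 91619,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Hungarian": 81147,
    "Estonian": 79334,
}

# Crude tag pattern: anything between '<' and '>' (ignores entities and comments).
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text):
    """Remove tag-like spans, then count whitespace-separated tokens."""
    return len(TAG_RE.sub(" ", sgml_text).split())


def relative_index(counts, base):
    """Scale the counts so that the base language maps to 100."""
    return {lang: round(100.0 * n / counts[base], 1) for lang, n in counts.items()}


index = relative_index(WORD_COUNTS, "English")
for lang, value in index.items():
    print(f"{lang}: {value}")
```

    Note that this index is based on word tokens only; a character-based index over the same texts could rank the languages differently.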



    This archive was generated by hypermail 2b29 : Thu Apr 25 2002 - 17:56:58 MET DST