Re: [Corpora-List] Web/Corpora Questions

From: Mike Maxwell
Date: Mon Oct 20 2003 - 21:14:43 MET DST

    William Fletcher wrote:
    > Personally I believe for the major languages the Web is most useful
    > for compiling ad-hoc corpora of texts dealing with specific domains
    > or emerging usage, or else for answering specific questions...

    A general comment (probably not relevant to the original query, but perhaps
    of interest to other readers). At the Linguistic Data Consortium, we've
    been reasonably successful at collecting corpora of non-major languages. We
    found substantial quantities of text (especially news text) for Hindi,
    although that is probably not what one would call a minority language in terms of
    population. But even for Cebuano, a Philippine language with about 15
    million speakers, we found well over 100K words of text. I've run trial
    searches for Tzeltal (a Mayan language with 100k+ speakers) turning up some
    texts (mainly collected by anthropologists). For Shuar (an indigenous
    language of Ecuador, 30k speakers), I was able to come up with a few hits,
    although they were pretty much limited to a Bible translation into that
    language.

    Finding texts on the web in smaller languages is pretty much a hit-(pardon
    the pun)-and-miss thing. Obviously, a lot depends on the number of people
    in the country who have web access, although that seems to be less of a
    consideration than I would have thought; the other big factor seems to be
    the official status of the language in the country. Among Indonesian
    languages, for example, it's very difficult to find anything that's not in
    Bahasa Indonesia. In the case of non-official languages, you often get
    more hits outside the country than you do inside--expat populations are
    often more likely to have web access and web sites than people inside the
    country, from my admittedly limited experience.

    Another thing that makes it difficult to track down corpora for smaller
    languages is the fact that encodings and even writing systems are not
    standardized. It's not too bad if the language's phonology is
    simple--Cebuano, for instance. But if the language has sounds which are not
    readily represented in ASCII characters, it can be more difficult. You have
    to think about how nasalized vowels, for example, might be written--or in
    some cases, not written. (Unicode is nice when it is used, but more often
    than not, it isn't.)
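
    To make the point about nasalized vowels concrete, here is a minimal
    sketch of how one might expand a single search term into the spellings
    it could plausibly take on the web. The variant table (precomposed
    tilde, combining tilde, a following "n", or nothing at all) and the
    function names are illustrative assumptions, not a description of
    what we actually ran; real orthographies will need their own tables.

```python
import itertools
import unicodedata

TILDE = "\u0303"  # combining tilde, used here to mark a nasalized vowel


def nasal_options(vowel):
    """Hypothetical ways a nasalized vowel might be written on a web page."""
    return [
        unicodedata.normalize("NFC", vowel + TILDE),  # precomposed, e.g. U+00E3
        vowel + TILDE,                                # base letter + combining tilde
        vowel + "n",                                  # ASCII digraph
        vowel,                                        # nasalization not written
    ]


def spelling_variants(word):
    """Return the candidate spellings of `word`, in a stable order."""
    # Decompose first so every tilde-marked vowel is base letter + combining tilde.
    # Other accented letters stay decomposed in the output.
    decomposed = unicodedata.normalize("NFD", word)
    slots = []
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        followed_by_tilde = i + 1 < len(decomposed) and decomposed[i + 1] == TILDE
        if followed_by_tilde and ch.lower() in "aeiou":
            slots.append(nasal_options(ch))
            i += 2
        else:
            slots.append([ch])
            i += 1
    variants = []
    for combo in itertools.product(*slots):
        candidate = "".join(combo)
        if candidate not in variants:  # drop duplicates, keep order
            variants.append(candidate)
    return variants


if __name__ == "__main__":
    # "t" + "a" + combining tilde is a made-up word, for illustration only.
    for variant in spelling_variants("ta\u0303"):
        print(variant)
```

    Each variant would then be submitted as a separate query. The number
    of variants grows quickly (four per nasalized vowel in this table),
    so in practice one would probably restrict the expansion to a handful
    of distinctive words.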

    I've thought about writing up our experiences in compiling archives for
    smaller languages, but I'm not sure what a good forum would be. And I
    probably don't have a good handle on what has already been published on this

        Mike Maxwell
        Linguistic Data Consortium

