Re: [Corpora-List] Re: Minor(ity) Language

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Wed Mar 08 2006 - 18:20:27 MET

    Chantal ENGUEHARD wrote:
    > Note: [In 2004, Vincent Berment defined in his thesis* an evaluation grid to
    > rate precisely the degree of computerization of any language. This
    > grid yields a number (a score on a 20-point scale).
    > If this number is less than 10 points, the language is said to be a
    > pi-language (pi being the Greek letter p).
    > If this number is more than 14 points, the language is said to be a
    > tau-language (tau being the Greek letter t).
    > Otherwise the language is said to be a mu-language (mu being the Greek letter
    > m).]
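
The thresholds in the quoted note can be written down as a small function. This is only a sketch of the classification step as described above: the function name `berment_class` is made up for illustration, and how the 20-point score itself is computed from the grid is not covered here, since the note does not say.

```python
def berment_class(score: float) -> str:
    """Map a score on the 20-point grid to the class named in the quoted note.

    Hypothetical helper: only the thresholds (below 10, above 14) come from
    the note; the name and the range check are assumptions.
    """
    if not 0 <= score <= 20:
        raise ValueError("score must be between 0 and 20")
    if score < 10:
        return "pi"   # less than 10 points
    if score > 14:
        return "tau"  # more than 14 points
    return "mu"       # everything from 10 to 14 inclusive
```

Note that scores of exactly 10 and exactly 14 both fall into the mu class, because the note says "less than 10" and "more than 14".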

    Reminds me of a project we (mostly Bill Poser and myself) did at the LDC
    a few years back, in which we tried to quantify the resources available
    for languages with at least a million speakers (of which the Ethnologue
    reports something like 330). We looked on the web for things like 100k
    words of monolingual and bilingual text, bilingual lexicons,
    morphological parsers (where relevant), etc. We did _not_ try to
    quantify more high-end things, such as syntactic parsers or MT programs
    (although we recorded them if we found them). Everything was
    text-based: we did not look at speech resources.

    A language was scored on each of these categories in a yes/no fashion.
    (It would have been nice to say how much bilingual text there was,
    rather than just more than or less than 100k words, but in many cases
    it's hard enough to find the answer to the yes/no question.) We then
    did a spreadsheet, with green for 'yes' in a given category, and red for
    'no'. By assigning numerical scores to various categories, we could
    easily sort the list of languages.
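
The yes/no scoring and sorting described above might look roughly like the following. It is a sketch under stated assumptions, not the spreadsheet that was actually used: the category keys paraphrase the resource types mentioned in the post, while the weights, the language names, and the yes/no values are all invented placeholders.

```python
# Hypothetical weights per category; the post does not give the real values.
WEIGHTS = {
    "monolingual_text_100k": 1,
    "bilingual_text_100k": 2,
    "bilingual_lexicon": 2,
    "morphological_parser": 3,
}


def score(resources: dict) -> int:
    """Sum the weights of every category marked 'yes' (True)."""
    return sum(w for cat, w in WEIGHTS.items() if resources.get(cat, False))


def rank(survey: dict) -> list:
    """Return language names sorted from best to worst resourced."""
    return sorted(survey, key=lambda name: score(survey[name]), reverse=True)


# Placeholder survey rows: True plays the role of a green cell, False of a red one.
survey = {
    "Language A": {"monolingual_text_100k": True, "bilingual_text_100k": False,
                   "bilingual_lexicon": True, "morphological_parser": False},
    "Language B": {"monolingual_text_100k": True, "bilingual_text_100k": True,
                   "bilingual_lexicon": True, "morphological_parser": True},
    "Language C": {"monolingual_text_100k": True, "bilingual_text_100k": False,
                   "bilingual_lexicon": False, "morphological_parser": False},
}

for name in rank(survey):
    print(name, score(survey[name]))
```

A missing category is treated as 'no' here, which matches the situation where the answer to the yes/no question simply could not be found.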

    In the end, we only had time to do about 150 languages (intentionally
    leaving out MSA, Mandarin Chinese, and most of the European languages,
    even the minor(ity) ones). When we showed the results to people, they
    thought it was the best thing since sliced bread. There are lots of
    ways it could be improved if we did it again. Unfortunately, such a
    survey quickly becomes out of date, and we have not found funding to
    revisit it.

    I'll have to see if I can get a copy of Berment's thesis...

        Mike Maxwell
        CASL/ U of Maryland


