Re: [Corpora-List] Re: Minor(ity) Language

From: Sigrun Helgadottir (sigrunh@lexis.hi.is)
Date: Thu Mar 09 2006 - 09:50:09 MET


    The discussion about "minority languages" on this list puzzles me slightly.
    My understanding is that a "minority language" is a language spoken by a
    minority. In other words, it describes a relative situation. Swedish is a
    minority language in Finland just as Finnish is a minority language in
    Sweden. Icelandic is mainly spoken by the 300 thousand or so inhabitants
    of Iceland but is certainly not a minority language there. However, it is a
    minority language in Canada, for example, where it is spoken by the
    descendants of Icelandic immigrants. Polish is not a minority language in
    Poland but it is a minority language in Iceland where it is spoken by
    Polish immigrants who make up about 1% of the population of Iceland.
    Sigrún Helgadóttir

    At 12:20 8.3.2006 -0500, Mike Maxwell wrote:
    >Chantal ENGUEHARD wrote:
    >>Note: [In 2004, Vincent Berment defined in his thesis* an evaluation grid to
    >>assess precisely the degree of computerization of any language. This
    >>grid yields a number (a score on a scale of 20 points).
    >>If this number is less than 10 points, the language is said to be a
    >>pi-language (pi being the Greek letter p).
    >>If this number is more than 14 points, the language is said to be a
    >>tau-language (tau being the Greek letter t).
    >>Otherwise the language is said to be a mu-language (mu being the Greek letter
    >>m).]
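
The three bands in the note above can be written down as a small function. This is only a sketch of the thresholds as quoted here (below 10, above 14, otherwise); the function name and the range check are assumptions for illustration, and how the grid itself arrives at the 20-point score is not covered.

```python
def classify_computerization(score: float) -> str:
    """Map a score on the 20-point grid to the pi/mu/tau label quoted above.

    The function name and the 0..20 range check are illustrative assumptions;
    only the two thresholds (10 and 14) come from the quoted note.
    """
    if not 0 <= score <= 20:
        raise ValueError("score is expected to lie on a 20-point scale")
    if score < 10:      # "less than 10 points" -> pi-language
        return "pi"
    if score > 14:      # "more than 14 points" -> tau-language
        return "tau"
    return "mu"         # everything from 10 to 14 inclusive -> mu-language
```

Note that, read literally, scores of exactly 10 and exactly 14 both fall in the mu band.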
    >
    >Reminds me of a project we (mostly Bill Poser and myself) did at the LDC a
    >few years back, in which we tried to quantify the resources available for
    >languages with at least a million speakers (of which the Ethnologue
    >reports something like 330). We looked on the web for things like 100k
    >words of monolingual and bilingual text, bilingual lexicons, morphological
    >parsers (where relevant), etc. We did _not_ try to quantify more high-end
    >things, such as syntactic parsers or MT programs (although we recorded
    >them if we found them). Everything was text-based: we did not look at
    >speech resources.
    >
    >A language was scored on each of these categories in a yes/no fashion. (It
    >would have been nice to say how much bilingual text there was, rather than
    >just more than or less than 100k words, but in many cases it's hard enough
    >to find the answer to the yes/no question.) We then did a spreadsheet,
    >with green for 'yes' in a given category, and red for 'no'. By assigning
    >numerical scores to various categories, we could easily sort the list of
    >languages.
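
The yes/no scoring and sorting described above might look roughly like the following. This is a minimal sketch, not the LDC spreadsheet itself: the category names, the weights, and the placeholder language names are all hypothetical, since the message does not give the actual scores assigned to each category.

```python
from typing import Dict, List, Tuple

# Hypothetical weights per category; the original survey's values are not
# stated in the message, so these numbers are placeholders.
WEIGHTS: Dict[str, int] = {
    "monolingual_text_100k": 1,
    "bilingual_text_100k": 2,
    "bilingual_lexicon": 2,
    "morphological_parser": 3,
}


def resource_score(answers: Dict[str, bool],
                   weights: Dict[str, int] = WEIGHTS) -> int:
    """Sum the weights of every category answered 'yes' (True)."""
    return sum(w for category, w in weights.items()
               if answers.get(category, False))


def rank_languages(surveys: Dict[str, Dict[str, bool]]) -> List[Tuple[str, int]]:
    """Return (language, score) pairs, best-resourced first, ties by name."""
    scored = [(lang, resource_score(answers)) for lang, answers in surveys.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Placeholder survey data, not real findings about any language.
surveys = {
    "Language A": {"monolingual_text_100k": True, "bilingual_lexicon": True},
    "Language B": {"monolingual_text_100k": True, "bilingual_text_100k": True,
                   "morphological_parser": True},
    "Language C": {},
}
ranking = rank_languages(surveys)
```

A category missing from a language's answers is treated as 'no', which matches the situation where the yes/no question could not be answered from what was found on the web.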
    >
    >In the end, we only had time to do about 150 languages (intentionally
    >leaving out MSA, Mandarin Chinese, and most of the European languages,
    >even the minor(ity) ones). When we showed the results to people, they
    >thought it was the best thing since sliced bread. There are lots of ways
    >it could be improved if we did it again. Unfortunately, such a survey
    >quickly becomes out of date, and we have not found funding to revisit it.
    >
    >I'll have to see if I can get a copy of Berment's thesis...
    >
    > Mike Maxwell
    > CASL/ U of Maryland
    >


