Re: [Corpora-List] Re: Minor(ity) Language

From: Chris Brew (cbrew@acm.org)
Date: Thu Mar 09 2006 - 16:14:47 MET


    Whether a language gets worked on in corpus linguistics, NLP or
    computational linguistics depends on at least the following:

    - the number of people who speak it
    - the total income of people who speak it
    - the extent to which computational and/or lexical
       resources exist for it
    - the extent to which the people who hold the lexical
       resources make them conveniently available in ways that foster
       research. This also affects the nature of the research: people
       who want to run machine learning algorithms look for different
       kinds of access than those who want to see a small number of
       key examples presented in context
    - the level of governmental support, enthusiasm and funding
    - the extent to which researchers who choose to work on the
       language are loved and appreciated by the society
    - whether language is a significant political issue, and if so
       in what way
    - the potential scientific payoff of working on the language
       in question

    Given the number of dimensions involved (I'm sure the above is not
    exhaustive), I doubt that it makes any sense to draw hard decision
    boundaries between minority/majority, endangered/safe/hegemonic or
    indeed any other fixed set of terms. So when we write about our work,
    we'll just have to get used to including brief summaries of the
    relevant aspects of the language situation. Self-evidently, it is
    somehow different to study the Arabic of Dearborn, Michigan or the
    Spanish of émigré Puerto Ricans and Mexicans in Lorain County, Ohio
    than to study them in San Juan, Tijuana or Lebanon, but until we get
    to specifics we won't want to pick terms that describe the languages
    in a hard and fast way.

    Chris

    On Thu, Mar 09, 2006 at 09:36:06AM -0500, Ed Kenschaft wrote:
    > On 3/9/06, Nicholas Sanders <nick@semiotek.org> wrote:
    >> But the Polish and Icelandic examples don't fit the model,
    >> because they have no official status in the countries cited.
    >
    > Correct me if I'm wrong, but I don't think *any* language has official
    > status in the United States. Does that mean we don't have any
    > minority (or majority) languages?
    >
    > Still, you make a good point. A language that is clearly not a
    > minority language worldwide (e.g. Hindi) might well be a minority
    > language in a specific context. Thus complicating the terminology
    > still further.
    >
    > On 3/8/06, Mike Maxwell <maxwell@ldc.upenn.edu> wrote:
    >> On this side of the Atlantic, the term seems to be "low density
    >> languages" ...
    >
    > In my circle, the most common term might be "scarce-resource
    > languages". (We got tired of explaining to people that the meaning of
    > "low density" had nothing to do with density.) The term gets at the
    > idea that a language might be spoken by a lot of people, but still not
    > have a lot of computational resources available (e.g. Hindi, Urdu).
    >
    > Cheers.
    >
    > --
    > Ed Kenschaft
    > ekenschaft@gmail.com
    > www.umiacs.umd.edu/users/kensch/
    >


