Re: [Corpora-List] Re: Minor(ity) Language

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Fri Mar 10 2006 - 00:19:06 MET

    Briony Williams wrote:
    > This sounds similar to the BLARK concept ("Basic Language Resource
    > Kit"), which was proposed by Stephen Krauwer and developed by ELSNET and
    > ELRA. See http://www.elda.org/blark - quote: "in the framework of the
    > ENABLER thematic network ... ELDA elaborated a report defining a
    > (minimal) set of LRs to be made available for as many languages as
    > possible and mapping the actual gaps that should be filled in so as to
    > meet the needs of the HLT field.".
    > That website also contains "BLARK matrices", one per language, to be
    > filled in similarly to the LDC project described above.

    We (Chris Cieri and myself) presented our project at one of the early
    meetings discussing the BLARK a couple of years ago (maybe it was the
    first one, I'm not sure). Our reasons for setting the bounds on our own
    survey work (languages with >= 1M speakers, more or less binary
    decisions, etc.) were practical: we wanted to set a goal that we could
    achieve. And even at that, we only got about halfway through our list
    of languages.

    > However, there are differences:
    >
    > 1) BLARK covers speech resources also (not just text resources).
    > 2) BLARK does not set a minimum number of speakers for a
    > language (hence it can cover lesser-used languages as well).
    > 3) BLARK also includes "high-end" modules (e.g. syntactic parsers,
    > sentence generation).
    > 4) The BLARK matrix can be filled in with a greater degree of detail
    > than "yes/no" - i.e. "irrelevant", "important", "very
    > important", "essential".

    We left out virtually all the European languages, precisely because we
    felt we could rely on the European community to survey those
    languages--and also because it was obvious that most European languages
    were rapidly becoming at least "medium density" languages, if not high
    density, and our goal was to report on _low_ density languages. At the
    other end, we didn't try to cover languages with fewer than a million
    speakers, because we had to set a limit somewhere (even if it was an
    arbitrary limit) if we were to have a doable project. And the chances
    seemed very slim that a small language was going to have much in the way
    of resources. (There are fortunate exceptions, of course, but we would
    have spent a lot of time looking for them.)

    A couple of questions in our survey, while having binary answers, were
    more along the "irrelevant/essential" line (point (4) above). For
    instance, we asked whether the language had complex inflectional
    morphology, by which we meant roughly "significantly more complex than
    English." The reason for asking was that a follow-up question--whether
    a morphological parser existed for the language--only made sense if the
    answer to the complex morphology question was yes.

    As for not looking for syntactic parsers, our feeling was that this was
    a survey of _low_ density languages, so almost by definition the answer
    would be "no". (If no one has built a morphological parser for
    Tigrinya, then there won't be a syntactic parser.) The same point
    largely holds for speech resources, although that may be changing now.

    > The website asks researchers to fill in details for languages
    > which they have knowledge of - all languages, not only European
    > ones. This is a much-needed project and should be encouraged.

    I agree about the importance. It looks like the website has just Modern
    Standard Arabic at this point, unless I missed something. It would be
    great to expand this.

    As I say, I've tried several times to revive (funding for) the sort of
    survey we did at the LDC, with improvements. My feeling is that doing
    such a survey, and keeping it up to date, will require both training
    multiple surveyors (I don't think it should be a two-person job, like
    ours was) and paying them to take the time to do a good job (and to do
    updates).

    I have immense respect for open-source ventures like Wikipedia, but
    such projects are going to be hit-and-miss when it comes to languages:
    Wikipedia doesn't exist in 300 languages, and probably won't for a
    long time. OTOH, you can find some language resources (particularly
    monolingual text, and sometimes dictionaries) for a lot of low density
    languages, either because there is some commercial market for them
    (newspapers), or because it's a one-person labor of love (some
    dictionaries). But that's my personal opinion, and I would love to be
    proved wrong!

        Mike Maxwell
        CASL/ University of Maryland



    This archive was generated by hypermail 2b29 : Fri Mar 10 2006 - 00:19:59 MET