RE: [Corpora-List] Legal aspects of compiling corpora

From: Khalid CHOUKRI (choukri@elda.fr)
Date: Thu Jun 19 2003 - 14:10:18 MET DST


    On Thursday 19/06/2003 at 10:26, Sampo Nevalainen wrote:
    >Hi,
    >
    >>then we will face another problem of comparing approaches and techniques,
    >>if each of us use different corpora (without any possibility to share it
    >>with others because of the legal aspects) then no comparison will be possible.
    >
    >My comment is clearly off topic, but I could not resist... This is one
    >thing I have not fully understood ever since I was irrevocably taken with
    >CL. Many textbooks on CL give the idea that a corpus should have a finite
    >size and be "a standard reference" (as McEnery and Wilson put it in
    >"Corpus Linguistics" 1996). In my humble opinion, this is rather
    >unnatural, as, after all, we are studying an open, ever-growing, dynamic,
    >lively organism (unless we are interested in "dead" languages). From this
    >viewpoint, if we are going to generalize anything about a language, at
    >least I would have more confidence in results that are based on several
    >different corpora rather than on a detailed description of a certain
    >corpus. Just as with weather forecasts or climate studies -- the more
    >measurement points are available, the more reliable they are. (Clearly, one
    >practical solution is a kind of "monitor corpus" -- or the Internet. I
    >understand that the importance of this question depends a lot on the
    >purpose(s) of the corpus and the aim(s) of the researcher, which, I think,
    >should be convergent to some extent.) Of course, the other side of the
    >coin is economy. It would be a huge waste of money and resources if
    >everybody had to compile corpora of their own - and preferably non-stop!
    >
    >sincerely
    >Sampo

    Dear Sampo

    Since you mentioned weather forecasts, I am sure you will understand when I
    say that today it is 19° here in Paris, while CNN says that it is 67.
    If we do not share the same scale, no comparison is possible. For corpora, we
    have seen this when evaluating taggers (as an example): some people may
    announce that they are 85% accurate, while others may claim that their
    algorithms outperform these and achieve 90%. The only way to compare such
    figures is to share the corpora and the metrics.
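
To make the tagger example concrete, here is a minimal sketch in Python. The tag sequences, the names `gold_tags`, `tagger_a` and `tagger_b`, and the choice of plain token-level accuracy as the metric are all assumptions made for illustration, not real evaluation data. The point is only that two scores become comparable when both are computed against the same reference corpus with the same metric.

```python
def accuracy(gold, predicted):
    """Token-level accuracy: share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("empty corpus")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical gold-standard tags for one small shared corpus (illustrative only).
gold_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "ADP", "DET", "NOUN", "PUNCT"]

# Hypothetical outputs of two taggers run on the SAME tokens.
tagger_a = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN", "ADP", "DET", "NOUN", "PUNCT"]
tagger_b = ["DET", "NOUN", "NOUN", "DET", "NOUN", "NOUN", "ADP", "DET", "VERB", "PUNCT"]

# Same corpus, same metric: the two numbers can now be compared directly.
score_a = accuracy(gold_tags, tagger_a)
score_b = accuracy(gold_tags, tagger_b)
print(f"tagger A: {score_a:.0%}, tagger B: {score_b:.0%}")
```

Had each tagger been scored on its own private corpus, the two percentages would say little about which one is better, just as 19 and 67 say little without the scale.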

    But of course the corpora should grow and be updated regularly (and then we
    face the economic and financial issues you pointed out).

    So I am sure there is a need for common corpora for as many languages as
    possible. But "common" should also take into account the requirements of
    many researchers, so that a consensus on their needs can be reached.

    You may want to look at the description of the BLARK concept (Basic
    Language Resources Kit) at http://www.elda.fr, under Projects > European &
    International (http://www.elda.fr/article.php3?id_article=48),
    or at the report drafted in the framework of the European project ENABLER
    (European National Activities for Basic Language Resources,
    Thematic Network), Deliverable D5.1: "Report on a (minimal) set of LRs to be
    made available for as many languages as possible, and map of the actual gaps",
    available at http://www.enabler-network.org/reports.htm (report D5.1).

    All the best
    Khalid

    *************************************************************
    Khalid CHOUKRI mailto:choukri@elda.fr
    ELRA CEO
    Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
    Postal Mail: 55 Rue Brillat-Savarin, 75013 Paris France
    Home page: http://www.elda.fr/ or http://www.elra.info/
    LREC news: http://www.lrec-conf.org/
    ***************************************************************


