[Corpora-List] Announcement: Leipzig Corpora Collection

From: Leipzig Corpora Collection (korpus@informatik.uni-leipzig.de)
Date: Mon Sep 18 2006 - 12:05:27 MET DST

    The Leipzig Corpora Collection presents corpora in different languages using
    the same format and comparable sources. The following languages are included:
    Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian,
    Japanese, Korean, Norwegian, Sorbian, Swedish, and Turkish.

    There is an online interface at http://corpora.uni-leipzig.de/ . Moreover, all
    data are available as plain text and as MySQL database tables for various
    applications. The corpora are ready to use with the Corpus Browser, see
    http://corpora.uni-leipzig.de/download.html . The corpora are intended both for
    scientific use by corpus linguists and for applications such as knowledge
    extraction programs.

    The corpora are identical in format and similar in size and content. They
    contain randomly selected sentences in the language of the corpus and are
    available in sizes of 100,000 sentences, 300,000 sentences, 1 million
    sentences, etc. The sources are either newspaper texts or texts randomly
    collected from the web. The texts are split into sentences. Non-sentences and
    foreign-language material were removed.

    As the order of sentences is scrambled, these data are not helpful in tasks
    that go beyond sentence boundaries. However, this design helps us to overcome
    copyright issues, as documents cannot be reconstructed from the corpora
    provided and single sentences are not protected by copyright.
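
    As a rough illustration of this design (not the collection's actual build
    pipeline, which is not described here), drawing a fixed-size random sample of
    sentences in scrambled order could look like the sketch below. The function
    name build_sample, the seed parameter, and the toy input are assumptions made
    for the example.

```python
import random

def build_sample(sentences, size, seed=0):
    """Return `size` randomly chosen sentences in random order.

    Hypothetical helper: random.sample draws without replacement and
    returns the items in selection order, so the original document
    order of the sentences is not preserved.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(sentences, min(size, len(sentences)))

# Toy input standing in for sentences split from source documents.
pool = [f"Sentence number {i}." for i in range(1000)]
sample = build_sample(pool, 100)
print(len(sample))  # 100
```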

    Because information about which words co-occur with each other is useful for
    many applications, these data were precomputed and included as well. For each
    word, the most significant words appearing

    a) as immediate left neighbour

    b) as immediate right neighbour

    c) anywhere within the same sentence

    are given. The quality of such co-occurrence data increases with corpus size,
    so users who need more reliable statistics may prefer the forthcoming larger
    corpora.
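
    The three kinds of co-occurrence listed above can be sketched as follows. This
    is a minimal illustration only: the significance measure used for the
    collection is not stated in this announcement, so the sketch ranks by raw
    frequency as a stand-in, and the function name cooccurrences, the whitespace
    tokenisation, and the sample sentences are all assumptions.

```python
from collections import Counter, defaultdict

def cooccurrences(sentences):
    """Count left-neighbour, right-neighbour and same-sentence co-occurrences."""
    left = defaultdict(Counter)      # left[w][v]: v directly precedes w
    right = defaultdict(Counter)     # right[w][v]: v directly follows w
    sentence = defaultdict(Counter)  # sentence[w][v]: w and v share a sentence
    for text in sentences:
        tokens = text.lower().split()  # naive tokenisation (assumption)
        for i, word in enumerate(tokens):
            if i > 0:
                left[word][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[word][tokens[i + 1]] += 1
        types = set(tokens)  # count each word pair once per sentence
        for word in types:
            for other in types:
                if other != word:
                    sentence[word][other] += 1
    return left, right, sentence

sample = ["the cat sat", "the cat ran", "a dog sat"]
left, right, sentence = cooccurrences(sample)
print(left["cat"].most_common(1))  # [('the', 2)]
```

    Raw counts favour frequent function words, which is one reason a statistical
    significance measure is normally preferred for ranking.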

    The authors will add larger corpora and new languages soon. The Leipzig Corpora
    Collection is also open to including other existing corpora in collaboration
    with their respective owners.

    Please contact: korpus@informatik.uni-leipzig.de


