[Corpora-List] German treebanks - new releases

From: Heike Zinsmeister (heike.zinsmeister@uni-tuebingen.de)
Date: Mon Nov 14 2005 - 09:45:34 MET

    The Division of Computational Linguistics at the Seminar fuer
    Sprachwissenschaft of the University of Tuebingen (Germany) is happy to
    announce the release of two German language resources:

    * The Tuebingen Treebank of Spoken German (TueBa-D/S)
    * The Tuebingen Treebank of Written German (TueBa-D/Z) - second release

    Both treebanks share the same basic annotation scheme, which
    distinguishes four levels of syntactic constituency: the lexical level,
    the phrasal level, the level of topological fields, and the clausal level.
    In addition to constituent structure, the annotated trees carry edge
    labels between nodes that encode grammatical functions.
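
    As a rough illustration of the structure described above, the
    following Python sketch models a tree whose nodes belong to one of the
    four levels and whose edges carry a grammatical function. It is not
    the treebank's data model or any of its release formats: the class
    Node, the helper functions, and all labels (CLAUSE, FIELD-A, NP, SUBJ,
    HEAD, ...) are hypothetical placeholders, not the actual
    TueBa-D/S or TueBa-D/Z tagset.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# The four levels of constituency named in the announcement.
LEVELS = ("lexical", "phrasal", "field", "clausal")


@dataclass
class Node:
    label: str                  # node category (placeholder values only)
    level: str                  # one of LEVELS
    edge: Optional[str] = None  # grammatical function on the edge to the parent
    word: Optional[str] = None  # surface form, set only on lexical nodes
    children: List["Node"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")


def words(node: Node) -> List[str]:
    """Return the words dominated by a node, in left-to-right order."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in words(child)]


def phrase_functions(node: Node) -> List[Tuple[str, str]]:
    """Collect (edge label, covered text) for every labelled phrasal node."""
    found = []
    if node.level == "phrasal" and node.edge is not None:
        found.append((node.edge, " ".join(words(node))))
    for child in node.children:
        found.extend(phrase_functions(child))
    return found


# Hypothetical two-word clause; all labels are illustrative placeholders.
example = Node("CLAUSE", "clausal", children=[
    Node("FIELD-A", "field", children=[
        Node("NP", "phrasal", edge="SUBJ", children=[
            Node("N", "lexical", edge="HEAD", word="Peter"),
        ]),
    ]),
    Node("FIELD-B", "field", children=[
        Node("VP", "phrasal", edge="HEAD", children=[
            Node("V", "lexical", edge="HEAD", word="lacht"),
        ]),
    ]),
])

print(phrase_functions(example))  # [('SUBJ', 'Peter'), ('HEAD', 'lacht')]
```

    Keeping the grammatical function on the node as the label of its
    incoming edge is only one possible design; it keeps constituent
    structure and function labels in a single tree.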

    Both treebanks are available in three formats:
       * NEGRA export format
       * XML format
       * Penn Treebank format

    The treebanks in detail:

    1. The Tuebingen Treebank of Spoken German (TueBa-D/S)

    The TueBa-D/S treebank was annotated in the Verbmobil project,
    a long-term machine translation project for spontaneous speech funded
    by the German Ministry for Education, Science, Research, and
    Technology (BMBF). This is the first public release of the treebank.

    TueBa-D/S is a syntactically annotated corpus based on spontaneous
    dialogues, which were manually transliterated. The treebank comprises
    approximately 38 000 sentences (ca. 360 000 words). The syntactic
    annotation was also performed manually.

    The license for TueBa-D/S is granted free of charge for scientific use.
    For more information, please refer to:
    http://www.sfs.uni-tuebingen.de/en_tuebads.shtml

    2. The Tuebingen Treebank of Written German (TueBa-D/Z) - second release

    The TueBa-D/Z treebank is a manually annotated German newspaper
    corpus based on data taken from daily issues of 'die tageszeitung'.
    It currently comprises approximately 22 000 sentences (ca. 380 000 words).

    The annotation scheme is an extended version of the TueBa-D/S annotation
    scheme. It accounts for a larger number of linguistic phenomena and is
    enriched at two levels: (multi-word) named entities are marked at the
    phrasal level; words are annotated with inflectional morphology at the
    lexical level (currently ca. 70% of the sentences are covered).

    What is new in the second release:

    - about 6 800 additional sentences
    - morphological information
    - cleaner versions of the trees published in the first release

    The license for TueBa-D/Z is granted free of charge for scientific use.
    For more information, please refer to:
    http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml

    With best regards,

    Erhard W. Hinrichs
    Sandra Kübler
    Heike Zinsmeister
    -------------------------------------------------------

    For your information:

    A related resource is the Tuebingen Partially Parsed Corpus of
    Written German (TuePP-D/Z), released in December 2003.

    TuePP-D/Z is a 200-million-word collection of articles from the taz
    newspaper which have been automatically annotated with clause structure,
    topological fields, and chunks, in addition to lower-level annotation
    including parts of speech and morphological ambiguity classes.

    For more information, please refer to:
    http://www.sfs.uni-tuebingen.de/en_tuepp.shtml


