[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Jan 05 2006 - 22:07:23 MET

    LDC2005T35
    ANC Second Release
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35>

    LDC2005T28
    HARD 2004 Text
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28>

    LDC2005T29
    HARD 2004 Topics and Annotations
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29>

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of three new publications.

    ------------------------------------------------------------------------

    *New LDC Publications*

    (1) The American National Corpus (ANC) project fosters the development
    of a corpus comparable to the British National Corpus (BNC), covering
    American English. Corpus-analytic work has demonstrated that the BNC is
    inappropriate for the study of American English, due to the numerous
    differences in use of the language.

    The availability of a corpus of American English will significantly
    contribute to language and linguistic research, the development of
    language understanding computer applications (e.g., language translation
    and search and retrieval software), and the compilation of reference
    works such as dictionaries and thesauri. It will also provide a rich
    national resource for use in education at all levels.

    ANC Second Release
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35>
    contains over 20 million words: 10+ million words added in the Second
    Release, and a new corrected and validated version of the 11 million
    word ANC First Release. The Second Release also contains software for
    searching and retrieving multiple stand-off annotations.

    ANC Second Release contains texts from the following sources (* denotes
    new source in the Second Release):

    Transcribed telephone speech (LDC and Project MORE)
    New York Times
    Berlitz Travel Guides (Langenscheidt Publishers)
    Slate Magazine (Microsoft)
    ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural
    Communication)*
    The Michigan Corpus of Academic Spoken English (MICASE) (University of
    Michigan, English Language Institute)*
    Various non-fiction
    Various fiction (Orin Hargraves, Ferd Eggan)*
    Various medical research articles (BioMed Central, Public Library of
    Science)*
    Anonymized Posts to the Phoenix Board/Buffistas.org*

    *NOTE:* The cost of the first 50 copies of this publication (not
    counting the copies distributed to LDC members) is covered by NSF Grant
    Number BCS-998009, so those copies are free of charge to qualified
    researchers, apart from a $30 shipping and handling fee. After these
    first 50 copies are distributed, additional copies will be available for
    the nonmember fee of US$75.

    (2) The HARD 2004 Text
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28>
    corpus contains source data for the 2004 TREC HARD (High Accuracy
    Retrieval from Documents) Evaluation. HARD 2004 was a track within the
    NIST Text REtrieval Conference (TREC), with the objective of achieving
    high accuracy retrieval from documents by leveraging additional
    information about the searcher and/or the search context, through
    techniques like passage retrieval and the use of targeted interaction
    with the searcher. The topics and annotations that correspond to this
    release are distributed as LDC2005T29, HARD 2004 Topics and Annotations.
    This corpus was created with support from the DARPA TIDES Program and LDC.

    HARD 2004 Text comprises eight English newswire and web text sources
    from January-December 2003. The sources are:

    AFE: Agence France Presse - English
    APE: Associated Press Newswire
    CNE: Central News Agency Taiwan - English
    LAT: Los Angeles Times/Washington Post
    NYT: New York Times
    SLN: Salon.com
    UME: Ummah Press - English
    XIE: Xinhua News Agency - English
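
    As a rough illustration, the eight source codes listed above can be
    held in a small lookup table. The Python sketch below is not part of
    the release: the helper source_name, the example identifier, and the
    assumption that a document identifier begins with its three-letter
    source code are all hypothetical and are not taken from the corpus
    documentation.

```python
# Three-letter source codes exactly as listed in the announcement.
HARD_2004_SOURCES = {
    "AFE": "Agence France Presse - English",
    "APE": "Associated Press Newswire",
    "CNE": "Central News Agency Taiwan - English",
    "LAT": "Los Angeles Times/Washington Post",
    "NYT": "New York Times",
    "SLN": "Salon.com",
    "UME": "Ummah Press - English",
    "XIE": "Xinhua News Agency - English",
}


def source_name(doc_id: str) -> str:
    """Hypothetical helper: map a document ID to its source name.

    Assumes (not confirmed by the announcement) that the ID starts
    with the three-letter source code.
    """
    code = doc_id[:3].upper()
    return HARD_2004_SOURCES.get(code, "unknown source")


# Invented identifier, used only to show the call.
print(source_name("NYT20030101.0001"))
```

    If the real identifiers follow a different scheme, only the line that
    extracts the code would need to change.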

    (3) The HARD 2004 Topics and Annotations
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29>
    corpus contains topics and annotations (clarification forms, responses
    and relevance assessments) for the 2004 TREC HARD (High Accuracy
    Retrieval from Documents) Evaluation. HARD 2004 was a track within the
    NIST Text REtrieval Conference (TREC), with the objective of achieving
    high accuracy retrieval from documents by leveraging additional
    information about the searcher and/or the search context, through
    techniques like passage retrieval and the use of targeted interaction
    with the searcher. The source data that corresponds to this release is
    distributed as LDC2005T28, HARD 2004 Text. This corpus was created with
    support from the DARPA TIDES Program and LDC.

    Three major annotation tasks are represented in this release: Topic
    Creation, Clarification Form Responses, and Relevance Assessment. Topics
    include a short title, query plus context, and a number of limiting
    parameters known as "metadata" which include targeted geographical
    region, target data domain or genre, and level of searcher expertise.
    Clarification Forms are brief HTML questionnaires that system developers
    submitted to LDC searchers to glean additional detail about the
    information needs directly from the topic creators. Relevance assessment
    consisted of adjudication of pooled system responses, and included
    document-level judgments for all topics, and passage-level relevance
    judgments for a subset of topics.
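
    A minimal sketch of how a topic, as described above, might be
    represented in Python. The class and field names are illustrative
    assumptions; the actual file format of the release is not described in
    this announcement, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass
class TopicMetadata:
    # Limiting parameters named in the announcement; field names are hypothetical.
    geography: str   # targeted geographical region
    genre: str       # target data domain or genre
    expertise: str   # level of searcher expertise


@dataclass
class Topic:
    title: str       # short title
    query: str       # the query itself
    context: str     # context accompanying the query
    metadata: TopicMetadata


# Invented example values, for illustration only.
example = Topic(
    title="Example topic",
    query="an example query",
    context="why the searcher wants this information",
    metadata=TopicMetadata(geography="any", genre="news", expertise="novice"),
)
```

    Keeping the metadata in its own class mirrors the announcement's
    distinction between the query text and its limiting parameters.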

    The release is divided into training and evaluation resources. The
    training set comprises twenty-one topics and 100 document-level
    relevance judgments per topic. The evaluation set contains fifty topics,
    clarification forms and responses, document-level relevance assessments
    for all topics, and passage-level judgments for half of the topics.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium        Phone: (215) 573-1275
    3600 Market Street                Fax: (215) 573-2175
    Suite 810                         ldc@ldc.upenn.edu
    Philadelphia, PA 19104            http://www.ldc.upenn.edu
