[Corpora-List] News from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Nov 30 2006 - 23:33:30 MET

    ------------------------------------------------------------------------
    *40,000th LDC Corpus Distributed!*

    In 2003, the LDC celebrated its tenth anniversary and the distribution
    of our 15,000th corpus. At that time, the LDC recognized the continued
    support of its constituent members by offering a free membership to the
    university which had licensed the 15,000th corpus. Three short years and
    many requests for data later, we are excited to have recently
    distributed our 40,000th corpus! We would like to thank all
    organizations which have licensed data for helping the LDC reach this
    landmark distribution. The growing demand for LDC data from over 2000
    organizations supports our mission to develop and share resources for
    research in linguistic technologies. At the increased rate at which we
    are distributing corpora, we anticipate reaching our 50,000th
    distribution soon. Stay tuned...

    *New Publications*

    (1) French Gigaword First Edition
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17>
    is a comprehensive archive of newswire text data that has been acquired
    over several years by the Linguistic Data Consortium (LDC) at the
    University of Pennsylvania.

    The two distinct international sources of French newswire in this
    edition, and the time spans of collection covered for each, are as follows:

        * Agence France-Presse (afp_fre) May 1994 - July 2006
        * Associated Press French Service (apw_fre) Nov 1994 - July 2006

    The overall totals for each source are summarized below. Note that the
    "Totl-MB" numbers show the amount of data you get when the files are
    uncompressed (i.e. approximately 4.6 gigabytes, total); the "Gzip-MB"
    column shows totals for compressed file sizes as stored on the DVD-ROM;
    the "K-wrds" numbers are simply the number of whitespace-separated
    tokens (of all types) after all SGML tags are eliminated.

    Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
    AFP_FRE      147     1139     3445  482904  1797139
    APW_FRE      141      389     1167  167405   622740
    TOTAL        288     1528     4612  650309  2419879
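
    As a rough illustration of the "K-wrds" definition above, the following
    Python sketch strips SGML tags and counts whitespace-separated tokens.
    It is not the tool used to produce the table: the function name
    count_tokens, the tag pattern, and the sample document are all
    assumptions made for this example, and the markup in the actual corpus
    files may differ.

```python
import re

# Hypothetical helper, not an LDC tool: approximates the "K-wrds" definition.
# Assumption: anything between "<" and ">" is an SGML tag.
TAG_PATTERN = re.compile(r"<[^>]*>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so text on either side is not fused.
    text_only = TAG_PATTERN.sub(" ", sgml_text)
    # str.split() with no arguments splits on any run of whitespace.
    return len(text_only.split())


# Invented sample document; the real corpus markup may look different.
sample = """<DOC id="EXAMPLE_0001" type="story">
<HEADLINE>Un titre d'exemple</HEADLINE>
<TEXT>
<P>Ceci est une phrase d'exemple.</P>
</TEXT>
</DOC>"""

print(count_tokens(sample))  # 3 headline tokens + 5 body tokens = 8
```

    Summing such counts over every file for a source and dividing by 1000
    would presumably give a figure comparable to the "K-wrds" column,
    assuming that column is expressed in thousands of tokens.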

    French Gigaword First Edition is distributed on one DVD-ROM.

    (2) Iraqi Arabic Conversational Telephone Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S45>
    contains recordings of 276 Iraqi Arabic speakers taking part in spontaneous telephone
    conversations in Colloquial Iraqi Arabic. A total of 976 conversation
    sides are provided (one speaker appears on two distinct calls). The
    average duration per side is about 6 minutes.

    This corpus was collected and transcribed in 2003 and 2004 by Appen Pty
    Ltd, Sydney, Australia. Iraqi Arabic Conversational Telephone Speech is
    distributed on one DVD-ROM.

    (3) Iraqi Arabic Conversational Telephone Speech, Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T16>
    contains transcripts of 276 Iraqi Arabic speakers taking part in spontaneous telephone
    conversations in Colloquial Iraqi Arabic. A total of 976 conversation
    sides are provided (one speaker appears on two distinct calls). The
    average duration per side is about 6 minutes. This corpus was collected
    and transcribed in 2003 and 2004 by Appen Pty Ltd, Sydney, Australia.
    Iraqi Arabic Conversational Telephone Speech, Transcripts is distributed
    via web download.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium        Phone: (215) 573-1275
    University of Pennsylvania        Fax: (215) 573-2175
    3600 Market St., Suite 810        ldc@ldc.upenn.edu
    Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu


