[Corpora-List] Recent LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Aug 07 2003 - 21:44:58 MET DST


                                  LDC2003T12

                           * Arabic Gigaword *

                                  LDC2003V01

                       * FORM2 Kinematic Gesture *

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of two new releases.

    1. Arabic Gigaword is a comprehensive archive of newswire text data
    that has been acquired from Arabic news sources by the LDC. The
    newswire texts are drawn from four sources:

       Agence France Presse (afp)
       Al Hayat News Agency (alh)
       Al Nahar News Agency (ann)
       Xinhua News Agency (xin)

    Much of the Agence France Presse content in this collection has been
    published previously by the LDC in Arabic Newswire Part 1 (LDC2001T55).
    The entire Al Hayat, An Nahar and Xinhua Arabic content, as well as
    AFP content for 2001-2002, is previously unreleased material.

    Arabic Gigaword consists of 319 files, totaling approximately 1.1GB in
    compressed form (4348 MB uncompressed, and 391619 Kwords). All text
    files in the corpus have been converted to UTF-8 character encoding.
    Arabic Gigaword is distributed on DVD.
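
    As a rough sanity check on the figures above, the short Python
    sketch below derives a compression ratio, average bytes per word,
    and average uncompressed file size. The function name and the unit
    conventions (1 GB = 1000 MB, 1 MB = 10**6 bytes) are assumptions
    made for illustration; the announcement does not specify them.

```python
# Figures quoted in the announcement.
COMPRESSED_MB = 1.1 * 1000   # "approximately 1.1GB"; assumes 1 GB = 1000 MB
UNCOMPRESSED_MB = 4348
KWORDS = 391619              # thousands of words
FILES = 319


def corpus_stats(compressed_mb, uncompressed_mb, kwords, files):
    """Derive rough size statistics from the announced totals.

    Hypothetical helper, not part of any LDC tooling.
    """
    words = kwords * 1000
    total_bytes = uncompressed_mb * 1_000_000  # assumes 1 MB = 10**6 bytes
    return {
        "compression_ratio": uncompressed_mb / compressed_mb,
        "bytes_per_word": total_bytes / words,
        "mb_per_file": uncompressed_mb / files,
    }


stats = corpus_stats(COMPRESSED_MB, UNCOMPRESSED_MB, KWORDS, FILES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

    Under these assumptions the text compresses by a factor of roughly
    four and averages about 11 bytes per word, which is plausible for
    UTF-8 Arabic, where each Arabic letter takes two bytes.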

    For further information, including a link to online documentation,
    please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T12

    Institutions that have membership in the LDC during the 2003
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may license this publication for $2,500.

                                     *

    2. FORM is a gesture annotation scheme designed to capture the
    kinematic information in gesture from videos of speakers. FORM2
    Kinematic Gesture is a detailed database of gesture-annotated videos
    stored in the Anvil and FORM file formats. FORM encodes the "phonetics"
    of gesture by giving geometric descriptions of location and movement of
    the right and left arms. Other kinematic information, such as effort
    and shape, is also recorded.

    FORM2 Kinematic Gesture contains a total of 24 data files: 8 movie
    files, 8 Anvil files, and 8 FORM files. The movie files represent 12
    minutes of audio and video recordings excerpted from a lecture given by
    Brian MacWhinney on January 24, 2000 at Carnegie Mellon University.
    These video recordings were chosen because they are part of the
    NSF-funded Talkbank project.

    For further information, including a link to the FORM website and online
    documentation, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01

    The cost of the first 50 copies of this publication (not including the
    copies distributed to LDC members) is covered by sponsoring grants.
    These copies are, therefore, free of charge to qualified researchers;
    a $30 shipping and handling fee applies. After these first 50 copies
    are distributed, additional copies will be available for the production
    cost of $500 per CD-ROM.

                                     *

    If you need additional information before placing your order, or
    would like to inquire about membership in the LDC, please send email to
    <ldc@ldc.upenn.edu> or call (215) 573-1275.

    ---------------------------------------------------------------------
    Linguistic Data Consortium        Phone: (215) 573-1275
    3600 Market Street                Fax:   (215) 573-2175
    Suite 810                         email: ldc@ldc.upenn.edu
    Philadelphia, PA 19104-2653       www:   http://www.ldc.upenn.edu


