[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Jan 05 2006 - 22:07:23 MET

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new publications.

------------------------------------------------------------------------

*New LDC Publications*

(1) The American National Corpus (ANC) project fosters the development
of a corpus comparable to the British National Corpus (BNC), covering
American English. Corpus-analytic work has demonstrated that the BNC is
inappropriate for the study of American English, because American and
British usage differ in numerous ways.

The availability of a corpus of American English will significantly
contribute to language and linguistic research, the development of
language understanding computer applications (e.g., language translation
and search and retrieval software), and the compilation of reference
works such as dictionaries and thesauri. It will also provide a rich
national resource for use in education at all levels.

ANC Second Release
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35>
contains over 20 million words: more than 10 million words added in the
Second Release, plus a corrected and validated version of the
11-million-word ANC First Release. The Second Release also contains
software for searching and retrieving multiple stand-off annotations.

ANC Second Release contains texts from the following sources (* denotes
new source in the Second Release):

Transcribed telephone speech (LDC and Project MORE)
New York Times
Berlitz Travel Guides (Langenscheidt Publishers)
Slate Magazine (Microsoft)
ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural
Communication)*
The Michigan Corpus of Academic Spoken English (MICASE) (University of
Michigan, English Language Institute)*
Various non-fiction
Various fiction (Orin Hargraves, Ferd Eggan)*
Various medical research articles (BioMed Central, Public Library of
Science)*
Anonymized Posts to the Phoenix Board/Buffistas.org*

*NOTE:* The cost of the first 50 copies of this publication (not
counting the copies distributed to LDC members) is covered by NSF Grant
Number BCS-998009; these copies are therefore free of charge to
qualified researchers, although a $30 shipping and handling fee applies.
After these first 50 copies are distributed, additional copies will be
available for the nonmember fee of US$75.

(2) The HARD 2004 Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28>
corpus contains source data for the 2004 TREC HARD (High Accuracy
Retrieval from Documents) Evaluation. HARD 2004 was a track within the
NIST Text REtrieval Conference (TREC), with the objective of achieving
high accuracy retrieval from documents by leveraging additional
information about the searcher and/or the search context, through
techniques like passage retrieval and the use of targeted interaction
with the searcher. The topics and annotations that correspond to this
release are distributed as LDC2005T29, HARD 2004 Topics and Annotations.
This corpus was created with support from the DARPA TIDES Program and LDC.

HARD 2004 Text comprises eight English newswire and web text sources
from January-December 2003. The sources are:

AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English
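
For scripts that process the collection, the source codes above can be
kept in a small lookup table. The Python sketch below is illustrative
only: the function name source_name and the assumption that a document
identifier begins with its three-letter source code are hypothetical and
are not taken from the corpus documentation.

```python
# Source codes and names exactly as listed in this announcement.
HARD2004_SOURCES = {
    "AFE": "Agence France Presse - English",
    "APE": "Associated Press Newswire",
    "CNE": "Central News Agency Taiwan - English",
    "LAT": "Los Angeles Times/Washington Post",
    "NYT": "New York Times",
    "SLN": "Salon.com",
    "UME": "Ummah Press - English",
    "XIE": "Xinhua News Agency - English",
}


def source_name(doc_id: str) -> str:
    """Return the source name for a document identifier.

    Assumption (hypothetical): the identifier starts with the
    three-letter source code, e.g. "NYT...".
    """
    code = doc_id[:3].upper()
    if code not in HARD2004_SOURCES:
        raise KeyError(f"unknown source code: {code!r}")
    return HARD2004_SOURCES[code]
```

Check the identifier format against the documentation shipped with the
corpus before relying on the prefix rule.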

(3) The HARD 2004 Topics and Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29>
corpus contains topics and annotations (clarification forms, responses
and relevance assessments) for the 2004 TREC HARD (High Accuracy
Retrieval from Documents) Evaluation. HARD 2004 was a track within the
NIST Text REtrieval Conference (TREC), with the objective of achieving
high accuracy retrieval from documents by leveraging additional
information about the searcher and/or the search context, through
techniques like passage retrieval and the use of targeted interaction
with the searcher. The source data that corresponds to this release is
distributed as LDC2005T28, HARD 2004 Text. This corpus was created with
support from the DARPA TIDES Program and LDC.

Three major annotation tasks are represented in this release: Topic
Creation, Clarification Form Responses, and Relevance Assessment. Topics
include a short title, a query plus context, and a number of limiting
parameters known as "metadata", which include the targeted geographical
region, the target data domain or genre, and the level of searcher
expertise. Clarification Forms are brief HTML questionnaires that system
developers submitted to the LDC searchers who created the topics, in
order to glean additional detail about their information needs.
Relevance assessment consisted of adjudicating pooled system responses;
it included document-level judgments for all topics and passage-level
relevance judgments for a subset of topics.
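
The topic structure described above can be pictured as a small record
type. The following Python sketch is an illustration only: the class and
field names (Topic, TopicMetadata, geography, genre, expertise) and the
helper restricted_fields are hypothetical, and the sketch does not
reflect the actual file format used in the release.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TopicMetadata:
    """Limiting parameters ("metadata"); field names are hypothetical."""
    geography: Optional[str] = None   # targeted geographical region
    genre: Optional[str] = None       # target data domain or genre
    expertise: Optional[str] = None   # level of searcher expertise


@dataclass
class Topic:
    """One topic: short title, query plus context, and metadata."""
    title: str
    query: str
    context: str
    metadata: TopicMetadata = field(default_factory=TopicMetadata)


def restricted_fields(topic: Topic) -> List[str]:
    """List the metadata fields that actually constrain this topic."""
    return [name for name, value in asdict(topic.metadata).items()
            if value is not None]
```

A topic whose metadata sets only a genre would then report a single
restricted field, which a retrieval system could use to filter its
candidate documents.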

The release is divided into training and evaluation resources. The
training set comprises twenty-one topics and 100 document-level
relevance judgments per topic. The evaluation set contains fifty topics,
clarification forms and responses, document-level relevance assessments
for all topics, and passage-level judgments for half of the topics.

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium     Phone: (215) 573-1275
3600 Market Street             Fax:   (215) 573-2175
Suite 810                      ldc@ldc.upenn.edu
Philadelphia, PA 19104         http://www.ldc.upenn.edu
