[Corpora-List] SemEval-2007 -- Task #11: English Lexical Sample Task via English-Chinese Parallel Text

From: Ng Hwee Tou (dcsnght@nus.edu.sg)
Date: Sat Nov 18 2006 - 17:53:31 MET

    Task #11: English Lexical Sample Task via English-Chinese Parallel Text

     

    Updated on Nov 15, 2006 (** NEW **)

     

    Call for Interest in Participation

     

    http://www.comp.nus.edu.sg/~chanys/SemEval-2007.htm

    http://nlp.cs.swarthmore.edu/semeval/interest.shtml

     

    Feedback requested by Dec 1, 2006

     

     

    Organizers

     

    Hwee Tou Ng and Yee Seng Chan

    National University of Singapore

     

    Summary

     

    We propose an English lexical sample task for word sense

    disambiguation (WSD), where the sense-annotated examples are

    (semi)-automatically gathered from word-aligned English-Chinese

    parallel texts. After appropriate Chinese translations are assigned to

    each sense of an English word, the English side of the parallel texts

    can serve as training data, since its word occurrences are considered

    to have been disambiguated and "sense-tagged" by the appropriate

    Chinese translations.

     

    For more details, please refer to the full description for this task

    and the references given.

     

    Full Description

     

    First, English-Chinese parallel texts are automatically

    word-aligned. Then the correct Chinese translations corresponding to

    the different WordNet 1.7.1 senses of an English word are manually

    selected. Finally, the English half of the parallel texts (each

    ambiguous English word and its three-sentence context) is used as the

    training and test material to set up an English lexical sample task.
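
    The three steps above can be sketched roughly as follows. This is a
    toy illustration under stated assumptions, not the organizers'
    pipeline: the function name gather_examples, the sense labels, the
    Chinese translations, the alignment format (pairs of token indices),
    and the choice of previous/current/next sentence as the three-sentence
    context are all hypothetical.

```python
# Toy illustration only: the sense labels, translations, and data layout
# below are hypothetical, not the task's actual inventory or file format.

def gather_examples(target, sentence_pairs, sense_translations):
    """Collect (sense, context) pairs for the English word `target`.

    sentence_pairs: list of (en_tokens, zh_tokens, alignment), where
        alignment is a set of (en_index, zh_index) word-alignment links.
    sense_translations: dict mapping each sense label to the set of
        Chinese translations manually selected for that sense.
    """
    # Invert the manual sense -> translations table.
    senses_for = {}
    for sense, translations in sense_translations.items():
        for zh_word in translations:
            senses_for.setdefault(zh_word, set()).add(sense)

    examples = []
    for n, (en, zh, alignment) in enumerate(sentence_pairs):
        for i, token in enumerate(en):
            if token.lower() != target:
                continue
            # Chinese words aligned to this occurrence of the target.
            aligned = {zh[j] for (a, j) in alignment if a == i}
            senses = set()
            for zh_word in aligned:
                senses |= senses_for.get(zh_word, set())
            if len(senses) != 1:
                continue  # unaligned, unknown translation, or ambiguous
            # Assumed context: previous, current, and next English sentence.
            window = sentence_pairs[max(0, n - 1):n + 2]
            context = " ".join(" ".join(pair[0]) for pair in window)
            examples.append((senses.pop(), context))
    return examples


# Hypothetical sense inventory and word-aligned sentence pairs.
senses = {
    "channel_tv": {"频道"},
    "channel_water": {"海峡", "水道"},
}
pairs = [
    (["The", "channel", "broadcasts", "news", "."],
     ["这个", "频道", "播放", "新闻", "。"], {(1, 1)}),
    (["Ships", "cross", "the", "channel", "."],
     ["船只", "穿过", "海峡", "。"], {(3, 2)}),
    (["A", "new", "channel", "opened", "."],
     ["新", "渠道", "开通", "了", "。"], {(2, 1)}),
]
examples = gather_examples("channel", pairs, senses)
```

    In this sketch the third occurrence is skipped because its aligned
    Chinese word was not assigned to any sense, so only occurrences with
    exactly one matching sense become examples.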

     

    Since more than one English word sense may be translated by the same

    Chinese word, two or more English senses s1, s2, ..., sk may be

    collapsed into one sense in such cases. This gives rise to a lumped

    sense (coarser-grained) evaluation.
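
    One way to picture the lumping is as merging senses that share a
    Chinese translation. The sketch below is a minimal illustration, not
    the organizers' procedure: the function name lump_senses and the toy
    sense table are hypothetical, and merging transitively (s1 with s3
    when both share a translation with s2) is an assumption.

```python
def lump_senses(sense_translations):
    """Map each sense to a lumped-sense label.

    sense_translations: dict mapping a sense label to the set of Chinese
    translations selected for it. Senses sharing a translation are
    merged (transitively, by assumption) using a small union-find.
    """
    parent = {s: s for s in sense_translations}

    def find(s):
        # Follow parent links to the representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    first_sense_for = {}  # translation -> first sense seen with it
    for sense, translations in sense_translations.items():
        for zh_word in translations:
            if zh_word in first_sense_for:
                root_a = find(first_sense_for[zh_word])
                root_b = find(sense)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_sense_for[zh_word] = sense

    groups = {}
    for s in sense_translations:
        groups.setdefault(find(s), []).append(s)
    return {
        s: "+".join(sorted(members))
        for members in groups.values()
        for s in members
    }


# Hypothetical table: s1 and s2 share translation "A", s3 stands alone.
lumped = lump_senses({"s1": {"A"}, "s2": {"A", "B"}, "s3": {"C"}})
```

    Here s1 and s2 receive the same lumped label while s3 keeps its own,
    so a system is scored on two coarse senses rather than three.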

     

    Our past work shows that acquiring training examples in this way can

    yield sense-tagged data of high quality (at least as good as the

    sense-tagged data for nouns collected in the Senseval-3 English

    lexical sample task).

     

    This proposed task is thus similar to the multilingual lexical sample

    task in Senseval-3, except that the training and test examples are

    collected without manually annotating each individual ambiguous word

    occurrence.

     

    Datasets and Formats (** NEW **)

     

    We have two tracks for this task, each track using a different

    corpus. The first corpus is the following English-Chinese parallel

    corpus available from the Linguistic Data Consortium (LDC):

     

    LDC2005T10 Chinese English News Magazine Parallel Text

     

    It will be used for the evaluation of 50 English words (25 nouns and

    25 adjectives). Participants in this track will need access to the

    above LDC corpus in order to use the track's training and test

    material. Institutions that are LDC members can obtain the corpus for

    US$150; institutions that are not LDC members can obtain it for

    US$2,000.

     

    Since not all interested participants may have access to the above LDC

    corpus, the second track of this task will make use of English-Chinese

    documents gathered from the URL pairs given by the STRAND Bilingual

    Databases. STRAND is a system that acquires document pairs in parallel

    translation automatically from the Web. We will be using this corpus

    for the evaluation of 40 English words (20 nouns and 20 adjectives).

     

    Participants in this task can choose to participate in one or both

    tracks.

     

    Evaluation

     

    The scorer will be the standard Senseval scorer.

     

    Download area

     

    This section will contain evaluation software, useful scripts,

    complementary materials, baseline systems, etc. but not the datasets

    proper. The datasets will be available at the main site for download.

     

    Systems and Results

     

    This section will be completed after the competition.

     

    References

     

    Chan, Yee Seng & Ng, Hwee Tou (2005). Scaling Up Word Sense

    Disambiguation via Parallel Texts. Proceedings of the 20th National

    Conference on Artificial Intelligence (AAAI 2005) (pp. 1037-1042).

    Pittsburgh, Pennsylvania, USA.

     

    Ng, Hwee Tou, Wang, Bin & Chan, Yee Seng (2003). Exploiting

    Parallel Texts for Word Sense Disambiguation: An Empirical

    Study. Proceedings of the 41st Annual Meeting of the Association for

    Computational Linguistics (ACL-03) (pp. 455-462). Sapporo, Japan.

     

    Resnik, Philip & Smith, Noah A. (2003). The Web as a Parallel

    Corpus. Computational Linguistics, Volume 29, Issue 3 (pp. 349-380).


