[Corpora-List] Trial Dataset available for Task4 at SemEval-2007

From: Roxana Girju (rox_fox11@yahoo.com)
Date: Wed Jan 03 2007 - 18:05:59 MET


    [Apologies for duplications]
    [Please distribute widely]

    The trial dataset for SemEval-2007 Task 4
    (Classification of Semantic Relations between
    Nominals) is now available. Interested participants
    can access the three main files from the following
    websites:

    (1) Guide for Use of Trial Dataset,
    http://docs.google.com/View?docID=w.df735kg3_5ckpbv6

    (2) Relation 7: Content-Container,
    http://docs.google.com/View?docID=w.df735kg3_3gnrv95

    (3) Relation 7: Training Data,
    http://docs.google.com/View?docID=w.df735kg3_8gt4b4c

    Participants and prospective participants in Task 4
    are invited to post their comments, questions, and
    contributions on the Semantic Relations group at
    http://groups.google.com/group/semanticrelations.

    The task description is listed below and can also be
    found at
    http://nlp.cs.swarthmore.edu/semeval/tasks/task4/description.shtml.

    The Task 4 organizers

    ---------------------

    SemEval-2007
    Task #4: Classification of Semantic Relations between
    Nominals

    Roxana Girju (1), Marti Hearst (2), Preslav Nakov (3),
    Vivi Nastase (4), Stan Szpakowicz (5), Peter Turney (6),
    Deniz Yuret (7)

    July 28, 2006

    1. Department of Linguistics, University of Illinois
    at Urbana-Champaign, girju@cs.uiuc.edu
    2. School of Information, University of California,
    Berkeley, hearst@sims.berkeley.edu
    3. Department of Electrical Engineering and Computer
    Science, University of California, Berkeley,
    nakov@cs.berkeley.edu
    4. School of Information Technology and Engineering,
    University of Ottawa, vnastase@site.uottawa.ca
    5. School of Information Technology and Engineering,
    University of Ottawa, szpak@site.uottawa.ca
    6. Institute for Information Technology, National
    Research Council of Canada,
    peter.turney@nrc-cnrc.gc.ca
    7. Department of Computer Engineering, Koc University,
    dyuret@ku.edu.tr

    1. Description of the Task

    There is growing interest in the task of classifying
    semantic relations between pairs of words. However,
    many different classification schemes have been used,
    which makes it difficult to compare the various
    classification algorithms. We will create a benchmark
    dataset and evaluation task that will enable
    researchers to compare their algorithms.

    Rosario and Hearst (2001) classify noun-compounds from
    the medical domain, using a set of 13 classes that
    describe the semantic relation between the head noun
    and the modifier in a given noun-compound. Rosario et
    al. (2002) classify noun-compounds using a multi-level
    hierarchy of semantic relations, with 15 classes at
    the top level. Nastase and Szpakowicz (2003) present a
    two-level hierarchy for classifying noun-modifier
    relations in general domain text, with 5 classes at
    the top and 30 classes at the bottom. Their class
    scheme and dataset have been used by other researchers
    (Turney and Littman, 2005; Turney, 2005; Nastase et
    al., 2006). Moldovan et al. (2004) use a 35-class
    scheme to classify relations in noun phrases. The same
    scheme has been applied to noun compounds (Girju et
    al., 2005). Chklovski and Pantel (2004) use a 5-class
    scheme, designed specifically for characterizing
    verb-verb semantic relations. Stephens et al. (2001)
    use a 17-class scheme created for relations between
    genes. Lapata (2002) uses a 2-class scheme for
    classifying relations in nominalizations.

    Algorithms for classifying semantic relations have
    potential applications in Information Retrieval,
    Information Extraction, Summarization, Machine
    Translation, Question Answering, Paraphrasing,
    Recognizing Textual Entailment, Thesaurus
    Construction, Semantic Network Construction, Word
    Sense Disambiguation, and Language Modeling. As the
    techniques for semantic relation classification
    mature, some of these applications are being tested.
    Tatu and Moldovan (2005) applied the 35-class scheme
    of Moldovan et al. (2004) to the PASCAL Recognizing
    Textual Entailment (RTE) challenge, obtaining
    significant improvement in a state-of-the-art
    algorithm.

    There is no consensus on schemes for classifying
    semantic relations, and it seems unlikely that any
    single scheme could be useful for all applications.
    For example, the gene-gene relation scheme of Stephens
    et al. (2001) includes relations such as "X
    phosphorylates Y", which are not very useful for
    general domain text. Even if we focus on general
    domain text, the verb-verb relations of Chklovski and
    Pantel (2004) are unlike the noun-modifier relations
    of Nastase and Szpakowicz (2003) or the noun phrase
    relations of Moldovan et al. (2004).

    We will create a benchmark dataset for evaluating
    semantic relation classification algorithms, embracing
    several different existing classification schemes,
    instead of attempting the daunting chore of creating a
    single unified standard classification scheme. We will
    treat each semantic relation separately, as a single
    two-class (positive/negative) classification task,
    rather than taking a whole N-class scheme of relations
    as an N-class classification task (Nastase and
    Szpakowicz, 2003).
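
    To make this two-class framing concrete, the short
    Python sketch below recasts a handful of labelled
    pairs from a hypothetical N-class scheme as N
    independent positive/negative problems; the pairs and
    relation names are invented for illustration and are
    not drawn from the task data.

    from collections import namedtuple

    # A labelled pair from a hypothetical N-class relation scheme.
    Example = namedtuple("Example", ["pair", "relation"])

    examples = [
        Example(("silver", "ship"), "content"),
        Example(("climate", "forest"), "location"),
        Example(("knife", "cake"), "instrument"),
    ]

    relations = {ex.relation for ex in examples}

    # One binary dataset per relation: an example is positive for
    # relation r if it was labelled r, and negative otherwise.
    binary_datasets = {
        r: [(ex.pair, ex.relation == r) for ex in examples]
        for r in relations
    }

    for relation, data in sorted(binary_datasets.items()):
        print(relation, data)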

    To constrain the scope of the task, we have chosen a
    specific application for semantic relation
    classification, relational search (Cafarella et al.,
    2006). We describe this application in Section 2. The
    application we envision is a kind of search engine
    that can answer queries such as "list all X such that
    X causes asthma" (Girju, 2001). Given this
    application, we have decided to focus on semantic
    relations between nominals (i.e., nouns and base noun
    phrases, excluding named entities).

    The dataset for the task will consist of annotated
    sentences. We will select a sample of relation classes
    from several different classification schemes and then
    gather sentences from the Web using a search engine.
    We will manually mark up the sentences, indicating the
    nominals and their relations. Algorithms will be
    evaluated by their average classification performance
    over all of the sampled relations, but we will also be
    able to see whether some relations are more difficult
    to classify than others, and whether some algorithms
    are best suited for certain types of relations.

    2. Application Example: Relational Search

    For some of the tasks that we mention in Section 1, it
    might be argued that semantic relation classification
    plays a supporting role, rather than a central role.
    We believe that semantic relation classification is
    central in relational search. Cafarella et al. (2006)
    describe four types of relational search tasks.
    Although they focus on relations between named
    entities, the same kinds of tasks would be interesting
    for nominals. For example, consider the task of making
    a list of things that have a given relation with some
    constant thing:

        * list all X such that X causes cancer
        * list all X such that X is part of an automobile
    engine
        * list all X such that X is material for making a
    ship's hull
        * list all X such that X is a type of
    transportation
        * list all X such that X is produced from cork
    trees

    For these kinds of relational search tasks, we do not
    need a complete, exhaustive, non-overlapping set of
    classes of semantic relations. Each class, such as X
    causes Y, can be treated as a single binary
    classification problem. Any algorithm that performs
    well on the dataset (Section 4) and task (Section 5)
    described here should be directly applicable to
    relational search applications.
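
    To picture the queries listed above a little more
    concretely, here is a toy sketch of one way such a
    relational query could be represented internally; the
    triple-like structure and the relation names are our
    own illustration, not something taken from Cafarella
    et al. (2006).

    from typing import NamedTuple, Optional

    class RelationalQuery(NamedTuple):
        relation: str        # e.g. "cause"
        x: Optional[str]     # None means "list all X"
        y: Optional[str]     # the fixed argument, if any

    # "list all X such that X causes cancer"
    q1 = RelationalQuery(relation="cause", x=None, y="cancer")
    # "list all X such that X is part of an automobile engine"
    q2 = RelationalQuery(relation="part-whole", x=None, y="automobile engine")

    print(q1)
    print(q2)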

    3. Semantic Relations versus Semantic Roles

    We should note that classifying semantic relations
    between pairs of words is different in several ways
    from automatic labeling of semantic roles (Gildea and
    Jurafsky, 2002), which was one of the tasks in
    Senseval-3. Semantic roles involve frames with many
    slots, but our focus is on relations between pairs of
    words; semantic roles are centered on verbs and their
    arguments, but semantic relations include pairwise
    relations between all parts of speech (although we
    limit our attention to nominals in this task, to keep
    the task manageable); FrameNet currently contains
    more than 8,900 lexical units, but none of the schemes
    discussed above contain more than 50 classes of
    semantic relations.

    Each slot in a frame might be considered as a binary
    relation, but FrameNet and PropBank do not make a
    consistent effort to assign the same labels to similar
    slots. In FrameNet, for example, the verb "sell"
    ("Commerce_sell") has core slots "buyer", "goods", and
    "seller", whereas the verb "give" ("Giving") has core
    slots "donor", "recipient", and "theme". There is no
    matching of the similar slots, although they have very
    similar semantic relations:

    <sell, buyer> ↔ <give, recipient>

    <sell, goods> ↔ <give, theme>

    <sell, seller> ↔ <give, donor>

    Semantic relation classification schemes generalize
    relations across wide groups of verbs (Chklovski and
    Pantel, 2004) and include relations that are not
    verb-centered (Nastase and Szpakowicz, 2003; Moldovan
    et al., 2004). Using the same labels for similar
    semantic relations facilitates supervised learning.
    For example, a learner that has been trained with
    examples of "sell" relations should be able to
    transfer what it has learned to "give" relations.
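
    As a toy illustration of this point, the sketch below
    maps frame-specific slot names onto shared relation
    labels; the labels "agent", "recipient", and "theme"
    are our own choices here, not labels taken from
    FrameNet or from this task.

    # Map (frame verb, slot) pairs onto shared relation labels, so that
    # corresponding slots of "sell" and "give" carry the same label.
    slot_to_relation = {
        ("sell", "seller"):    "agent",
        ("sell", "buyer"):     "recipient",
        ("sell", "goods"):     "theme",
        ("give", "donor"):     "agent",
        ("give", "recipient"): "recipient",
        ("give", "theme"):     "theme",
    }

    # A learner trained on "sell" examples can transfer what it has
    # learned to "give" examples, because corresponding slots share a label.
    assert slot_to_relation[("sell", "goods")] == slot_to_relation[("give", "theme")]
    assert slot_to_relation[("sell", "buyer")] == slot_to_relation[("give", "recipient")]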

    4. Generating Training and Testing Data

    Nastase and Szpakowicz (2003) manually labeled 600
    noun-modifier pairs. Each pair was assigned one of
    thirty possible labels. To facilitate classification,
    each word in a pair was also labeled with its part of
    speech and its synset number in WordNet. When
    classifying relations in noun phrases, Moldovan et al.
    (2004) provided their annotators with an example of
    each phrase in a sentence. We will include parts of
    speech, synset numbers, and a sample sentence for each
    pair in our training and testing data.

    Consider the noun-modifier pair "silver ship", in
    which the head noun "ship" is modified by the word
    "silver". Using the classification scheme of Nastase
    and Szpakowicz (2003), the semantic relation in this
    pair might be classified as material (the ship is made
    of silver), purpose (the ship was built for carrying
    silver), or content (the ship contains silver). Note
    that parts of speech and WordNet synsets are not
    sufficient to determine which of these three classes
    is intended. If "silver" is labeled "a1" (adjective,
    WordNet sense number 1, "made from or largely
    consisting of silver") and "ship" is labeled "n1"
    (noun, WordNet sense number 1, "a vessel that carries
    passengers or freight"), then the correct class must
    be material. However, if "silver" is labeled "n1"
    (noun, WordNet sense number 1, "a soft white precious
    univalent metallic element"), then the correct class
    could be either purpose or content. We would represent
    this example as follows:

    "The <e1>silver</e1> <e2>ship</e2> usually carried
    silver bullion bars, but sometimes the cargo was gold
    or platinum." WordNet(e1) = "n1", WordNet(e2) = "n1",
    Relation(e1, e2) = "content".
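
    For readers who want to handle such examples
    programmatically, here is a minimal parsing sketch in
    Python, assuming an example is stored as a single
    string in exactly the format shown above; the actual
    trial and training files may use a different layout.

    import re

    raw = ('"The <e1>silver</e1> <e2>ship</e2> usually carried silver '
           'bullion bars, but sometimes the cargo was gold or platinum." '
           'WordNet(e1) = "n1", WordNet(e2) = "n1", '
           'Relation(e1, e2) = "content".')

    def parse_example(text):
        e1 = re.search(r"<e1>(.*?)</e1>", text).group(1)
        e2 = re.search(r"<e2>(.*?)</e2>", text).group(1)
        wordnet = dict(re.findall(r'WordNet\((e[12])\) = "([^"]+)"', text))
        rel = re.search(r'Relation\(e1, e2\) (!?=) "([^"]+)"', text)
        return {
            "e1": e1,
            "e2": e2,
            "wordnet": wordnet,               # e.g. {"e1": "n1", "e2": "n1"}
            "relation": rel.group(2),         # e.g. "content"
            "positive": rel.group(1) == "=",  # "!=" would mark a negative example
        }

    print(parse_example(raw))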

    We will begin by choosing seven relations from several
    different schemes (e.g., Nastase and Szpakowicz, 2003;
    Moldovan et al., 2004). We will focus on relations
    between nominals. For the purposes of this exercise,
    we define a nominal as a noun or base noun phrase,
    excluding named entities. A base noun phrase is a noun
    and its premodifiers (e.g., nouns, adjectives,
    determiners). We do not include complex noun phrases
    (e.g., noun phrases with attached prepositional
    phrases). For example, "lawn" is a noun, "lawn mower"
    is a base noun phrase, and "the engine of the lawn
    mower" is a complex noun phrase. The markup will
    explicitly identify entity boundaries, so the teams
    that attempt this task will not need to worry about
    finding entity boundaries (e.g., in "<e1>macadamia
    nuts</e1> in the <e2>cake</e2>", we can see that the
    first entity is a base noun phrase and the second
    entity is a noun; there will be no need for a chunking
    parser).

    For each of the chosen relations, we will give a
    precise definition of the relation and some
    prototypical examples. The definitions and examples
    will be available to the annotators and will be
    included in the distribution of the training and
    testing data.

    Given a specific relation (e.g., content), we will use
    heuristic patterns to search in a large corpus for
    sentences that illustrate the given relation. For
    example, for the relation content, we may use Google
    to search the Web, using queries such as "contains",
    "holds", "the * in the". For each relation, we will
    use several different search patterns, to ensure a
    wide variety of example sentences. The search patterns
    will be manually constructed, using the approach of
    Hearst (1992).
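
    A rough sketch of this kind of pattern-based querying
    is given below; the patterns are illustrative guesses
    in the spirit of Hearst (1992), not the queries that
    will actually be used to build the dataset.

    # Hypothetical hand-written patterns for the "content" relation, where
    # X is the container and Y the content; "*" is left in place as a
    # wildcard term, as in the phrase query "the * in the".
    content_patterns = [
        '"{X} contains {Y}"',
        '"{X} holds {Y}"',
        '"the {Y} in the {X}"',
    ]

    def make_queries(patterns, x="*", y="*"):
        """Fill each pattern's slots, keeping wildcards where a term is unknown."""
        return [p.format(X=x, Y=y) for p in patterns]

    # Open-ended queries for harvesting candidate sentences from a search engine.
    print(make_queries(content_patterns))
    # Queries anchored on one known container, e.g. "cake".
    print(make_queries(content_patterns, x="cake"))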

    The collected sentences will be given to two
    annotators, who will create positive and negative
    training examples from the sentences. For example,
    here are positive and negative examples for content
    (below, "!=" means "does not equal"):

    "The <e1>macadamia nuts</e1> in the <e2>cake</e2> also
    make it necessary to have a very sharp knife to cut
    through the cake neatly." WordNet(e1) = "n2",
    WordNet(e2) = "n3", Relation(e1, e2) = "content".

    "Summer was over and he knew that the <e1>climate</e1>
    in the <e2>forest</e2> would only get worse."
    WordNet(e1) = "n1", WordNet(e2) = "n1", Relation(e1,
    e2) != "content".

    The negative example above would be classified as
    location in the scheme of Nastase and Szpakowicz
    (2003). The use of heuristic patterns to search for
    positive and negative training examples should
    naturally result in negative examples that are near
    misses. We believe that near misses are more useful
    for supervised learning than negative examples that
    are generated purely randomly.

    Each example will be independently labeled by two
    annotators. When the annotation is completed, the
    annotators will compare their labels and make a note
    of the number of cases in which they agree and
    disagree. If the annotators cannot come to a consensus
    on the correct labels for a particular example, that
    example will not be included in the training and
    testing data, although it will be recorded for
    possible future analysis.
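
    A minimal sketch of tallying that agreement is given
    below; the raw agreement rate follows directly from
    the procedure described above, while Cohen's kappa is
    added as one common chance-corrected measure and is
    our assumption, not something the task prescribes.

    def agreement_stats(labels_a, labels_b):
        """Compare two annotators' binary (positive/negative) labels."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        agree = sum(a == b for a, b in zip(labels_a, labels_b))
        p_observed = agree / n
        # Chance agreement estimated from each annotator's positive rate.
        pa = sum(labels_a) / n
        pb = sum(labels_b) / n
        p_chance = pa * pb + (1 - pa) * (1 - pb)
        kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
        return {"agree": agree, "disagree": n - agree,
                "observed_agreement": p_observed, "kappa": kappa}

    # Toy labels: True = positive example of the relation, False = negative.
    annotator_a = [True, True, False, True, False, False, True, True]
    annotator_b = [True, False, False, True, False, True, True, True]
    print(agreement_stats(annotator_a, annotator_b))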

    This method of generating training and testing data is
    designed with relational search in mind (Section 2). A
    natural approach to relational search is to use
    heuristic patterns (Hearst, 1992) with a conventional
    search engine, and then use supervised learning
    (Moldovan et al., 2004) to filter the resulting noisy
    text.

    We will follow the model of the Senseval-3 English
    Lexical Sample Task, which had about 140 training and
    70 testing samples per word. The following list
    summarizes the main features of our dataset:

        * 7 semantic relations (not exhaustive and
    possibly overlapping)
        * 140 training sentences per relation (7 × 140 =
    980 training sentences)
        * 70 testing sentences per relation (7 × 70 = 490
    testing sentences)
        * 210 combined testing and training sentences per
    relation (7 × 210 = 1,470 sentences)
        * sentence classes will be approximately 50%
    positive and 50% negative (roughly 735 positive and
    735 negative, for a total of 1,470 sentences)
        * several different search patterns will be used
    for each semantic relation, to avoid biasing the
    sample sentences
        * negative examples of a relation will be "near
    misses"

    Since most of the words in the Senseval-3 English
    Lexical Sample Task had more than two senses, we will
    have more samples per class (positive and negative)
    per relation than the average word in that task had
    samples per sense.

    For each relation, one person will retrieve sample
    sentences and two other people will annotate the
    sentences. To encourage debate, the three people will
    be chosen from three different institutions. A
    detailed guide will be prepared, to maximize the
    agreement between annotators.

    5. Evaluation Methodology

    As with the Senseval-3 Lexical Sample tasks, each team
    participating in this task will initially have access
    only to the training data. Later, the teams will have
    access to unlabeled testing data (that is, there will
    be WordNet labels, but no Relation labels). The teams
    will enter their algorithms' guesses for the labels
    for the testing data. When SemEval-1 is over, the
    labels for the testing data will be released to the
    public.

    Algorithms will be allowed to skip examples that they
    cannot classify. An algorithm's score for a given
    relation will be the F score, the harmonic mean of
    precision and recall. Algorithms will be ranked
    according to their average F scores for the chosen set
    of relations. We will also analyze the results to see
    which relations are most difficult to classify. To
    assess the effect of varying quantities of training
    data, we will ask the teams to submit several sets of
    guesses for the labels for the testing data, using
    varying fractions of the training data.
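
    To make the scoring concrete, here is a hedged sketch
    of one reasonable reading of the rules above:
    precision is taken over the examples an algorithm
    chose to answer, recall over all test examples, and F
    is their harmonic mean; the official scorer may define
    these quantities differently. A team's overall score
    would then be the mean of these F scores across the
    chosen relations.

    def f_score(guesses, gold):
        """guesses: example id -> predicted label (skipped ids are absent).
        gold: example id -> true label."""
        answered = len(guesses)
        correct = sum(1 for i, g in guesses.items() if gold.get(i) == g)
        precision = correct / answered if answered else 0.0
        recall = correct / len(gold) if gold else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Toy run for one relation: the system skips example 3 entirely.
    gold   = {1: "content", 2: "not-content", 3: "content", 4: "not-content"}
    system = {1: "content", 2: "content", 4: "not-content"}
    print(round(f_score(system, gold), 3))   # precision 2/3, recall 2/4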

    Some algorithms (e.g., corpus-based algorithms) may
    have no use for WordNet annotations. It might also be
    argued that the WordNet annotation is not practical in
    a real application. Therefore we will ask teams to
    indicate, when they submit their answers, whether
    their algorithms used the WordNet labels. We will
    group the submitted answers into those that used the
    WordNet labels and those that did not, and we will
    rank the answers in each group separately. Teams will
    be allowed to submit both types of answers, if their
    algorithms permit it.

    6. Copyright

    Our collected training and testing data, including all
    annotation, will be released under a Creative Commons
    License.



