Corpora: Summary: Measures for similarity between two sentences

From: Constantin Orasan (in6093@wlv.ac.uk)
Date: Mon Nov 20 2000 - 16:43:46 MET

    Last week, I posted a message enquiring about measures of similarity
    between two sentences. I would like to thank:
    - Christopher Brewster
    - Miles Osborne
    - Jennifer Spenader
    - Alexander Gelbukh
    - Kevin McTait
    - Patrick Ruch
    - Ken Litkowski
    - Barb Ball
    - Manuel Montes
    - Bill Fisher
    - Andreas Faatz
    for their answers and suggestions. Given that a few people expressed
    interest in this topic, here is a summary:

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    In the PhD thesis "Collocational similarity: emergent patterns in
    lexical environments" by Paul Richard Hays (1997, Birmingham), KWIC lines
    are compared. It could perhaps be adapted for comparing sentences.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    String edit distance can be used as a measure (sentences A and B are
    equally similar to C if A and B can be mapped to C using the same
    number of edits), but one could easily imagine other sets of editing
    operations. The choice of measure depends heavily on the application
    for which it is used.
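
    As an illustration only (not code supplied by any of the respondents),
    here is a minimal Python sketch of edit distance applied to sentences.
    It assumes whitespace tokenisation, lowercasing, and unit cost for each
    insertion, deletion and substitution of a word; the function names are
    hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, unit cost per edit."""
    # prev[j] holds the distance between the prefix of `a` processed so far
    # and the first j items of `b`.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i items of `a` into nothing takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


def sentence_distance(s1, s2):
    """Word-level edit distance (assumption: whitespace tokens, lowercased)."""
    return edit_distance(s1.lower().split(), s2.lower().split())


if __name__ == "__main__":
    print(sentence_distance("the cat sat on the mat", "the dog sat on a mat"))
```

    Because the function works on any sequence, passing two plain strings
    gives the character-level distance instead. Other editing operations
    (for example transpositions) would need extra cases in the minimum.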

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Papers which could help are:
    - Shieber, Stuart (1993). The Problem of Logical-Form Equivalence,
    Computational Linguistics, Vol 19, No. 1
    - Spenader, Jennifer (2000). Defining Propositional Similarity:
    Systemizing Identification of Presuppositional Binding. Proceedings of
    Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of
    Dialogue, Göteborg University 15-17 June 2000.
    - Emmanuel Planas, MT Summit VII: 'Formalizing Translation Memory'
    - Manuel Montes-y-Gómez, Alexander Gelbukh, Aurelio López-López.
    Comparison of Conceptual Graphs. Proc. MICAI-2000, 1st Mexican
    International Conference on Artificial Intelligence, Acapulco, Mexico,
    April 2000. In: O. Cairo, L.E. Sucar, F.J. Cantu (eds.) MICAI 2000:
    Advances in Artificial Intelligence. Lecture Notes in Artificial
    Intelligence N 1793, ISSN 0302-9743, ISBN
    3-540-67354-7, Springer, pp. 548-556
    - Kenneth C. Litkowski, 1999, Towards a Meaning-Full Comparison of
    Lexical Resources, Proceedings of the Association for Computational
    Linguistics Special Interest Group on the Lexicon, June 21-22, College
    Park, MD
    - Andreas Faatz, Designing clustering methods for ontology building: The
    Mo'K workbench

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Distance metrics can be useful at different levels and can be applied
    to any kind of material (tokens, parts of speech, word senses). A good
    introduction, theoretical, practical and didactic, can be found at:
    http://www-igm.univ-mlv.fr/~lecroq/seqcomp/index.html

    Some (Unix-like) C code can be downloaded here:
    http://odur.let.rug.nl/~kleiweg/levenshtein/
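
    To illustrate how one metric can be applied at different levels, the
    following Python sketch computes a normalised similarity over token
    sequences and over part-of-speech sequences. It is unrelated to the
    code at the URLs above; the tags are assigned by hand and the helper
    names are hypothetical.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# The same two sentences described at two levels (tags assigned by hand).
tokens_a = ["the", "cat", "sleeps"]
tokens_b = ["a", "dog", "sleeps"]
tags_a = ["DET", "NOUN", "VERB"]
tags_b = ["DET", "NOUN", "VERB"]

if __name__ == "__main__":
    print("token level:", similarity(tokens_a, tokens_b))
    print("POS level:  ", similarity(tags_a, tags_b))
```

    The two sentences share only one token but have the same tag sequence,
    so they score low at the token level and 1.0 at the part-of-speech
    level. Which level is appropriate depends on the application.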

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    ThemeScape software might be useful. It scans entire documents in search
    of similarity. They're at www.cartia.com.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    You can download from the NIST site
    (http://www.nist.gov/speech/tools/index.htm) some software called
    "aldistsm-1.2.tar.Z" which computes an alignment (edit) distance between
    two sentences, where the basic editing operations are changes in
    phonological features, including splits and merges on the word level.
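
    The general idea of an alignment distance whose substitution cost
    depends on feature differences can be sketched as follows. This is not
    the NIST tool's algorithm: the feature table is a toy, hypothetical
    one, the cost function (Jaccard distance between feature sets) is an
    assumption, and splits and merges are not modelled.

```python
# Toy, hypothetical feature table; a real system would use a full inventory.
FEATURES = {
    "p": {"labial", "stop"},
    "b": {"labial", "stop", "voiced"},
    "t": {"alveolar", "stop"},
    "d": {"alveolar", "stop", "voiced"},
}


def feature_cost(x, y):
    """Substitution cost in [0, 1]: share of features the symbols do not share."""
    fx = FEATURES.get(x, {x})  # unknown symbols only match themselves
    fy = FEATURES.get(y, {y})
    return len(fx ^ fy) / len(fx | fy)


def alignment_distance(a, b, sub_cost=feature_cost, indel_cost=1.0):
    """Weighted edit distance: cheapest alignment of sequence a with b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                        # deletion
                d[i][j - 1] + indel_cost,                        # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution
            )
    return d[-1][-1]


if __name__ == "__main__":
    print(alignment_distance("pat", "bat"))  # p/b differ only in voicing
    print(alignment_distance("pat", "dat"))  # p/d differ in place and voicing
```

    With these weights, replacing a symbol by a closely related one costs
    less than replacing it by an unrelated one, so "pat" is closer to
    "bat" than to "dat".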

    ==============================
    Constantin Orasan
    Computational Linguistics Group
    University of Wolverhampton
    http://www.wlv.ac.uk/~in6093


