Corpora: Summary: Measures for similarity between two sentences

From: Constantin Orasan (
Date: Mon Nov 20 2000 - 16:43:46 MET


    Last week, I posted a message enquiring about measures of similarity
    between two sentences. I would like to thank:
    - Christopher Brewster
    - Miles Osborne
    - Jennifer Spenader
    - Alexander Gelbukh
    - Kevin McTait
    - Patrick Ruch
    - Ken Litkowski
    - Barb Ball
    - Manuel Montes
    - Bill Fisher
    - Andreas Faatz
    for their answers and suggestions. Given that a few people expressed
    interest in this topic, here is a summary:


    In the PhD thesis "Collocational similarity : emergent patterns in
    lexical environments" by Paul Richard Hays, 1997, Birmingham, KWIC lines
    are compared. Maybe it can be adapted for comparing sentences.


    String edit distance can be used as a measure (sentences A and B are
    equally similar to C if A and B can be mapped to C using the same number
    of edits), but one could easily imagine another set of editing
    operations. The choice of measure depends very much on the application
    for which it is used.
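
    As an illustration only, here is a minimal Python sketch of a
    word-level edit distance between two sentences. It does not come from
    any of the replies summarised here: the function names, the unit costs
    and the whitespace tokenisation are all assumptions made for the
    example. The cost parameters are one simple way to vary the set of
    editing operations.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Levenshtein distance between two token sequences.

    The costs are illustrative defaults (an assumption of this sketch);
    changing them amounts to choosing a different set of edit operations.
    """
    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, tok_a in enumerate(a, start=1):
        # curr[0] = cost of deleting the first i tokens of a
        curr = [i * del_cost]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else sub_cost
            curr.append(min(prev[j] + del_cost,       # delete tok_a
                            curr[j - 1] + ins_cost,   # insert tok_b
                            prev[j - 1] + cost))      # match or substitute
        prev = curr
    return prev[-1]


def sentence_distance(s1, s2):
    """Edit distance over lower-cased, whitespace-separated tokens."""
    return edit_distance(s1.lower().split(), s2.lower().split())


if __name__ == "__main__":
    print(sentence_distance("the cat sat on the mat",
                            "the dog sat on a mat"))
```

    The same function works on any sequence of comparable items, so the
    tokens could be replaced by part-of-speech tags or word senses
    without changing the code.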


    Papers which could help are:
    - Shieber, Stuart (1993). The Problem of Logical-Form Equivalence,
    Computational Linguistics, Vol 19, No. 1
    - Spenader, Jennifer (2000). Defining Propositional Similarity:
    Systemizing Identification of Presuppositional Binding. Proceedings of
    Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of
    Dialogue, Göteborg University 15-17 June 2000.
    - Emmanuel Planas, MT Summit VII: 'Formalizing Translation Memory'
    - Manuel Montes-y-Gómez, Alexander Gelbukh, Aurelio López-López.
    Comparison of Conceptual Graphs. Proc. MICAI-2000, 1st Mexican
    International Conference on Artificial Intelligence, Acapulco, Mexico,
    April 2000. In: O. Cairo, L.E. Sucar, F.J. Cantu (eds.) MICAI 2000:
    Advances in Artificial Intelligence. Lecture Notes in Artificial
    Intelligence N 1793, ISSN 0302-9743, ISBN
    3-540-67354-7, Springer, pp. 548-556
    - Kenneth C. Litkowski, 1999, Towards a Meaning-Full Comparison of
    Lexical Resources, Proceedings of the Association for Computational
    Linguistics Special Interest Group on the Lexicon, June 21-22, College
    Park, MD
    - Andreas Faatz, Designing clustering methods for ontology building: The
    Mo K workbench

    Distance metrics can be useful at different levels and can be applied
    to any material (tokens, parts of speech, word senses). A good
    introduction, theoretical, practical and didactic, can be found at:

    Some (Unix-like) C code can be downloaded here:


    ThemeScape software might be useful. It scans entire documents in search
    of similarity. They're at


    From the NIST site you can download some software called
    "aldistsm-1.2.tar.Z", which computes an alignment (edit) distance
    between two sentences, where the basic editing operations are changes
    in phonological features, including splits and merges on the word level.

    Constantin Orasan
    Computational Linguistics Group
    University of Wolverhampton

    This archive was generated by hypermail 2b29 : Mon Nov 20 2000 - 16:45:16 MET