Last week, I posted a message enquiring about measures of similarity
between two sentences. I would like to thank:
- Christopher Brewster
- Miles Osborne
- Jennifer Spenader
- Alexander Gelbukh
- Kevin McTait
- Patrick Ruch
- Ken Litkowski
- Barb Ball
- Manuel Montes
- Bill Fisher
- Andreas Faatz
for their answers and suggestions. Since a few people expressed
interest in this topic, here is a summary:
In the PhD thesis "Collocational similarity : emergent patterns in
lexical environments" by Paul Richard Hays, 1997, Birmingham, KWIC lines
are compared. Maybe it can be adapted for comparing sentences.
String edit distance can be used as a measure (sentences A and B are
similar to C if A and B can be mapped to C using the same number of
edits), but one could easily imagine another set of editing operations.
The choice of measure depends very much on the application it is used for.
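As an illustration of the string edit distance idea, here is a minimal
sketch in Python that counts word-level insertions, deletions and
substitutions between two sentences. The function names, the whitespace
tokenisation and the lowercasing are assumptions made for this example;
none of them comes from the replies summarised here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def sentence_distance(s1, s2):
    """Word-level edit distance; tokenisation here is a plain split (an assumption)."""
    return edit_distance(s1.lower().split(), s2.lower().split())


if __name__ == "__main__":
    print(sentence_distance("the cat sat on the mat", "the dog sat on a mat"))
```

Because edit_distance accepts any sequence, the same function works on
characters, word tokens or tag sequences; only the tokenisation step
changes.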
Papers which could help are:
- Shieber, Stuart (1993). The Problem of Logical-Form Equivalence,
Computational Linguistics, Vol 19, No. 1
- Spenader, Jennifer (2000). Defining Propositional Similarity:
Systemizing Identification of Presuppositional Binding. Proceedings of
Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of
Dialogue, Göteborg University 15-17 June 2000.
- Emmanuel Planas, MT Summit VII: 'Formalizing Translation Memory'
- Manuel Montes-y-Gómez, Alexander Gelbukh, Aurelio López-López.
Comparison of Conceptual Graphs. Proc. MICAI-2000, 1st Mexican
International Conference on Artificial Intelligence, Acapulco, Mexico,
April 2000. In: O. Cairo, L.E. Sucar, F.J. Cantu (eds.) MICAI 2000:
Advances in Artificial Intelligence. Lecture Notes in Artificial
Intelligence N 1793, ISSN 0302-9743, ISBN
3-540-67354-7, Springer, pp. 548-556
- Kenneth C. Litkowski, 1999, Towards a Meaning-Full Comparison of
Lexical Resources, Proceedings of the Association for Computational
Linguistics Special Interest Group on the Lexicon, June 21-22, College
- Andreas Faatz, Designing clustering methods for ontology building: The
Mo K workbench
A distance metric can be useful on different levels, and it can be
applied to any material (tokens, part-of-speech, word-sense). A good
introduction, theoretical, practical and didactic, can be found at:
Some (unix-like) C code can be downloaded here:
ThemeScape software might be useful. It scans entire documents in search
of similarity. They're at www.cartia.com.
You can download from the NIST site
(http://www.nist.gov/speech/tools/index.htm) some software called
"aldistsm-1.2.tar.Z" which computes an alignment (edit) distance between
two sentences, where the basic editing operations are changes in
phonological features, including splits and merges on the word level.
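The idea of editing operations with non-uniform costs can be sketched as
follows. This is not the algorithm of the NIST tool: it handles neither
phonological features nor splits and merges. It is only a hypothetical
Python example in which the substitution cost is supplied by the caller.
The function names and the toy cost function (half cost for words
sharing their first four characters) are assumptions made purely for
illustration.

```python
def weighted_edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Edit distance where substituting x by y costs sub_cost(x, y)."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,         # delete x
                curr[j - 1] + indel_cost,     # insert y
                prev[j - 1] + sub_cost(x, y)  # substitute x by y (0 if equal)
            ))
        prev = curr
    return prev[-1]


def prefix_tolerant_cost(x, y):
    """Toy cost (an assumption): words sharing a 4-character prefix are 'close'."""
    if x == y:
        return 0.0
    if x[:4] == y[:4]:
        return 0.5
    return 1.0


if __name__ == "__main__":
    s1 = "he walks home".split()
    s2 = "he walked home".split()
    print(weighted_edit_distance(s1, s2, prefix_tolerant_cost))
```

Replacing prefix_tolerant_cost with a function based on part-of-speech
tags or word senses would give a distance on those levels instead.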
Computational Linguistics Group
University of Wolverhampton