Re: Corpora: Summary: Measures for similarity between two sentences

From: Tom Vanallemeersch
Date: Mon Nov 20 2000 - 18:24:12 MET

    Sorry for the late reply (I just got back from vacation). I developed
    functionality in Emacs for comparing two sentences, more specifically
    translations. It detects strings common to the two sentences as well as
    minor variants (spelling variants, etc.). The common parts are visualized
    using colors, and two scores are computed: a similarity score and a score
    expressing the effort needed to turn the first sentence into the second
    one.
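
    To illustrate the detection of common strings, here is a minimal sketch
    in Python. It is not the Emacs code described above; it relies on the
    standard difflib module, and the function name common_strings and the
    min_len threshold are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def common_strings(first, second, min_len=3):
    """Return (start_in_first, start_in_second, text) for each shared substring.

    Hypothetical helper: character-level matching, ignoring blocks
    shorter than min_len characters.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (m.a, m.b, first[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


if __name__ == "__main__":
    for block in common_strings("colour chart", "color chart"):
        print(block)
```

    The gaps left between the returned blocks are the places where the two
    sentences differ, e.g. a spelling variant such as "colour" / "color".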

    The picture below shows a few sentences from the CRATER corpus, i.e. two
    raw sentences and the corresponding tagged sentences. Each example is
    delimited by a dashed line. Common strings are highlighted in grey, small
    differences are underlined, and common strings that occur in a different
    order in the two sentences start with a green block. Below each pair of
    compared sentences is a line with the similarity score and the effort
    score. The effort score is calculated from the number of deletions,
    insertions, and common parts that occur in a different order in the two
    sentences. The higher the effort score, the more effort is needed. The
    similarity score depends on the effort score and the length of the
    sentences. A similarity score of 1 indicates that the sentences are
    identical. In the case of the tagged sentences, the tags are treated as
    part of the text (i.e. they are not recognized as tags). A higher
    correspondence between tags therefore produces a higher similarity score.
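
    As a rough illustration of how two such scores could be related, here is
    a sketch in Python. The exact formulas of the Emacs functionality are not
    given in this message, so everything below is an assumption: the
    word-level comparison, the cost of 1 per deletion, insertion or moved
    word, and the normalization by the combined sentence length. The function
    name effort_and_similarity is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def effort_and_similarity(first, second):
    """Return (effort, similarity) for two sentences, compared word by word.

    Assumed scoring, not the original tool's formula:
    each deleted, inserted or moved word costs 1.
    """
    a, b = first.split(), second.split()
    deleted, inserted = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(b[j1:j2])
    # Words that are unaligned on both sides are treated as common
    # parts occurring in a different order ("moved").
    moved = sum((Counter(deleted) & Counter(inserted)).values())
    effort = (len(deleted) - moved) + (len(inserted) - moved) + moved
    total = len(a) + len(b)
    # Similarity depends on effort and sentence length; 1.0 means identical.
    similarity = 1.0 if total == 0 else 1.0 - effort / total
    return effort, similarity


if __name__ == "__main__":
    print(effort_and_similarity("the cat sat", "the cat sat"))  # (0, 1.0)
```

    Under these assumptions the effort never exceeds the combined length of
    the two sentences, so the similarity stays between 0 and 1.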

    Hope this helps,


    LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
    Phone: ++32 16 405140
    Fax: ++32 16 404961

    [From Admin Corpora list:

    Picture at: ]
