Re: Corpora: Summary: Measures for similarity between two sentences

From: Tom Vanallemeersch (Tom.Vanallemeersch@lant.be)
Date: Mon Nov 20 2000 - 18:24:12 MET

Next message: Ivana Kruijff-Korbayova: "Corpora: CFP: "Information Stucture, Discourse Structure and Discourse Semantics" -Workshop at ESSLLI 2001"

Previous message: Robert Luk: "Corpora: Re: Please help distribute the following CFP to your community"
In reply to: Constantin Orasan: "Corpora: Summary: Measures for similarity between two sentences"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Sorry for this late reply (I just got back from vacation). I developed
functionality in Emacs for comparing two sentences, more specifically
translations. It detects common strings in two sentences and small
variants (spelling variants etc.). The common parts are visualized using
colors, and two scores are computed, a similarity score and a score
expressing the effort needed to modify the first sentence into the
second one.

The picture below shows a few sentences from the CRATER corpus, namely two
raw sentences and the corresponding tagged sentences. Each example is
delimited by a dashed line. Common strings are highlighted in grey, small
differences are underlined, and common strings that appear in a different
order in the two sentences start with a green block. Below each pair of
compared sentences is a line with the similarity score and the effort
score. The effort score is calculated from the number of deletions,
insertions, and common parts that appear in a different order in the two
sentences. The higher the effort score, the more effort is needed. The
similarity score depends on the effort score and the length of the
sentences. A similarity score of 1 indicates that the sentences are
identical. In the case of the tagged sentences, the tags are treated as
part of the text (i.e. they are not recognized as tags). A higher
correspondence between tags therefore produces a higher similarity score.
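
To make the relation between the two scores concrete, here is a rough
sketch in Python. It is not the Emacs code and does not reproduce its
exact formulas, which are not spelled out here. It assumes word-level
comparison with the standard difflib module, an effort score that counts
one unit per deleted or inserted word and one unit per moved block, and a
similarity score of 1 - effort / (total number of words). The function
name effort_and_similarity is made up for the illustration.

```python
from difflib import SequenceMatcher


def effort_and_similarity(first, second):
    """Return (effort, similarity) for two sentences.

    Illustrative assumptions: words are split on whitespace, each deleted
    or inserted word costs 1, and a block of words that is deleted in one
    place and inserted verbatim in another counts as a single move.
    """
    a, b = first.split(), second.split()
    deleted, inserted = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" is treated as a deletion plus an insertion.
        if tag in ("delete", "replace"):
            deleted.append(tuple(a[i1:i2]))
        if tag in ("insert", "replace"):
            inserted.append(tuple(b[j1:j2]))

    # Pair up identical deleted/inserted runs: these are common parts
    # that occur in a different order in the two sentences.
    moves = 0
    for run in list(deleted):
        if run in inserted:
            deleted.remove(run)
            inserted.remove(run)
            moves += 1

    effort = sum(len(r) for r in deleted) + sum(len(r) for r in inserted) + moves
    total = len(a) + len(b)
    # Identical (or both empty) sentences get a similarity of exactly 1.
    similarity = 1.0 if total == 0 else 1.0 - effort / total
    return effort, similarity
```

With these assumptions, identical sentences give an effort of 0 and a
similarity of 1, and sentences with no words in common give a similarity
of 0. Spelling variants are not handled; each word either matches exactly
or not at all.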

Hope this helps,

Tom.

--
LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
mailto:Tom.Vanallemeersch@lant.be    Phone: ++32 16 405140
http://www.lant.be/                  Fax: ++32 16 404961

[From Admin Corpora list:

Picture at: http://www.hit.uib.no/corpora/compsent.gif ]

This archive was generated by hypermail 2b29 : Tue Nov 21 2000 - 09:51:56 MET