Re: Corpora: approximations (bounds) for edit distance

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Mon Dec 03 2001 - 02:34:37 MET

Next message: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"

Previous message: David Lee: "Corpora: New Bookmark site for Corpus-based Linguists"
Maybe in reply to: Computer Researcher: "Corpora: approximations (bounds) for edit distance"
Next in thread: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"
Reply: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi CR,
Hi CR,

Thanks for the clarification. In other words, what you want is a (approx.) lower/upper
bound estimator based on some preprocessed information of 2 input
strings x and y? Whether these are the least upper or larget lower
bounds are still unknown unless proven. The estimator below might
be improved by the information about the knowledge of the length
of the 2 strings. The least combined insert and delete operations
must be at least abs(|x| - |y|) where x and y are the input strings
and |.| is the length of the argument. My guess is that you want
this preprocessing to have a better complexity than O(|x| * |y|) otherwise
one would perform the DP. Another method is to run a left-to-right match
with a context window of size abs(|x| - |y|) and compute the max or min differences
and sum them up as some kind of bounds. However, this requires O(|x| |y|) cost in general.
Another less expensive one is to use a fixed size window (say within 4 characters)
to look for these edit distance bounds and the time complexity is O(|x|), which is
the same if you compute the character occurrences. You can also increase the
window size to get better and better estimates of the bounds.
My other guess is that people working on DNA sequence matching might have
worked on this problem. That's all I could contribute - sorry.

Best,

Robert Luk

> I'm looking for tighter bounds than 0 and the length of the string, by using
> some preprocessing on the strings. So we compute some info in the initial
> time, and we use this later.
>
> Assuming that cost of insert, delete, and update are all 1; we can find a
> lower-bound as follows:
>
> -----
> Let X=(x_1, x_2, x_3, .., x_n) and Y=(y_1, y_2, y_3, .., y_n) be the number
> of occurrences of the letters in string A, and B, respectively. We define a
> distance between these two vectors X,Y as follows:
>
> if (x_i > y_i) P_i = x_i - y_i
> else N_i = y_i - x_i
>
> and distance(X,Y) = Max(summation of all P_is, summation of all N_is) is the
> lower bound for the edit distance(A,B).
>
> This can be proved. Actually this distance is also a metric, and it is a
> better bound than 0.
> -----
>
> I know a paper (published in 1988) that shows this bound. I had given the
> reference in my previous message.
>
> The question is can we come up with a better lower bound? And can we come up
> an upper bound (better than the length of the distance)? Note that, the
> reason we can find such bounds is that we use some preprocessing on the
> strings for later usage.
> OR can we come up with an approximation like this (not necessarily a bound
> but some distance that is a good approximation to edit distance).
>
> Thanks again for your attention..
>
> CR
>
>
>
> >From: Robert Luk (COMP staff) <csrluk@comp.polyu.edu.hk>
> >To: CORPORA@HD.UIB.NO, lambertb@uic.edu
> >Subject: Re: Corpora: approximations (bounds) for edit distance
> >Date: Sat, 1 Dec 2001 07:22:23 +0800 (HKT)
> >
> >Hi Bruce,
> >
> > > Maybe I'm missing something, but the upper bound on edit distance
> >between
> > > two strings is always the length of the longer string, and the lower
> >bound
> > > is always zero (when the strings are identical).
> >
> >I guess it depends on the edit cost of the operations.
> >For most applications, we do expect all edit costs to
> >be larger than or equals to zero and the lb edit
> >distance between any 2 strings is 0. However, the ub
> >of any 2 strings is the length of the longer string
> >is true (I guess) if insert, delete and substitution
> >edit costs are 1 and 0 for correct substitution.
> >If I remember correctly, that's the Levenstein
> >metric but one needs to check the details for
> >correctness. In the applications that I have worked,
> >it is not uncommon to use edit costs are than
> >the Levenstein.
> >
> >Cheers,
> >
> >Robert Luk
> >
>
>
> _________________________________________________________________
> Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp
>
>

Next message: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"
Previous message: David Lee: "Corpora: New Bookmark site for Corpus-based Linguists"
Maybe in reply to: Computer Researcher: "Corpora: approximations (bounds) for edit distance"
Next in thread: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"
Reply: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Mon Dec 03 2001 - 02:39:06 MET