Re: Corpora: approximations (bounds) for edit distance

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Mon Dec 03 2001 - 02:34:37 MET

  • Next message: John Goldsmith: "RE: Corpora: approximations (bounds) for edit distance"

    Hi CR,
    Hi CR,

    Thanks for the clarification. In other words, what you want is a (approx.) lower/upper
    bound estimator based on some preprocessed information of 2 input
    strings x and y? Whether these are the least upper or larget lower
    bounds are still unknown unless proven. The estimator below might
    be improved by the information about the knowledge of the length
    of the 2 strings. The least combined insert and delete operations
    must be at least abs(|x| - |y|) where x and y are the input strings
    and |.| is the length of the argument. My guess is that you want
    this preprocessing to have a better complexity than O(|x| * |y|) otherwise
    one would perform the DP. Another method is to run a left-to-right match
    with a context window of size abs(|x| - |y|) and compute the max or min differences
    and sum them up as some kind of bounds. However, this requires O(|x| |y|) cost in general.
    Another less expensive one is to use a fixed size window (say within 4 characters)
    to look for these edit distance bounds and the time complexity is O(|x|), which is
    the same if you compute the character occurrences. You can also increase the
    window size to get better and better estimates of the bounds.
    My other guess is that people working on DNA sequence matching might have
    worked on this problem. That's all I could contribute - sorry.

    Best,

    Robert Luk

    > I'm looking for tighter bounds than 0 and the length of the string, by using
    > some preprocessing on the strings. So we compute some info in the initial
    > time, and we use this later.
    >
    > Assuming that cost of insert, delete, and update are all 1; we can find a
    > lower-bound as follows:
    >
    > -----
    > Let X=(x_1, x_2, x_3, .., x_n) and Y=(y_1, y_2, y_3, .., y_n) be the number
    > of occurrences of the letters in string A, and B, respectively. We define a
    > distance between these two vectors X,Y as follows:
    >
    > if (x_i > y_i) P_i = x_i - y_i
    > else N_i = y_i - x_i
    >
    > and distance(X,Y) = Max(summation of all P_is, summation of all N_is) is the
    > lower bound for the edit distance(A,B).
    >
    > This can be proved. Actually this distance is also a metric, and it is a
    > better bound than 0.
    > -----
    >
    > I know a paper (published in 1988) that shows this bound. I had given the
    > reference in my previous message.
    >
    > The question is can we come up with a better lower bound? And can we come up
    > an upper bound (better than the length of the distance)? Note that, the
    > reason we can find such bounds is that we use some preprocessing on the
    > strings for later usage.
    > OR can we come up with an approximation like this (not necessarily a bound
    > but some distance that is a good approximation to edit distance).
    >
    > Thanks again for your attention..
    >
    > CR
    >
    >
    >
    > >From: Robert Luk (COMP staff) <csrluk@comp.polyu.edu.hk>
    > >To: CORPORA@HD.UIB.NO, lambertb@uic.edu
    > >Subject: Re: Corpora: approximations (bounds) for edit distance
    > >Date: Sat, 1 Dec 2001 07:22:23 +0800 (HKT)
    > >
    > >Hi Bruce,
    > >
    > > > Maybe I'm missing something, but the upper bound on edit distance
    > >between
    > > > two strings is always the length of the longer string, and the lower
    > >bound
    > > > is always zero (when the strings are identical).
    > >
    > >I guess it depends on the edit cost of the operations.
    > >For most applications, we do expect all edit costs to
    > >be larger than or equals to zero and the lb edit
    > >distance between any 2 strings is 0. However, the ub
    > >of any 2 strings is the length of the longer string
    > >is true (I guess) if insert, delete and substitution
    > >edit costs are 1 and 0 for correct substitution.
    > >If I remember correctly, that's the Levenstein
    > >metric but one needs to check the details for
    > >correctness. In the applications that I have worked,
    > >it is not uncommon to use edit costs are than
    > >the Levenstein.
    > >
    > >Cheers,
    > >
    > >Robert Luk
    > >
    >
    >
    > _________________________________________________________________
    > Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp
    >
    >



    This archive was generated by hypermail 2b29 : Mon Dec 03 2001 - 02:39:06 MET