Re: Corpora: approximations (bounds) for edit distance

From: Computer Researcher (compresearch@hotmail.com)
Date: Sat Dec 01 2001 - 19:33:25 MET

  • Next message: geoffrey.williams: "Corpora: Collocation workshop: 2nd Call for Papers"

    Hi, thanks for the responses.

    I'm looking for tighter bounds than 0 and the length of the string, by using
    some preprocessing on the strings. So we compute some info in the initial
    time, and we use this later.

    Assuming that cost of insert, delete, and update are all 1; we can find a
    lower-bound as follows:

    -----
    Let X=(x_1, x_2, x_3, .., x_n) and Y=(y_1, y_2, y_3, .., y_n) be the number
    of occurrences of the letters in string A, and B, respectively. We define a
    distance between these two vectors X,Y as follows:

    if (x_i > y_i) P_i = x_i - y_i
    else N_i = y_i - x_i

    and distance(X,Y) = Max(summation of all P_is, summation of all N_is) is the
    lower bound for the edit distance(A,B).

    This can be proved. Actually this distance is also a metric, and it is a
    better bound than 0.
    -----

    I know a paper (published in 1988) that shows this bound. I had given the
    reference in my previous message.

    The question is can we come up with a better lower bound? And can we come up
    an upper bound (better than the length of the distance)? Note that, the
    reason we can find such bounds is that we use some preprocessing on the
    strings for later usage.
    OR can we come up with an approximation like this (not necessarily a bound
    but some distance that is a good approximation to edit distance).

    Thanks again for your attention..

    CR

    >From: Robert Luk (COMP staff) <csrluk@comp.polyu.edu.hk>
    >To: CORPORA@HD.UIB.NO, lambertb@uic.edu
    >Subject: Re: Corpora: approximations (bounds) for edit distance
    >Date: Sat, 1 Dec 2001 07:22:23 +0800 (HKT)
    >
    >Hi Bruce,
    >
    > > Maybe I'm missing something, but the upper bound on edit distance
    >between
    > > two strings is always the length of the longer string, and the lower
    >bound
    > > is always zero (when the strings are identical).
    >
    >I guess it depends on the edit cost of the operations.
    >For most applications, we do expect all edit costs to
    >be larger than or equals to zero and the lb edit
    >distance between any 2 strings is 0. However, the ub
    >of any 2 strings is the length of the longer string
    >is true (I guess) if insert, delete and substitution
    >edit costs are 1 and 0 for correct substitution.
    >If I remember correctly, that's the Levenstein
    >metric but one needs to check the details for
    >correctness. In the applications that I have worked,
    >it is not uncommon to use edit costs are than
    >the Levenstein.
    >
    >Cheers,
    >
    >Robert Luk
    >

    _________________________________________________________________
    Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp



    This archive was generated by hypermail 2b29 : Sat Dec 01 2001 - 19:48:11 MET