Re: Corpora: approximations (bounds) for edit distance

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Sat Dec 01 2001 - 11:48:49 MET

  • Next message: Computer Researcher: "Re: Corpora: approximations (bounds) for edit distance"

    Hi Bruce,

    > I agree. I was describing the simplest possible edit distance algorithm,
    > with mismatch costs equal to 1 and correct substitutions equal to 0. Edit
    > distance gets interesting and powerful when one uses a matrix of
    > substitution costs, one cost for each possible
    > substitution/insertion/deletion in an alphabet, and when the costs reflect
    > some inverse function of similarity or confusability "in the real world".
    >
    > The real problem with edit distance is that is slow. What approaches people
    > are using to speed up edit distance calculations? One heuristic approach
    > that has been tried is to first use a fast approach like ngram similarity,
    > then only compute edit distance on pairs whose ngram similarities exceed
    > some threshold. Are there other methods?

    There were some research in fast Viterbi, A* algorithms, Dijsktra in the speech
    processing community (around early 1990s in IEEE ICASSP), which they
    are relevant. Some kind of prunning would help if string lengths are
    very long. However, my knowledge is limited to here. My experience with approximate
    matching words is not that slow. If you want to do approximate match of a set
    of strings with 1 string, obviously a different algo should be used.

    Best,

    Robert Luk



    This archive was generated by hypermail 2b29 : Sat Dec 01 2001 - 19:14:07 MET