[Corpora-List] RE: automatic error correction

From: John Colby (colby@soe.ucsc.edu)
Date: Sun Dec 28 2003 - 22:51:04 MET


    Jason,

    I agree my explanation is very broad. Basically, I am developing
    techniques for grammatical error correction using insertion, deletion and
    substitution. This is what you would call the robust parsing approach,
    although I plan to combine both analytical and statistical techniques. For
    example, the grammar and an equivalent PDA built from it give one the
    ability to answer questions like "When the parser in some parse state
    is faced with an error symbol X, what symbols can be added to or
    deleted from the input, or substituted for X (and perhaps some trailing
    symbols), so that the parser can continue?" It is also possible to analytically
    determine what corrections guarantee that the parser will continue to
    completion. In a broad statistical extension of this robust parsing
    approach, one could choose the corrections which are most likely to lead
    to the best parse. This technique need not be entirely local, as it is
    possible to determine the language of sequences of error corrections to
    apply at different places in any input generated by the PDA. Moreover, if
    one complements left-to-right parsing with right-to-left parsing it is
    possible to locate and correct errors which occur before the parser
    encounters a symbol which it cannot process.
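
    As a small illustration of the single-edit step, here is a toy sketch;
    the expression grammar, vocabulary, and parse() routine are invented
    for the example, and a real implementation would read the viable edits
    off the PDA's parse state rather than trying every token blindly:

        # Toy grammar:  E -> T ('+' T)* ;  T -> 'n' | '(' E ')'
        # Enumerate the single-token insertions, deletions and
        # substitutions that let a failed parse succeed.

        VOCAB = ['n', '+', '(', ')']

        def parse(tokens):
            """True iff tokens form a complete expression."""
            pos = 0
            def term():
                nonlocal pos
                if pos < len(tokens) and tokens[pos] == 'n':
                    pos += 1
                    return True
                if pos < len(tokens) and tokens[pos] == '(':
                    pos += 1
                    if expr() and pos < len(tokens) and tokens[pos] == ')':
                        pos += 1
                        return True
                return False
            def expr():
                nonlocal pos
                if not term():
                    return False
                while pos < len(tokens) and tokens[pos] == '+':
                    pos += 1
                    if not term():
                        return False
                return True
            return expr() and pos == len(tokens)

        def single_edit_repairs(tokens):
            """All one-token edits after which tokens parse."""
            repairs = []
            for i in range(len(tokens) + 1):
                for s in VOCAB:                          # insertion
                    cand = tokens[:i] + [s] + tokens[i:]
                    if parse(cand):
                        repairs.append(('insert', i, s, cand))
            for i in range(len(tokens)):
                cand = tokens[:i] + tokens[i + 1:]       # deletion
                if parse(cand):
                    repairs.append(('delete', i, tokens[i], cand))
                for s in VOCAB:                          # substitution
                    if s != tokens[i]:
                        cand = tokens[:i] + [s] + tokens[i + 1:]
                        if parse(cand):
                            repairs.append(('substitute', i, s, cand))
            return repairs

        for r in single_edit_repairs(['n', '+', '+', 'n']):
            print(r)   # e.g. ('delete', 1, '+', ['n', '+', 'n'])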

    That is the approach I chose to use. I didn't restrict my question to
    just ASR, OCR, or discourse analysis because I believe all of these
    applications can be approached from a grammatical perspective. And I
    didn't restrict my question to a particular way of approaching any of
    these applications because I want to look at all of the solutions which
    others have developed for these problems.

    Thank you for your message, which gives me more information about how
    one could approach error correction in each of these domains. After
    reading your message I took a look at Jurafsky and Martin's book and
    Jelinek's book. I didn't realize J & M had any material on error
    correction for ASR and for OCR.

    Thanks again,
    John

    > John,
    >
    > Your explanation is still very broad. You ought to tell me concretely
    > what situation you find yourself in -- what are you actually trying to
    > *do*? I doubt you're trying to build a universal error correction
    > device that is capable of dealing with ASR *and* OCR *and* discourse
    > ...!
    >
    > A noisy-channel approach says: I've observed erroneous signal Y. What
    > was the true signal X? Let's consider every possible X, and pick the
    > one that maximizes
    >     p(X) * p(Y | X)
    > where p(X) is the a priori probability that the truth was X, and
    > p(Y | X) is the probability that you would have observed Y if the
    > truth was X. The art here is in defining these two probability
    > distributions. p(Y | X), for example, is a model of how errors are
    > typically introduced. It would be defined as a function of Y, X, and
    > some parameters. The form of the function (stipulated by an expert)
    > describes what *kinds* of errors are introduced, and the parameters
    > (typically learned from data) quantify how often the different errors
    > are introduced, so that one can compute a numerical probability
    > p(Y | X).
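    >
    > To make the decision rule concrete, here is a toy single-word
    > corrector in Python. Everything in it -- the three-word vocabulary,
    > the priors, and the 0.95/0.05 channel numbers -- is invented purely
    > for illustration:
    >
    >     # Toy noisy-channel corrector: argmax_X p(X) * p(Y | X).
    >     # All probabilities below are made up; real systems estimate
    >     # them from corpus counts and error data.
    >
    >     PRIOR = {'they': 0.004, 'there': 0.003, 'their': 0.002}  # p(X)
    >
    >     def edit_distance(a, b):
    >         """Plain Levenshtein distance between strings a and b."""
    >         prev = list(range(len(b) + 1))
    >         for i, ca in enumerate(a, 1):
    >             cur = [i]
    >             for j, cb in enumerate(b, 1):
    >                 cur.append(min(prev[j] + 1,                # deletion
    >                                cur[j - 1] + 1,             # insertion
    >                                prev[j - 1] + (ca != cb)))  # substitution
    >             prev = cur
    >         return prev[-1]
    >
    >     def p_channel(y, x):
    >         """Crude error model p(Y = y | X = x)."""
    >         if y == x:
    >             return 0.95    # usually typed correctly (made-up number)
    >         return 0.05 if edit_distance(y, x) == 1 else 1e-6
    >
    >     def correct(y):
    >         """Pick the X maximizing p(X) * p(Y | X)."""
    >         return max(PRIOR, key=lambda x: PRIOR[x] * p_channel(y, x))
    >
    >     print(correct('ther'))   # -> 'they': all three candidates are one
    >                              #    edit away, so the prior breaks the tie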
    >
    > You could regard an ASR system as a noisy channel. X is the true text,
    > and Y is the text produced by the ASR system. So you would need to
    > model the kinds of errors that are introduced by the system.
    >
    > However, you should realize that ASR and OCR systems are themselves
    > built as noisy channel models. For example, in ASR, X is the text, and
    > Y is the speech. Given Y, the ASR system wants to find the most
    > probable X. You would have to work pretty hard to build a noisy channel
    > model that can correct errors beyond what the ASR system already does --
    > unless the ASR system is making assumptions that are not warranted in
    > your situation (e.g., it thinks you're speaking English in a quiet room,
    > whereas in fact you're reading computer code in a windy echo chamber).
    >
    > You can find out more about noisy channel models in any good
    > natural language processing or speech processing textbook.
    > A couple of textbooks are recommended at my course homepage:
    > http://cs.jhu.edu/~jason/465
    >
    > For grammatical errors, you may want to build what is called a robust
    > parser, which is capable of parsing sentences that are "not quite
    > grammatical," correcting them in the process. But this is really an
    > instance of the noisy channel model. First you build an ordinary
    > context-free parser, which defines p(X), and then you compose it with
    > an error model p(Y | X), which is typically a finite-state transducer.
    > (Finite-state devices are in general very useful in building noisy
    > channel models.)
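    >
    > In miniature, with a brute-force best-first search standing in for
    > the transducer composition (the parenthesis grammar and the per-edit
    > penalty are invented for the example):
    >
    >     # Toy "robust parsing as noisy channel": p(X) is 1 for strings
    >     # the grammar accepts and 0 otherwise, and the channel charges
    >     # an invented probability P_EDIT per token edit.
    >     import heapq
    >
    >     P_EDIT = 0.01
    >     VOCAB = ['(', ')']
    >
    >     def grammatical(x):
    >         """Toy p(X): is x a balanced parenthesis string?"""
    >         depth = 0
    >         for t in x:
    >             depth += 1 if t == '(' else -1
    >             if depth < 0:
    >                 return False
    >         return depth == 0
    >
    >     def neighbors(x):
    >         """Every tuple one token edit away from x."""
    >         out = set()
    >         for i in range(len(x) + 1):
    >             for s in VOCAB:
    >                 out.add(x[:i] + (s,) + x[i:])        # insertion
    >         for i in range(len(x)):
    >             out.add(x[:i] + x[i + 1:])               # deletion
    >             for s in VOCAB:
    >                 out.add(x[:i] + (s,) + x[i + 1:])    # substitution
    >         return out
    >
    >     def decode(y, max_edits=2):
    >         """Nearest grammatical X, i.e. argmax p(X) * p(Y | X)."""
    >         frontier = [(0, tuple(y))]
    >         seen = set()
    >         while frontier:
    >             edits, x = heapq.heappop(frontier)
    >             if x in seen:
    >                 continue
    >             seen.add(x)
    >             if grammatical(x):
    >                 return list(x), P_EDIT ** edits
    >             if edits < max_edits:
    >                 for nxt in neighbors(x):
    >                     heapq.heappush(frontier, (edits + 1, nxt))
    >         return None, 0.0
    >
    >     print(decode(['(', ')', ')']))
    >     # one edit repairs the input; this search happens to find the
    >     # insertion ['(', '(', ')', ')'] before the deletion ['(', ')']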
    >
    > There are other ways to correct localized errors. You can build a
    > machine learning classifier that knows how to correct a particular kind
    > of error, e.g., it knows when to change "there" to "they're." The
    > classifier considers a bunch of contextual features, such as surrounding
    > words. There are many techniques for this, since machine learning is a
    > large field. Google for "context-sensitive spelling correction" to get
    > a few of the basic ones (early projects by
    > Yarowsky, Golding, and Roth). You can apply the individual
    > classifiers in parallel or in sequence, perhaps using a technique like
    > Transformation-Based Learning to choose the best order.
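    >
    > Schematically, with five invented training sentences standing in for
    > real data and bag-of-words context features standing in for the richer
    > feature sets those systems use:
    >
    >     # Toy there/they're corrector: a naive-Bayes vote over the words
    >     # surrounding the confusable slot. Training data is made up.
    >     import math
    >     from collections import Counter, defaultdict
    >
    >     TRAIN = [
    >         ("put it over ___ on the table", "there"),
    >         ("is anybody ___ right now", "there"),
    >         ("___ is no reason to worry", "there"),
    >         ("___ going to be late again", "they're"),
    >         ("i think ___ coming tomorrow", "they're"),
    >     ]
    >
    >     counts = defaultdict(Counter)    # counts[label][context word]
    >     priors = Counter()
    >     for sent, label in TRAIN:
    >         priors[label] += 1
    >         for w in sent.split():
    >             if w != "___":
    >                 counts[label][w] += 1
    >     vocab = {w for c in counts.values() for w in c}
    >
    >     def classify(context):
    >         """Label whose training contexts best match (add-one smoothed)."""
    >         best, best_score = None, float("-inf")
    >         for label in priors:
    >             total = sum(counts[label].values())
    >             score = math.log(priors[label] / sum(priors.values()))
    >             for w in context.split():
    >                 score += math.log((counts[label][w] + 1)
    >                                   / (total + len(vocab)))
    >             if score > best_score:
    >                 best, best_score = label, score
    >         return best
    >
    >     print(classify("going to be"))   # -> "they're" on this toy data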
    >
    > Hope this helps lay out some of the territory, but the best technique
    > to use will vary from situation to situation. You have to look at the
    > data to see (1) what kinds of errors show up in real examples and (2)
    > what else in the example tells you that they're errors and how to
    > correct them.
    >
    > -jason
    >
    >> The kinds of errors I am talking about are ones in which the input
    >> does not match what a predictive processor expects. In ASR, the system
    >> might not recognize the next phoneme or word. An OCR system might
    >> produce output which is not a legal spelling of the next word to be
    >> processed, or is not even the expected next word. In textual
    >> processing, the errors would mainly be ones in which the next word of
    >> the input is spelled incorrectly or doesn't match any of the words the
    >> text processor expects to see. Alternatively, a discourse processing
    >> system might expect a cue or a type of discourse component that turns
    >> out to be missing.
    >>
    >> In a way, the errors can all be cast as grammatical ones, but I don't
    >> choose to do so because I am interested in all of the different
    >> methods which have been used with predictive processors to
    >> automatically correct errors.
    >>
    >> -John
    >>
    >> > Is there some particular kind of error you are interested in
    >> > correcting? Might help me decide what reading to recommend.
    >> >
    >> > -j
    >> >
    >> >
    >> >> Jason,
    >> >>
    >> >> I am not familiar with the noisy channel model. If you could point
    >> >> me in the direction of references describing methods which have
    >> >> been used for automatic error correction, especially of the types
    >> >> you listed, it would help me very much.
    >> >>
    >> >> Best,
    >> >> John
    >> >>
    >> >>
    >> >> >> What systems and articles are you aware of with regard to
    >> >> >> automatic error correction for use with speech and textual
    >> >> >> processing and OCR interfaces?
    >> >> >
    >> >> > Standard approach is to use a noisy channel model. I'd implement
    >> >> > it using a toolkit for weighted finite-state transducers, such as
    >> >> > AT&T's FSM toolkit.
    >> >> >
    >> >> > For context-sensitive spelling correction, another alternative is
    >> >> > to use a word-sense disambiguation system.
    >> >> >
    >> >> > If these terms are unfamiliar and aren't explained by other
    >> >> > respondents, write back and I can try to point you in the right
    >> >> > direction.
    >> >> >
    >> >> > Good luck,
    >> >> > Jason Eisner
    >> >> >


