Re: [Corpora-List] 'imperfect' corpora

From: Sravana Reddy (sravana.reddy@gmail.com)
Date: Thu Nov 16 2006 - 17:59:42 MET


    It looks like everyone is suggesting you correct the OCR errors rather than
    work with the imperfect data directly. I think the latter problem is more
    interesting, if only because you will most likely never have an error-free
    corpus. It is a harder machine-learning problem, but that doesn't mean it
    can't be solved. Unfortunately, I have no concrete ideas to offer.

    I think one of Kolak and Resnik's papers uses n-gram models at the
    character level. This gives you the power of dictionary lookup and
    correction, but with greater flexibility. It also lets you correct
    misplaced word boundaries by treating the space as just another character
    -- one way to rejoin word fragments that have been split, and, conversely,
    to split words that have been run together.
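    To make the character-level idea concrete, here is a minimal sketch of my
    own (not Kolak and Resnik's actual model, which is a noisy-channel system):
    a smoothed character-bigram model trained on clean text, with the space
    treated as an ordinary symbol, so candidate corrections that merge or split
    words can be scored directly.

```python
# Illustrative sketch, not Kolak & Resnik's model: a smoothed character-bigram
# model in which the space is just another symbol, so candidate corrections
# that merge or split words can be compared like any other edit.
from collections import Counter

def train_char_bigrams(text):
    """Count character bigrams and unigrams over clean training text."""
    return Counter(zip(text, text[1:])), Counter(text)

def plausibility(candidate, bigrams, unigrams, vocab_size):
    """Add-one-smoothed product of bigram probabilities; higher = more plausible."""
    p = 1.0
    for a, b in zip(candidate, candidate[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)
    return p

clean = "the quick brown fox jumps over the lazy dog and the cat sat"
bigrams, unigrams = train_char_bigrams(clean)
V = len(set(clean))

# An OCR line with a spurious space: the merged candidate scores higher
# because 'ca' is attested in the training text while 'c ' is not.
candidates = ["the c at", "the cat"]
best = max(candidates, key=lambda c: plausibility(c, bigrams, unigrams, V))
# best == "the cat"
```

    A real system would train on far more text, use longer n-grams, and weigh
    an error (channel) model against this language model, but the point stands:
    once spaces are ordinary characters, boundary errors are just character
    errors.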

    On 11/16/06, Ed Kenschaft <ekenschaft@gmail.com> wrote:
    >
    > Okan Kolak did some work in OCR postprocessing, although I don't think
    > he's pursuing that currently.
    >
    > Try:
    >
    > Okan Kolak and Philip Resnik, "OCR Post-Processing for Low Density
    > Languages", HLT/EMNLP 2005, Vancouver, October 2005.
    >
    > On 11/16/06, Yannick Versley <versley@sfs.uni-tuebingen.de> wrote:
    > >
    > > Hi,
    > >
    > > > I have been given access to a large amount of data, which has been
    > > > OCR'd into a digital (.txt file) format. The data is extremely
    > > > valuable for a number of reasons and I would like to carry out,
    > > > amongst other things, a keyword analysis. However, test runs with
    > > > corpus investigation tools show that there are a few problems with
    > > > the reliability of the corpus due to OCR errors (mis-copying and
    > > > fragmentation of words over end-of-line boundaries, etc.).
    > > I think it may be worth trying to (semi-)automatically correct the
    > > most blatant of these errors: for example, merging word fragments that
    > > are split over the end of a line, or (assuming the errors are rare in
    > > proportion to the rest) correcting rare words that do not occur in a
    > > dictionary or another known-good word list, and are not capitalized
    > > (i.e. not named entities), to the nearest word that may be the correct
    > > spelling.
    > > Of course, there is much guesswork involved here, but if you aim for a
    > > keyword analysis, you have a better chance correcting errors with a
    > > moderate amount of linguistic knowledge than just trying to live with
    > > the noisy data.
    > >
    > > Best,
    > > Yannick Versley
    > >
    > > --
    > > Yannick Versley
    > > Seminar für Sprachwissenschaft, Abt. Computerlinguistik
    > > Wilhelmstr. 19, 72074 Tübingen
    > > Tel.: (07071) 29 77352
    > >
    > >
    >
    >
    > --
    > Ed Kenschaft
    > Ph.D. student, Computational Linguistics, University of Maryland
    > ekenschaft@gmail.com
    > www.umiacs.umd.edu/users/kensch/
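
    The two heuristics Yannick suggests -- rejoining fragments split over line
    breaks, then snapping rare, uncapitalized out-of-dictionary tokens to their
    nearest dictionary neighbour -- could be sketched roughly like this
    (illustrative code of my own with a toy word list; a real run would use a
    proper dictionary and corpus frequencies):

```python
# Rough sketch of the two heuristics from the thread (my own illustration):
# (1) re-join words hyphenated over a line break when the joined form is in
#     a known-good word list;
# (2) snap lower-case, out-of-vocabulary tokens to the closest dictionary
#     word, leaving capitalized tokens (likely named entities) alone.
import re
from difflib import get_close_matches

DICTIONARY = {"language", "corpus", "keyword", "analysis", "boundaries"}

def merge_linebreak_fragments(text, dictionary):
    """Join 'frag-\\nment' into 'fragment' when the merged word is known."""
    def repl(m):
        joined = m.group(1) + m.group(2)
        return joined if joined.lower() in dictionary else m.group(0)
    return re.sub(r"(\w+)-\n(\w+)", repl, text)

def correct_token(token, dictionary):
    """Correct rare OOV tokens; skip known words and capitalized tokens."""
    if token.istitle() or token.lower() in dictionary:
        return token
    matches = get_close_matches(token.lower(), dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else token

text = "keyword analy-\nsis of the corpvs"
text = merge_linebreak_fragments(text, DICTIONARY)
tokens = [correct_token(t, DICTIONARY) for t in text.split()]
# tokens == ["keyword", "analysis", "of", "the", "corpus"]
```

    As Yannick says, there is much guesswork in this: the similarity cutoff,
    the word list, and the named-entity test all need tuning against a sample
    of the actual OCR output.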



    This archive was generated by hypermail 2b29 : Thu Nov 16 2006 - 17:57:18 MET