Re: [Corpora-List] 'imperfect' corpora

From: Ed Kenschaft (ekenschaft@gmail.com)
Date: Thu Nov 16 2006 - 14:50:27 MET


    Okan Kolak did some work in OCR postprocessing, although I don't think he's
    pursuing that currently.

    Try:

    Okan Kolak and Philip Resnik, "OCR Post-Processing for Low Density
    Languages", HLT/EMNLP 2005, Vancouver, October 2005.

    On 11/16/06, Yannick Versley <versley@sfs.uni-tuebingen.de> wrote:
    >
    > Hi,
    >
    > > I have been given access to a large amount of data, which has been OCR'd
    > > into a digital (.txt file) format. The data is extremely valuable for a
    > > number of reasons and I would like to carry out, amongst other things, a
    > > Keyword analysis. However, test-runs with corpus investigation tools show
    > > that there are a few problems with the reliability of the corpus due to
    > > OCR errors (mis-copying and fragmentation of words over end-of-line
    > > boundaries, etc.).
    >
    > I think it may be worth trying to (semi-)automatically correct the most
    > blatant of these errors. For example, you could merge word fragments that
    > are split over the end of a line, or (assuming that the errors are rare in
    > proportion to the rest) correct rare words that do not occur in a
    > dictionary or another known-good word list and are not capitalized
    > (capitalized words are likely to be named entities) to the nearest word
    > that may be the correct spelling.
    > Of course, there is much guesswork involved here, but if you aim for a
    > keyword analysis, you have a better chance if you correct errors using a
    > moderate amount of linguistic knowledge than if you just try to live with
    > the noisy data.
    >
    > Best,
    > Yannick Versley
    >
    > --
    > Yannick Versley
    > Seminar für Sprachwissenschaft, Abt. Computerlinguistik
    > Wilhelmstr. 19, 72074 Tübingen
    > Tel.: (07071) 29 77352
    >
    >
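
    As a rough illustration of the two heuristics Yannick describes above
    (rejoining fragments split at a line end, then mapping rare, lowercase,
    out-of-vocabulary words to the closest known word), here is a minimal
    Python sketch. Everything in it is an assumption made for illustration:
    the function names, the whitespace tokenisation, the frequency threshold
    `max_count`, and the use of difflib's similarity ratio as a stand-in for
    "nearest word". It is not the method from the Kolak and Resnik paper.

```python
import difflib
import string
from collections import Counter


def core(token):
    """Lowercase a token and strip surrounding punctuation for lookups."""
    return token.strip(string.punctuation).lower()


def merge_line_fragments(lines, known_words):
    """Rejoin words split over an end-of-line boundary.

    The last token of a line is merged with the first token of the next
    line only if the joined form is in known_words and either the first
    part ends in a hyphen or neither part is a known word by itself.
    Returns a flat token list (line structure is discarded).
    """
    out = []
    for line in lines:
        tokens = line.split()
        if out and tokens:
            prev, nxt = out[-1], tokens[0]
            joined = prev.rstrip("-") + nxt
            suspicious = prev.endswith("-") or (
                core(prev) not in known_words and core(nxt) not in known_words
            )
            if suspicious and core(joined) in known_words:
                out[-1] = joined
                tokens = tokens[1:]
        out.extend(tokens)
    return out


def correct_rare_words(tokens, known_words, max_count=1, cutoff=0.8):
    """Replace rare, lowercase, unknown words with the closest known word.

    Capitalized tokens are left alone (assumed to be named entities), as
    are tokens that occur more than max_count times or have no known word
    with a similarity ratio of at least cutoff.
    """
    counts = Counter(t.lower() for t in tokens)
    vocab = sorted(known_words)
    corrected = []
    for tok in tokens:
        low = tok.lower()
        is_candidate = (
            tok.isalpha()
            and not tok[0].isupper()
            and low not in known_words
            and counts[low] <= max_count
        )
        if is_candidate:
            match = difflib.get_close_matches(low, vocab, n=1, cutoff=cutoff)
            corrected.append(match[0] if match else tok)
        else:
            corrected.append(tok)
    return corrected


if __name__ == "__main__":
    words = {"the", "reliability", "of", "corpus", "is", "affected", "by",
             "fragmentation", "words"}
    raw = ["the rcliability of the cor", "pus is affected by frag-",
           "mentation of words"]
    print(" ".join(correct_rare_words(merge_line_fragments(raw, words), words)))
```

    Merging is done before correction so that fragments such as "cor" and
    "pus" are not each "corrected" to some unrelated short word. A real word
    list and a sample of hand-checked output would be needed to choose
    sensible values for the two thresholds.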

    -- 
    Ed Kenschaft
    Ph.D. student, Computational Linguistics, University of Maryland
    ekenschaft@gmail.com
    www.umiacs.umd.edu/users/kensch/
    


