Re: [Corpora-List] 'imperfect' corpora

From: Yannick Versley (versley@sfs.uni-tuebingen.de)
Date: Thu Nov 16 2006 - 09:48:15 MET


    Hi,

    > I have been given access to a large amount of data, which has been OCR'd
    > into a digital (.txt file) format. The data is extremely valuable for a
    > number of reasons and I would like to carry out, amongst other things, a
    > Keyword analysis. However, test-runs with corpus investigation tools show
    > that there are a few problems with the reliability of the corpus due to OCR
    > errors (mis-copying and fragmentation of words over end-of-line boundaries,
    > etc.).
    I think it is worth trying to correct the most blatant of these errors
    (semi-)automatically. For example, you could merge word fragments that are
    split across a line break. Assuming that errors are rare in proportion to
    the rest of the text, you could also replace rare words that do not occur
    in a dictionary or another known-good word list with the nearest word that
    may be the correct spelling, skipping capitalized words, since those are
    likely to be named entities.
    Of course, there is a good deal of guesswork involved here, but if you aim
    for a keyword analysis, you have a better chance if you correct errors
    using a moderate amount of linguistic knowledge than if you just try to
    live with the noisy data.
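
    To make the two heuristics concrete, here is a minimal Python sketch. It is
    only an illustration under assumptions of my own: the function names, the
    punctuation set, the frequency threshold (max_count) and the similarity
    cutoff are all hypothetical choices, and "nearest word" is approximated
    with difflib's similarity ratio from the standard library rather than a
    proper edit distance or an OCR-specific confusion model.

```python
from collections import Counter
from difflib import get_close_matches

PUNCT = ".,;:!?\"'()"  # assumed punctuation set, adjust for your data


def key(token):
    """Normalise a token for word-list lookup (illustrative helper)."""
    return token.strip(PUNCT).lower()


def merge_line_fragments(lines, known_words):
    """Join a line-final fragment with the first token of the next line
    when the fragment alone is unknown but the joined form is known."""
    result = []
    for line in lines:
        tokens = line.split()
        if result and result[-1] and tokens:
            last = result[-1][-1].rstrip("-")
            joined = last + tokens[0]
            if key(last) not in known_words and key(joined) in known_words:
                result[-1][-1] = joined
                tokens = tokens[1:]
        result.append(tokens)
    return [" ".join(tokens) for tokens in result]


def correct_rare_words(tokens, known_words, max_count=2, cutoff=0.8):
    """Replace rare, unknown, non-capitalized words with the most similar
    known word, if one is similar enough; otherwise leave them alone."""
    counts = Counter(key(t) for t in tokens)
    candidates = sorted(known_words)  # sorted for reproducible tie-breaking
    corrected = []
    for token in tokens:
        word = key(token)
        skip = (
            not word.isalpha()          # numbers, bare punctuation, etc.
            or token[:1].isupper()      # probably a named entity
            or word in known_words      # already a known-good word
            or counts[word] > max_count  # too frequent to be a stray error
        )
        match = [] if skip else get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if match:
            # Keep any punctuation attached to the original token.
            start = token.lower().find(word)
            token = token[:start] + match[0] + token[start + len(word):]
        corrected.append(token)
    return corrected


if __name__ == "__main__":
    known = {"the", "reliability", "of", "corpus", "is", "low"}
    lines = merge_line_fragments(["the reli-", "ability of the corpns is low"], known)
    print(correct_rare_words(" ".join(lines).split(), known))
```

    Requiring the line-final fragment to be unknown on its own is a deliberately
    conservative choice: it avoids gluing together real word pairs such as
    "in" + "to" at a line break. A high cutoff serves the same purpose for the
    spelling step, at the price of leaving some errors uncorrected.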

    Best,
    Yannick Versley

    -- 
    Yannick Versley
    Seminar für Sprachwissenschaft, Abt. Computerlinguistik
    Wilhelmstr. 19, 72074 Tübingen
    Tel.: (07071) 29 77352
    


