RE: [Corpora-List] 'imperfect' corpora

From: Eric Ringger (ringger@cs.byu.edu)
Date: Thu Nov 16 2006 - 23:52:31 MET

    Thanks to all for the interesting references.

    As a Ph.D. student, I conducted some related research on the post-correction
    of speech recognition results. Here is the briefest noteworthy reference:

    Eric K. Ringger and James F. Allen. "A Fertility Channel Model for
    Post-Correction of Continuous Speech Recognition." Proceedings of the Fourth
    International Conference on Spoken Language Processing (ICSLP'96).
    Philadelphia, PA. October 1996.

    http://www.cs.rochester.edu/u/ringger/research/icslp-96.html

    As no automatic post-correction technique will itself be perfect, I agree
    with Sravana Reddy that there is much to be said for corpus analysis
    techniques that are robust to the errors which inevitably occur in the
    process of automatic document acquisition (OCR, speech recognition, ...).

    Many of the automatic post-correction techniques referenced in this thread
    leverage common error instances and types. One would expect robust corpus
    analysis techniques at least to be able to see through the infrequent,
    random errors.

    Regards,
    --Eric
    http://faculty.cs.byu.edu/~ringger/

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Mirko Tavosanis
    Sent: Thursday, November 16, 2006 10:25 AM
    To: Hunter, Duncan; corpora@lists.uib.no
    Subject: Re: [Corpora-List] 'imperfect' corpora

    Hi, Duncan,

    As for OCR problems, you can probably use:

    1. Ringlstetter, Christoph, Klaus U. Schulz, and Stoyan Mihov.
    Orthographic Errors in Web Pages - Towards Cleaner Web Corpora.
    Computational Linguistics 32(3): 295-340.

    2. Strohmaier, Christian, Christoph Ringlstetter, Klaus U. Schulz,
    and Stoyan Mihov. 2003. Lexical postcorrection of OCR-results: The web
    as a dynamic secondary dictionary? In Proceedings of the Seventh
    International Conference on Document Analysis and Recognition
    (ICDAR 03), pages 1133-1137, Edinburgh.

    3. Strohmaier, Christian, Christoph Ringlstetter, Klaus U. Schulz,
    and Stoyan Mihov. 2003. A visual and interactive tool for optimizing
    lexical postcorrection of OCR results. In Proceedings of the IEEE
    Workshop on Document Image Analysis and Recognition (DIAR'03),
    Madison, WI.

    4. Ringlstetter, Christoph. 2003. OCR-Korrektur und Bestimmung von
    Levenshtein-Gewichten. Master's thesis, LMU, University of Munich.

    Mirko Tavosanis
    Dipartimento di Studi italianistici
    Universita' di Pisa
    http://www.humnet.unipi.it/ital/


