Re: [Corpora-List] 'imperfect' corpora

From: Mirko Tavosanis (tavosanis@ital.unipi.it)
Date: Thu Nov 16 2006 - 18:25:04 MET

    Hi, Duncan,

    As for OCR problems, the following references will probably be useful:

    1. Ringlstetter, Christoph, Klaus U. Schulz,
    and Stoyan Mihov. 2006. Orthographic Errors in Web Pages -
    Towards Cleaner Web Corpora. Computational Linguistics 32(3): 295–340.

    2. Strohmaier, Christian, Christoph Ringlstetter,
    Klaus U. Schulz, and Stoyan Mihov. 2003a.
    Lexical postcorrection of OCR-results: The
    web as a dynamic secondary dictionary?
    In Proceedings of the Seventh International
    Conference on Document Analysis and
    Recognition (ICDAR 03), pages 1133–1137,
    Edinburgh.

    3. Strohmaier, Christian, Christoph Ringlstetter,
    Klaus U. Schulz, and Stoyan Mihov. 2003b.
    A visual and interactive tool for
    optimizing lexical postcorrection of
    OCR results. In Proceedings of the IEEE
    Workshop on Document Image Analysis
    and Recognition, DIAR’03, Madison, WI.

    4. Ringlstetter, Christoph. 2003. OCR-Korrektur
    und Bestimmung von
    Levenshtein-Gewichten. Master’s
    thesis, LMU, University of Munich.

    Mirko Tavosanis
    Dipartimento di Studi italianistici
    Università di Pisa
    http://www.humnet.unipi.it/ital/


