RE: [Corpora-List] 'imperfect' corpora

From: Eric Ringger (ringger@cs.byu.edu)
Date: Thu Nov 16 2006 - 23:52:31 MET

Next message: Constantin Orasan: "[Corpora-List] Final reminder for Research studentship in spatial reasoning for question answering (£10,000 per year)"

Previous message: Paul Buitelaar: "[Corpora-List] OntoSelect - a multilingual ontology library and ontology selection service"
In reply to: Mirko Tavosanis: "Re: [Corpora-List] 'imperfect' corpora"
Next in thread: Hunter, Duncan: "RE: [Corpora-List] 'imperfect' corpora"
Reply: Hunter, Duncan: "RE: [Corpora-List] 'imperfect' corpora"

Thanks to all for the interesting references.

As a Ph.D. student, I conducted some related research on the post-correction
of speech recognition results. Here is the briefest noteworthy reference:

Eric K. Ringger and James F. Allen. "A Fertility Channel Model for
Post-Correction of Continuous Speech Recognition." Proceedings of the Fourth
International Conference on Spoken Language Processing (ICSLP'96).
Philadelphia, PA. October 1996.

http://www.cs.rochester.edu/u/ringger/research/icslp-96.html

As no automatic post-correction technique will itself be perfect, I agree
with Sravana Reddy that there is much to be said for corpus analysis
techniques that are robust to the errors which inevitably occur in the
process of automatic document acquisition (OCR, speech recognition, ...).

Many of the automatic post-correction techniques referenced in this thread
leverage common error instances and types. One would expect robust corpus
analysis techniques at least to be able to see through the infrequent,
random errors.

Regards,
--Eric
http://faculty.cs.byu.edu/~ringger/

-----Original Message-----
From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
Behalf Of Mirko Tavosanis
Sent: Thursday, November 16, 2006 10:25 AM
To: Hunter, Duncan; corpora@lists.uib.no
Subject: Re: [Corpora-List] 'imperfect' corpora

Hi, Duncan,

as for OCR problems, you can probably use:

1. Ringlstetter, Christoph, Klaus U. Schulz, and Stoyan Mihov. 2006.
Orthographic Errors in Web Pages - Towards Cleaner Web Corpora.
Computational Linguistics 32(3): 295-340.

2. Strohmaier, Christian, Christoph Ringlstetter, Klaus U. Schulz, and
Stoyan Mihov. 2003. Lexical postcorrection of OCR-results: The web as a
dynamic secondary dictionary? In Proceedings of the Seventh International
Conference on Document Analysis and Recognition (ICDAR 03),
pages 1133-1137, Edinburgh.

3. Strohmaier, Christian, Christoph Ringlstetter, Klaus U. Schulz, and
Stoyan Mihov. 2003. A visual and interactive tool for optimizing lexical
postcorrection of OCR results. In Proceedings of the IEEE Workshop on
Document Image Analysis and Recognition (DIAR'03), Madison, WI.

4. Ringlstetter, Christoph. 2003. OCR-Korrektur und Bestimmung von
Levenshtein-Gewichten. Master's thesis, LMU, University of Munich.

Mirko Tavosanis
Dipartimento di Studi italianistici
Università di Pisa
http://www.humnet.unipi.it/ital/
