RE: [Corpora-List] pdfs/ OCR question

From: Hunter, Duncan (D.I.Hunter@warwick.ac.uk)
Date: Mon Dec 11 2006 - 19:56:11 MET

Next message: Klaus Guenther: "Re: [Corpora-List] pdfs/ OCR question"

Previous message: James_L._Fidelholtz: "[Corpora-List] Re: Google searches as linguistic evidence"
In reply to: Alexandre Rafalovitch: "Re: [Corpora-List] pdfs/ OCR question"
Next in thread: Klaus Guenther: "Re: [Corpora-List] pdfs/ OCR question"
Next in thread: William Fletcher: "RE: [Corpora-List] word frequencies on the web"
Reply: Klaus Guenther: "Re: [Corpora-List] pdfs/ OCR question"
Reply: John F. Sowa: "Re: [Corpora-List] pdfs/ OCR question"

Thanks for this Alexandre.

Interesting to know that PDF files store text info separately! That makes sense, and it also means that the errors have already occurred (at the stage of PDF creation).

It looks like the job of fixing the textual errors is a big one. I think it may simply be a question of accepting the limitations of the corpus we've generated by 'ripping' text from the imperfect pdf files?

Many thanks,

Duncan

________________________________

From: owner-corpora@lists.uib.no on behalf of Alexandre Rafalovitch
Sent: Mon 11/12/2006 16:21
To: corpora@uib.no
Subject: Re: [Corpora-List] pdfs/ OCR question

I would guess that the OCR had been done by the software that
generated the PDF. You might be able to check what it is by looking at
the PDF document's properties. The text is stored on a separate layer from
the image, and the reader just does region matching for selection
purposes.
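
A minimal sketch of the properties check, assuming the PDF keeps its Info dictionary uncompressed with plain literal strings (many scanned PDFs do, but not all). The function name pdf_generator_info and the file name in the usage note are made up for illustration; a proper PDF library would be more robust.

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" literal strings in raw PDF bytes.
# Assumption: the values are simple literal strings without nested parentheses.
_INFO_RE = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def pdf_generator_info(data: bytes) -> dict:
    """Return any /Producer and /Creator values found in raw PDF bytes."""
    found = {}
    for key, value in _INFO_RE.findall(data):
        # Keep the first occurrence of each key; decode leniently.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found
```

With a hypothetical file scan.pdf, pdf_generator_info(open("scan.pdf", "rb").read()) would return whatever producer and creator strings are present, or an empty dict if the metadata is compressed or missing.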

If you need to have this fixed, you probably will need to burst out
the PDF into its page images and have those re-OCRed.
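
One possible way to do the burst-and-re-OCR step is sketched below. It assumes Poppler's pdftoppm and the Tesseract command-line tool are installed; neither is mentioned in this thread, and the function names, the 300 dpi default, and the "page" prefix are illustrative choices only.

```python
import subprocess
from pathlib import Path


def burst_command(pdf: str, out_prefix: str, dpi: int = 300) -> list:
    """Build the pdftoppm command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def ocr_command(image: str, lang: str = "eng") -> list:
    """Build the tesseract command for one page image.

    Tesseract appends ".txt" to the output base itself.
    """
    out_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, out_base, "-l", lang]


def reocr(pdf: str, workdir: str) -> None:
    """Burst a PDF into page images, then OCR each image in page order."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(pdf, str(work / "page")), check=True)
    for image in sorted(work.glob("page-*.png")):
        subprocess.run(ocr_command(str(image)), check=True)
```

The command builders are kept separate from reocr so the exact invocations can be inspected or adjusted (resolution, language) before anything is run.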

Software you might find useful includes PDFBox (http://www.pdfbox.org/)
and Gamera (http://ldp.library.jhu.edu/projects/gamera/)

You can also look at the Distributed Proofreaders to see if there is
anything to be learned from their experience: http://www.pgdp.net/

Regards,
Alex.

On 12/11/06, Hunter, Duncan <D.I.Hunter@warwick.ac.uk> wrote:
> Quick question about pdfs/ OCR:
>
> Some text is copied from a pdf file and pasted into a text or Word file.
> It contains errors: say, for example, 'the' has become 'die' (you notice
> that in the original pdf the 't' and 'h' are quite close together). At what
> stage has this misrecognition/miscopying occurred?
> Where does the OCR take place? The OCR functionality is, presumably, part
> of the .pdf reader software itself?
>
> Can anything be done to deal with the problem?
>
> Duncan Hunter
>
>

This archive was generated by hypermail 2b29 : Mon Dec 11 2006 - 20:03:51 MET