RE: [Corpora-List] PDF Conversion

From: TadPiotr (tadpiotr@plusnet.pl)
Date: Wed Mar 29 2006 - 13:54:49 MET DST

Next message: xing: "[Corpora-List] Graded reading materials"

Previous message: ELDA: "[Corpora-List] ELRA - Language Resources Catalogue - Update"
In reply to: Brett Powley: "Re: [Corpora-List] PDF Conversion"
Next in thread: Victor Kapustin: "RE: [Corpora-List] PDF Conversion"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello everybody,
There is a commercial OCR package, FineReader, which can read a PDF file
whether it contains text or bitmaps (so you do not need to convert it into
bitmaps first). It does not cope well with a very complex layout (e.g. a
tabloid), but otherwise performs quite well. It tries hard to reproduce the
formatting of the PDF pages, including headers and the like.
Best wishes,
Tadeusz Piotrowski

> -----Original Message-----
> From: owner-corpora@lists.uib.no
> [mailto:owner-corpora@lists.uib.no] On Behalf Of Brett Powley
> Sent: Wednesday, March 29, 2006 5:12 AM
> To: Ken Litkowski
> Cc: corpora@hd.uib.no
> Subject: Re: [Corpora-List] PDF Conversion
>
> Hi Ken,
>
> The work I have been doing (with the ACL anthology) involves
> doing precisely this. I spent some time evaluating tools to
> do it, including:
>
> Adobe Reader (using Save as Text...)
> Multivalent (Java, open source)
> PDFBox (Java, open source)
> XPDF (open source)
> Etymon Pjx (open source)
> PDFTextStream (commercial)
> JPedal (commercial)
> Argus (commercial)
> 3-heights PDF extract (pdf-tools) (commercial)
>
> (I also looked to see whether Mac OS X provided any API for
> text extraction since it has built-in PDF support and it
> indexes PDF documents, but if there is an API it's not a
> public one yet.)
>
> The one that gave the best performance was PDFBox (open
> source, Java), but among the ones that performed well, there
> really wasn't much in it.
>
> There are two major issues in PDF extraction:
>
> (1) Page layout -- footnotes, columns, etc. PDF is (was)
> designed to provide an accurate on-screen or printed
> rendering of a document (it's essentially a special version
> of PostScript), so getting the text back out wasn't an issue
> for the original designers at least.
> This means in theory that the text can appear in the file in
> any arbitrary order (the order in which it's drawn), though
> in practice it tends to be in a somewhat sensible order --
> body text tends to be in reading order, and columns tend to
> be OK too. Footnotes, headers, and footers, however, are a
> more difficult problem.
>
> (2) Font encoding -- when a PDF document uses an embedded
> font subset, the mapping between the character codes used for
> characters and what characters they represent is generally
> unknown. The document essentially looks like "draw character
> X here" where X points to the glyph which should be drawn,
> but is otherwise arbitrary.
> All of the tools above failed in the same way on documents
> with embedded font subsets. For the ACL anthology, this
> seemed to affect about 40% of the documents.
> One solution to this problem (albeit not a very elegant one)
> is to render the PDF documents that have font encoding
> problems as bitmaps, and then run OCR on them.
>
> Hope this helps,
>
> Brett
>
>
> On 29/03/2006, at 2:35 AM, Ken Litkowski wrote:
>
> > Is anyone aware of free software that will process PDF documents
> > into text streams? There is a PDF2HTML (with an XML option) that
> > will create page-centric versions, but this does not really
> > distinguish text from format. I want to ignore (or be able to
> > treat separately) such things as headers, footnotes, tables,
> > figures, and equations. (Note that even Google retains the
> > page-centric view.)
> >
> > Thanks,
> > Ken
> > --
> > Ken Litkowski TEL.: 301-482-0237
> > CL Research EMAIL: ken@clres.com
> > 9208 Gue Road
> > Damascus, MD 20872-1025 USA Home Page: http://www.clres.com
> >
> --------------------------------------------------------------
> Brett Powley -- PhD Candidate
> Centre for Language Technology, Macquarie University, Australia
> p: +61-402-013050 f: +61-2-90120813 e: bpowley@ics.mq.edu.au
> w: http://www.ics.mq.edu.au/~bpowley
> faciendi plures libros nullus est finis
> frequensque meditatio carnis adflictio est
> --------------------------------------------------------------

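The OCR fallback discussed in this thread needs some way to decide which
documents to send to OCR. The following is only a minimal sketch of one
possible check, not a method anyone in the thread describes. It assumes the
documents are in English and that text from a PDF with an unknown font
subset encoding comes out as runs of arbitrary characters with almost no
ordinary function words. The word list, the function names
(common_word_ratio, looks_garbled) and the 0.15 threshold are all
hypothetical and would need tuning on real data.

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would
# use a fuller list for the language of the corpus.
COMMON_WORDS = {
    "the", "of", "and", "a", "to", "in", "is", "that", "for", "it",
    "with", "as", "on", "be", "this", "are", "we",
}


def common_word_ratio(text):
    """Return the fraction of alphabetic tokens that are common words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_WORDS)
    return hits / len(tokens)


def looks_garbled(text, min_ratio=0.15):
    """Guess whether extracted text is unusable and should go to OCR.

    min_ratio is a hypothetical threshold, not a measured value.
    Empty extractions also count as garbled.
    """
    return common_word_ratio(text) < min_ratio


if __name__ == "__main__":
    good = ("The mapping between the character codes and the glyphs "
            "is unknown in this document.")
    bad = "Wkh pdsslqj ehwzhhq wkh fkdudfwhu frghv"
    for sample in (good, bad):
        route = "OCR" if looks_garbled(sample) else "keep extracted text"
        print(route)
```

A check like this would run on the output of whichever extraction tool is
used, so that only the documents it flags are rendered to bitmaps and sent
to OCR. It will not catch every failure, for example a document in which
only a few symbols are wrongly mapped.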

This archive was generated by hypermail 2b29 : Wed Mar 29 2006 - 15:55:08 MET DST