RE: [Corpora-List] PDF Conversion

From: TadPiotr (tadpiotr@plusnet.pl)
Date: Wed Mar 29 2006 - 13:54:49 MET DST

    Hello everybody,
    There is a commercial OCR package, FineReader, which can read a PDF file
    whether its pages are text or bitmaps (so you do not need to convert them
    into bitmaps first). It is less reliable on very complex layouts (e.g. a
    tabloid) but otherwise performs quite well. It tries hard to reproduce the
    formatting of PDF pages, including headers and the like.
    Best wishes,
    Tadeusz Piotrowski

    > -----Original Message-----
    > From: owner-corpora@lists.uib.no
    > [mailto:owner-corpora@lists.uib.no] On Behalf Of Brett Powley
    > Sent: Wednesday, March 29, 2006 5:12 AM
    > To: Ken Litkowski
    > Cc: corpora@hd.uib.no
    > Subject: Re: [Corpora-List] PDF Conversion
    >
    > Hi Ken,
    >
    > The work I have been doing (with the ACL anthology) involves
    > doing precisely this. I spent some time evaluating tools to
    > do it, including:
    >
    > Adobe Reader (using Save as Text...)
    > Multivalent (Java, open source)
    > PDFBox (Java, open source)
    > XPDF (open source)
    > Etymon Pjx (open source)
    > PDFTextStream (commercial)
    > JPedal (commercial)
    > Argus (commercial)
    > 3-heights PDF extract (pdf-tools) (commercial)
    >
    > (I also looked to see whether Mac OS X provided any API for
    > text extraction since it has built-in PDF support and it
    > indexes PDF documents, but if there is an API it's not a
    > public one yet.)
    >
    > The one that gave the best performance was PDFBox (open
    > source, Java), but among the ones that performed well, there
    > really wasn't much in it.
    >
    > There are two major issues in PDF extraction:
    >
    > (1) Page layout -- footnotes, columns, etc. PDF is (was)
    > designed to provide an accurate on-screen or printed
    > rendering of a document (it's essentially a special version
    > of PostScript), so getting the text back out wasn't an issue
    > for the original designers at least.
    > This means in theory that the text can appear in the file in
    > any arbitrary order (the order in which it's drawn), though
    > in practice it tends to be in a somewhat sensible order --
    > running text tends to be in order, and columns tend to be OK too.
    > Footnotes, headers, and footers, however, are a more
    > difficult problem.
    >
    > (2) Font encoding -- when a PDF document uses an embedded
    > font subset, the mapping between the character codes used for
    > characters and what characters they represent is generally
    > unknown. The document essentially looks like "draw character
    > X here" where X points to the glyph which should be drawn,
    > but is otherwise arbitrary.
    > All of the tools above failed on documents with embedded
    > fonts in the same way. For the ACL anthology, this seemed to
    > affect about 40% of the documents.
    > One solution to this problem (albeit not a very elegant one)
    > is to render the PDF documents that have this font encoding
    > problem as bitmaps, and then run OCR on them.
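
The two issues above can be illustrated with a toy sketch in plain Python. Nothing here uses a real PDF library or any of the tools listed: `Fragment`, `reading_order`, `build_subset_encoding`, and the 306-point column split are hypothetical names and values chosen only for illustration. The first part assumes a simple two-column page and re-sorts text fragments that were drawn out of order; the second part mimics a font subset that numbers its glyphs in order of first use, so the stored codes mean nothing without the reverse mapping.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text as drawn on the page (hypothetical structure)."""
    x: float   # left edge, in points from the left of the page
    y: float   # baseline, in points from the bottom of the page
    text: str


def reading_order(fragments, column_split):
    """Left column first, then top to bottom within each column.

    Assumes exactly two columns separated at x == column_split.
    """
    def key(fragment):
        column = 0 if fragment.x < column_split else 1
        # y grows upward, so negate it to sort from the top of the page down.
        return (column, -fragment.y, fragment.x)
    return [f.text for f in sorted(fragments, key=key)]


# Issue (1): fragments appear in drawing order, not reading order.
drawn = [
    Fragment(320, 700, "right column, line 1"),
    Fragment(72, 680, "left column, line 2"),
    Fragment(72, 700, "left column, line 1"),
    Fragment(320, 680, "right column, line 2"),
]
ordered = reading_order(drawn, column_split=306)


def build_subset_encoding(text):
    """Number characters 1, 2, 3... in order of first use.

    This is an assumed, simplified model of how a subset font might
    assign codes; real producers differ.
    """
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    return code_for


# Issue (2): the page stores glyph codes, not characters.
word = "corpora"
code_for = build_subset_encoding(word)
codes = [code_for[ch] for ch in word]

# Treating the codes as ordinary character codes yields garbage.
naive = "".join(chr(code) for code in codes)

# Only the reverse mapping (code -> character) recovers the text.
char_for = {code: ch for ch, code in code_for.items()}
recovered = "".join(char_for[code] for code in codes)

if __name__ == "__main__":
    print(ordered)
    print(codes, repr(naive), recovered)
```

In a real PDF the reverse mapping is only available if the producer included it (for example as a ToUnicode CMap). When it is missing, an extractor has nothing to decode the codes with, which is why rendering to bitmaps and running OCR is a workable fallback.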
    >
    > Hope this helps,
    >
    > Brett
    >
    >
    > On 29/03/2006, at 2:35 AM, Ken Litkowski wrote:
    >
    > > Is anyone aware of free software that will process PDF documents
    > > into text streams? There is a PDF2HTML (with an XML option) that will
    > > create page-centric versions, but this does not really distinguish
    > > text from format. I want to ignore (or be able to treat separately)
    > > such things as headers, footnotes, tables, figures, and equations.
    > > (Note that even Google retains the page-centric view.)
    > >
    > > Thanks,
    > > Ken
    > > --
    > > Ken Litkowski TEL.: 301-482-0237
    > > CL Research EMAIL: ken@clres.com
    > > 9208 Gue Road
    > > Damascus, MD 20872-1025 USA Home Page: http://www.clres.com
    > >
    > >
    > >
    >
    >
    >
    > --------------------------------------------------------------
    > Brett Powley -- PhD Candidate
    > Centre for Language Technology, Macquarie University, Australia
    > p: +61-402-013050 f: +61-2-90120813 e: bpowley@ics.mq.edu.au
    > w: http://www.ics.mq.edu.au/~bpowley
    > faciendi plures libros nullus est finis
    > frequensque meditatio carnis adflictio est
    > --------------------------------------------------------------
