Re: [Corpora-List] PDF Conversion

From: Brett Powley (bpowley@ics.mq.edu.au)
Date: Wed Mar 29 2006 - 05:11:33 MET DST

    Hi Ken,

    The work I have been doing (with the ACL anthology) involves doing
    precisely this. I spent some time evaluating tools to do it, including:

    Adobe Reader (using Save as Text...)
    Multivalent (Java, open source)
    PDFBox (Java, open source)
    XPDF (open source)
    Etymon Pjx (open source)
    PDFTextStream (commercial)
    JPedal (commercial)
    Argus (commercial)
    3-heights PDF extract (pdf-tools) (commercial)

    (I also looked to see whether Mac OS X provided any API for text
    extraction since it has built-in PDF support and it indexes PDF
    documents, but if there is an API it's not a public one yet.)

    The one that gave the best performance was PDFBox (open source,
    Java), but among the ones that performed well, there really wasn't
    much in it.

    There are two major issues in PDF extraction:

    (1) Page layout -- footnotes, columns, etc. PDF is (was) designed to
    provide an accurate on-screen or printed rendering of a document
    (it's essentially a special version of PostScript), so getting the
    text back out wasn't a concern for the original designers, at least.
    This means that, in theory, the text can appear in the file in any
    arbitrary order (the order in which it's drawn), though in practice
    it tends to be in a somewhat sensible order -- the body text tends to
    be in reading order, and columns tend to be OK too. Footnotes,
    headers, and footers, however, are a more difficult problem.
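
    For the header and footer part of this, one rough approach is to
    look for lines that recur at the top or bottom of most pages. What
    follows is only a minimal Python sketch under stated assumptions,
    not something taken from the evaluation above: it assumes the text
    has already been extracted into one list of lines per page by
    whichever tool you use, and the names strip_running_heads, edge and
    min_share are made up for illustration. It does nothing for
    footnotes.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse digits and whitespace so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_running_heads(pages, edge=2, min_share=0.6):
    """Drop lines that recur within the first/last `edge` lines of most pages.

    pages: list of pages, each a list of text lines (hypothetical input
    format; produce it with any extractor).
    edge, min_share: illustrative defaults, not tuned values.
    """
    counts = Counter()
    for lines in pages:
        # Use a set so a line is counted at most once per page.
        candidates = {normalize(l) for l in lines[:edge] + lines[-edge:]}
        counts.update(c for c in candidates if c)

    # Require at least two occurrences so a single page is never stripped.
    threshold = max(2, min_share * len(pages))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            if at_edge and normalize(line) in repeated:
                continue  # treat as a running header, footer or page number
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

    Replacing digits before comparing means bare page numbers are
    caught as well, at the cost of occasionally dropping a short
    numeric line of real content that happens to sit at a page edge.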

    (2) Font encoding -- when a PDF document uses an embedded font
    subset, the mapping between the character codes used for characters
    and what characters they represent is generally unknown. The
    document essentially looks like "draw character X here" where X
    points to the glyph which should be drawn, but is otherwise arbitrary.
    All of the tools above failed on documents with embedded fonts in the
    same way. For the ACL anthology, this seemed to affect about 40% of
    the documents.

    One solution to this problem (albeit not a very elegant one) is to
    render the affected PDF documents (those with this font encoding
    problem) as bitmaps, and then run OCR on them.
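
    To decide which documents need that treatment, a crude check on the
    extracted text can help: text that came out through an unknown font
    encoding rarely contains ordinary function words. The Python sketch
    below is a heuristic of my own for illustration, not part of any of
    the tools listed; the word list, the 0.15 threshold and the names
    common_word_share and looks_garbled are all assumptions, and it only
    makes sense for English text.

```python
import re

# A small, hand-picked list of English function words (an assumption;
# extend or replace it for other languages).
COMMON_WORDS = frozenset(
    "the of and to in a is that for it as was with be by on not this are or "
    "from at which but have an".split()
)


def common_word_share(text):
    """Return the fraction of alphabetic tokens that are common function words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0  # no letters at all, e.g. an image-only or empty page
    return sum(t in COMMON_WORDS for t in tokens) / len(tokens)


def looks_garbled(text, min_share=0.15):
    """Flag text whose function-word share falls below `min_share`.

    min_share is an illustrative threshold, not a measured one.
    """
    return common_word_share(text) < min_share
```

    Documents flagged by looks_garbled (including ones that yield no
    text at all) would then be candidates for the render-and-OCR route;
    the threshold should be checked against a hand-labelled sample
    before being trusted.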

    Hope this helps,

    Brett

    On 29/03/2006, at 2:35 AM, Ken Litkowski wrote:

    > Is anyone aware of free software that will process PDF documents
    > into text streams? There is a PDF2HTML (with an XML option) that
    > will create page-centric versions, but this does not really
    > distinguish text from format. I want to ignore (or be able to
    > treat separately) such things as headers, footnotes, tables,
    > figures, and equations. (Note that even Google retains the page-
    > centric view.)
    >
    > Thanks,
    > Ken
    > --
    > Ken Litkowski TEL.: 301-482-0237
    > CL Research EMAIL: ken@clres.com
    > 9208 Gue Road
    > Damascus, MD 20872-1025 USA Home Page: http://www.clres.com

    --------------------------------------------------------------
    Brett Powley -- PhD Candidate
    Centre for Language Technology, Macquarie University, Australia
    p: +61-402-013050 f: +61-2-90120813 e: bpowley@ics.mq.edu.au
    w: http://www.ics.mq.edu.au/~bpowley
    faciendi plures libros nullus est finis
    frequensque meditatio carnis adflictio est
    --------------------------------------------------------------