Re: [Corpora-List] PDF Conversion

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Wed Mar 29 2006 - 03:42:07 MET DST


    Tom Emerson wrote:
    > Ken Litkowski writes:
    >> Is anyone aware of free software that will process PDF documents into
    >> text streams?
    > ...
    > PDF is a page-centric format, so you are unlikely to find
    > something that does what you are looking for: headers, footnotes,
    > tables, etc. are not going to be flagged from the surrounding content
    > in any special way.

    I suspect (but don't know) that the problems Tom describes here (which I
    can confirm from experience) are going to affect _any_ PDF-to-text converter. In
    addition to his list of problems (and those others have mentioned, e.g.
    the fact that some PDFs are basically bitmaps), here are some problems
    we've encountered:

    1) Multi-column text may come out as
            <line1 from column1> <line1 from column2>
            <line2 from column1> <line2 from column2>
        etc., rather than what you want:
            <line1 from column1>
            <line2 from column1>
            ...
            <line1 from column2>
            <line2 from column2>
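
    A minimal sketch of undoing that interleaving, assuming (and this is a
    big assumption) that the converter preserved the gutter between columns
    as a run of three or more spaces, and that there are only two columns.
    The function name and the gutter pattern are hypothetical, not taken
    from any particular converter:

```python
import re

def deinterleave_columns(lines, gutter=r"\s{3,}"):
    """Regroup two-column text that was extracted line by line.

    Assumes each physical line looks like
        <column-1 text><wide whitespace><column-2 text>
    and that a run of 3+ whitespace characters marks the gutter.
    """
    left, right = [], []
    for line in lines:
        # Split at the first wide whitespace run only.
        parts = re.split(gutter, line.strip(), maxsplit=1)
        left.append(parts[0])
        if len(parts) == 2:
            right.append(parts[1])
    # Emit all of column 1, then all of column 2, dropping empty lines.
    return [s for s in left if s] + [s for s in right if s]

page = [
    "line1 from column1     line1 from column2",
    "line2 from column1     line2 from column2",
]
for text in deinterleave_columns(page):
    print(text)
```

    This breaks as soon as the gutter is collapsed to a single space or a
    line in column 1 is short enough to be indistinguishable from a
    column-2 line, which is one reason layout-aware extraction is hard.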

    2) Character encoding can be a mess. In some cases it's sheer
    gibberish; in other cases, you get something that is nearly "correct",
    but with exceptions. I saw Tigrinya (Ethiopic language) text that came
    out of PDFs as Unicode characters in the Ethiopic range, _except_ for
    about five alphabetic characters, which came out in the ASCII range. We
    were able to figure out what the particular characters were supposed to
    be--something like glottal stop + a vowel, IIRC--and map them correctly.
    But I've always wondered why they did it that way. In some of the
    gibberish cases (Bengali, IIRC), I suspect it was just a proprietary
    encoding. But I've seen English text extract as gibberish, so it almost
    looks like some kind of encryption for the purpose of preventing you
    from extracting the text.
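
    For the "nearly correct" case, the kind of repair described above can
    be sketched as a character remapping. The mapping table below is purely
    hypothetical (I don't have the real one to hand); the actual
    correspondences have to be worked out by inspecting the extracted text.
    To avoid damaging genuine Latin-script words, this sketch only remaps
    ASCII letters inside tokens that also contain Ethiopic characters:

```python
# Hypothetical mapping from stray ASCII letters to Ethiopic syllables
# (here the glottal-stop series); the real table must be found empirically.
STRAY_TO_ETHIOPIC = {
    "A": "\u12A0",
    "U": "\u12A1",
    "I": "\u12A2",
    "a": "\u12A3",
    "E": "\u12A4",
}

def is_ethiopic(ch):
    """True if ch is in the basic Ethiopic block (U+1200..U+137F)."""
    return "\u1200" <= ch <= "\u137F"

def repair_token(token, mapping=STRAY_TO_ETHIOPIC):
    """Remap stray ASCII letters, but only in tokens containing Ethiopic."""
    if not any(is_ethiopic(ch) for ch in token):
        return token  # leave pure Latin-script tokens alone
    return "".join(mapping.get(ch, ch) for ch in token)

def repair_text(text, mapping=STRAY_TO_ETHIOPIC):
    """Apply repair_token to each space-separated token."""
    return " ".join(repair_token(t, mapping) for t in text.split(" "))
```

    A token that consists of nothing but a stray ASCII character is left
    untouched by this sketch, so some manual checking would still be needed.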

    3) If the original is something like a newspaper or newsletter, a story may
    continue on a later page, with other stories in between, leaving you to
    try to piece together a single story that has interruptions of text from
    other stories (and as Tom writes, from headers and footers). In one
    case like this, we were resigned to manually piecing the stories back
    together (it wasn't a language we knew, but you could sort of figure it
    out), when someone (I believe it was Julie Medero, at the LDC)
    discovered that the PDF files in question were built on the fly from
    plain text source files. We happily took the text files instead!

    Assuming that all these kinds of problems are inherent in the way the
    text is stored in (non-bitmap) PDFs, a converter would have to be very
    smart indeed to get well-structured text out reliably.

    But if all you want is to mine new terms, why worry about the formatting
    in the first place?

        Mike Maxwell


