Re: [Corpora-List] PDF Conversion

From: Kristofer Franzén (franzen@sics.se)
Date: Tue Mar 28 2006 - 19:06:09 MET DST

    I recently evaluated both commercial and free software for PDF-to-text
    conversion, and I have come to the depressing conclusion that nothing
    does better than the "Save as Text..." function in Adobe Reader (6.0).

    That said, I don't think I am familiar with the converter from Ted
    Briscoe's group, which Hamish Cunningham mentioned in a reply to your post.

    In my experience, no tool can handle (1) the separation of figure and
    table captions from the running text, (2) unusual characters and symbols
    (Greek, math), and (3) the different ways in which PDF files are encoded.
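
    To make point (1) concrete, here is a minimal sketch of the kind of
    post-processing one ends up writing on top of already-extracted text.
    It is not taken from any of the tools mentioned in this thread. The
    function name split_captions and the caption pattern are hypothetical,
    and it assumes that captions start a line with "Figure N:", "Fig. N."
    or "Table N." and run until the next blank line, which many PDFs will
    not respect.

```python
import re

# Assumed caption convention: "Figure 1:", "Fig. 2." or "Table 3." at line start.
CAPTION_START = re.compile(r"^\s*(?:Figure|Fig\.|Table)\s+\d+\s*[.:]")


def split_captions(text):
    """Split extracted text into (running_text, captions).

    Heuristic only: a caption begins at a line matching CAPTION_START
    and continues until the next blank line or the next caption start.
    """
    body, captions = [], []
    current = []  # lines of the caption being collected, if any

    for line in text.splitlines():
        if CAPTION_START.match(line):
            # A new caption begins; flush any caption still open.
            if current:
                captions.append(" ".join(current))
            current = [line.strip()]
        elif current and line.strip():
            # Non-blank line directly after a caption start: continuation.
            current.append(line.strip())
        else:
            # Blank line ends an open caption; the line itself is body text.
            if current:
                captions.append(" ".join(current))
                current = []
            body.append(line)

    if current:
        captions.append(" ".join(current))
    return "\n".join(body), captions


if __name__ == "__main__":
    sample = "Intro line.\nFigure 1: A plot\nof results.\n\nMore text.\nTable 2. Scores\n"
    running_text, found = split_captions(sample)
    print(running_text)
    print(found)
```

    A sentence in the running text that happens to begin a line with
    "Figure 1:" would be misclassified, and captions that the converter
    interleaves with body lines would not be recovered at all, which is
    exactly why this remains hard in practice.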

    Best,

    Kristofer Franzén

    Ken Litkowski wrote:

    > Is anyone aware of free software that will process PDF documents into
    > text streams? There is a PDF2HTML (with an XML option) that will
    > create page-centric versions, but this does not really distinguish
    > text from format. I want to ignore (or be able to treat separately)
    > such things as headers, footnotes, tables, figures, and equations.
    > (Note that even Google retains the page-centric view.)
    >
    > Thanks,
    > Ken


