Re: [Corpora-List] PDF Conversion

From: Tom Emerson (tree@basistech.com)
Date: Tue Mar 28 2006 - 17:42:20 MET DST


    Ken Litkowski writes:
    > Is anyone aware of free software that will process PDF documents into
    > text streams? There is a PDF2HTML (with an XML option) that will create
    > page-centric versions, but this does not really distinguish text from
    > format. I want to ignore (or be able to treat separately) such things
    > as headers, footnotes, tables, figures, and equations. (Note that even
    > Google retains the page-centric view.)

    PDF is a page-centric format, so you are unlikely to find something
    that does what you are looking for: headers, footnotes, tables, etc.
    are not marked off from the surrounding content in any special way.
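
    Since the format itself carries no such markers, any separation has
    to be guessed from the extracted text. As a rough sketch only (not a
    description of any existing tool), one heuristic is to drop lines
    that recur at the top or bottom of most pages. It assumes the text
    has already been extracted into one list of lines per page; the
    function name, the `edge` and `min_share` parameters, and the
    digit-masking trick for page numbers are all made up for
    illustration.

```python
import re
from collections import Counter


def _key(line):
    # Mask digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, edge=2, min_share=0.6):
    """Drop lines that repeat near the top/bottom of most pages.

    pages     -- list of pages, each a list of text lines (assumed input)
    edge      -- how many lines at each page edge to inspect (assumption)
    min_share -- fraction of pages a line must recur on (assumption)
    """
    counts = Counter()
    for page in pages:
        candidates = page[:edge] + page[-edge:]
        # A set, so each distinct line is counted once per page.
        counts.update({_key(l) for l in candidates if l.strip()})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        last = len(page) - 1
        kept = [
            line
            for i, line in enumerate(page)
            # Only lines inside the edge zones are eligible for removal.
            if not ((i < edge or i > last - edge) and _key(line) in repeated)
        ]
        cleaned.append(kept)
    return cleaned
```

    This only catches running headers and footers; footnotes, tables,
    figures, and equations would each need their own guesswork, which
    is the point made above.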

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    


