Re: [Corpora-List] PDF Conversion

From: Hamish Cunningham (hamish@dcs.shef.ac.uk)
Date: Tue Mar 28 2006 - 17:50:16 MET DST


    Ted Briscoe's group in Cambridge have a PDF converter - you might contact
    them

    Best

    Hamish

    Tom Emerson wrote:
    > Ken Litkowski writes:
    >
    >>Is anyone aware of free software that will process PDF documents into
    >>text streams? There is a PDF2HTML (with an XML option) that will create
    >>page-centric versions, but this does not really distinguish text from
    >>format. I want to ignore (or be able to treat separately) such things
    >>as headers, footnotes, tables, figures, and equations. (Note that even
    >>Google retains the page-centric view.)
    >
    >
    > Given that PDF is a page-centric format, you are unlikely to find
    > something that does what you are looking for: headers, footnotes,
    > tables, etc. are not going to be flagged from the surrounding content
    > in any special way.
    >

    -- 
    Hamish
    http://www.dcs.shef.ac.uk/~hamish/
    


