Re: [Corpora-List] PDF Conversion

From: Constantin Orasan (C.Orasan@wlv.ac.uk)
Date: Tue Mar 28 2006 - 19:09:58 MET DST

    Hi,

    > Is anyone aware of free software that will process PDF documents into
    > text streams? There is a PDF2HTML (with an XML option) that will create
    > page-centric versions, but this does not really distinguish text from
    > format. I want to ignore (or be able to treat separately) such things
    > as headers, footnotes, tables, figures, and equations. (Note that even
    > Google retains the page-centric view.)
    There was a thread on the Corpora list about converting PDF files in
    2001. Here is the link:
    http://torvald.aksis.uib.no/corpora/2001-2/0133.html
    and a summary of the answers:
    http://torvald.aksis.uib.no/corpora/2001-4/0257.html

    However, I doubt any of these programs will solve your problem. All the
    programs I have used break the text into pages. In some cases you can
    write post-processors to identify footnotes and similar elements, but
    these are very often formatting-dependent (i.e. they work well only on
    documents from the same source, e.g. journal articles from a single
    publisher).
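
    To give an idea of what such a post-processor might look like, here is
    a rough Python sketch that separates running headers and footers from
    the body text. It rests on several assumptions of mine: the converter
    marks page breaks with a form feed character, the running line is the
    first or last non-empty line of a page, and it differs between pages
    only in its digits (e.g. the page number). The function names and the
    0.6 threshold are made up for illustration, and the heuristic would
    need tuning for each document source.

```python
import re
from collections import Counter


def _normalize(line):
    # Replace digit runs so "Journal of X 3" and "Journal of X 12" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def split_pages(text):
    # Assumption: the converter separates pages with form feed characters.
    return [page.splitlines() for page in text.split("\f") if page.strip()]


def strip_running_lines(pages, min_share=0.6):
    """Return (body_lines, removed_lines) for a list of pages.

    A first or last non-empty line is treated as a running header/footer
    when its normalized form recurs on at least min_share of the pages
    (and on at least two pages).
    """
    counts = Counter()
    edges = []
    for lines in pages:
        nonempty = [i for i, line in enumerate(lines) if line.strip()]
        page_edges = {nonempty[0], nonempty[-1]}
        edges.append(page_edges)
        counts.update({_normalize(lines[i]) for i in page_edges})

    threshold = max(2, int(min_share * len(pages)))
    running = {key for key, n in counts.items() if n >= threshold}

    body, removed = [], []
    for lines, page_edges in zip(pages, edges):
        for i, line in enumerate(lines):
            if i in page_edges and _normalize(line) in running:
                removed.append(line.strip())
            else:
                body.append(line)
    return body, removed


sample = (
    "Journal of X 1\nFirst para.\nmore\n\f"
    "Journal of X 2\nSecond para.\n\f"
    "Journal of X 3\nThird.\n"
)
body, removed = strip_running_lines(split_pages(sample))
```

    On the toy sample, the three "Journal of X n" lines end up in removed
    and the rest stays in body. Footnotes are harder, since their position
    and numbering style vary from one publisher to the next.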

    Regards,

    Constantin

    -- 
    Constantin Orasan <C.Orasan@wlv.ac.uk>
    http://www.wlv.ac.uk/~in6093/
    Lecturer in Computational Linguistics
    Research Group in Computational Linguistics
    University of Wolverhampton
    


