RE: [Corpora-List] PDF Conversion

From: Victor Kapustin (victor.kapustin@gmail.com)
Date: Wed Mar 29 2006 - 16:34:57 MET DST


    Ken,

    > Is anyone aware of free software that will process PDF documents into
    > text streams? There is a PDF2HTML (with an XML option) that will
    > create page-centric versions, but this does not really distinguish
    > text from format. I want to ignore (or be able to treat separately)
    > such things as headers, footnotes, tables, figures, and equations.
    > (Note that even Google retains the page-centric view.)
    gsview: http://www.cs.wisc.edu/~ghost/gsview/index.htm - includes pstotext

    For batch processing: pstotext - extracting plain text from PostScript:
    http://www.cs.wisc.edu/~ghost/doc/pstotext.htm

    Both require GhostScript (http://www.cs.wisc.edu/~ghost/doc/AFPL/get853.htm)
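
    A minimal sketch of what such a batch run could look like in Python, assuming
    pstotext is on the PATH and prints the extracted text to standard output when
    given a single input file. The helper names (build_command, run_tool,
    convert_all), the .ps/.pdf suffix list and the Latin-1 decoding are my own
    assumptions for illustration, not part of pstotext itself:

```python
import subprocess
from pathlib import Path


def build_command(source, tool="pstotext"):
    # Assumption: the tool writes extracted text to standard output
    # when called with just an input file.
    return [tool, str(source)]


def run_tool(command):
    # Run the external converter and return its output as a string.
    # Assumption: the output is Latin-1; adjust if your build differs.
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("latin-1")


def convert_all(directory, runner=run_tool, suffixes=(".ps", ".pdf")):
    # Convert every matching file in `directory`, writing <name>.txt next to it.
    written = []
    for source in sorted(Path(directory).iterdir()):
        if source.suffix.lower() not in suffixes:
            continue
        target = source.with_suffix(".txt")
        target.write_text(runner(build_command(source)), encoding="utf-8")
        written.append(target)
    return written
```

    Calling convert_all("papers") would then leave one .txt file beside each
    input. The runner argument is there only so the loop can be tried out
    without the external tool installed.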

    For me they do a good job, though equations (and text fragments such as
    plot axis labels) pollute the text.
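
    One crude way to reduce that pollution afterwards is to drop lines that are
    mostly digits and symbols. This is only a heuristic sketch of my own, not
    something pstotext or gsview provides; the function names and the 0.7
    threshold are assumptions, and short alphabetic labels will still get
    through:

```python
def letter_ratio(line):
    # Share of alphabetic characters among the non-space characters.
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 1.0  # treat blank lines as prose so paragraph breaks survive
    return sum(ch.isalpha() for ch in chars) / len(chars)


def drop_non_prose(text, threshold=0.7):
    # Keep only lines whose letter ratio reaches the (assumed) threshold.
    kept = [ln for ln in text.splitlines() if letter_ratio(ln) >= threshold]
    return "\n".join(kept)
```

    Rows of axis tick values and simple formulas tend to fall below the
    threshold, while ordinary sentences stay well above it.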

    --
    Victor Kapustin
    Saint-Petersburg State Univ.
    Russia
    


