Ken,
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will
> create page-centric versions, but this does not really distinguish
> text from format. I want to ignore (or be able to treat separately)
> such things as headers, footnotes, tables, figures, and equations.
> (Note that even Google retains the page-centric view.)
gsview: http://www.cs.wisc.edu/~ghost/gsview/index.htm - includes pstotext
For batch processing: pstotext - extracting plain text from PostScript:
http://www.cs.wisc.edu/~ghost/doc/pstotext.htm
Both require GhostScript (http://www.cs.wisc.edu/~ghost/doc/AFPL/get853.htm)
For me they do a good job, though equations (and text fragments like plot axis
marks) still pollute the extracted text.
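
For the batch case, here is a minimal sketch of how one might drive pstotext from a script. It assumes pstotext (and Ghostscript) are installed and on the PATH, and that pstotext writes the extracted text to standard output when given a file name. The function names (build_command, output_path_for, convert_all) and the directory layout are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, executable: str = "pstotext") -> list[str]:
    """Return the argv list for one file; the text is expected on stdout."""
    return [executable, str(source)]


def output_path_for(source: Path, out_dir: Path) -> Path:
    """Map e.g. in/paper.pdf to out/paper.txt (naming scheme is an assumption)."""
    return out_dir / (source.stem + ".txt")


def convert_all(in_dir: Path, out_dir: Path, executable: str = "pstotext") -> list[Path]:
    """Run the extractor on every .ps/.pdf file in in_dir, one .txt per input.

    Raises FileNotFoundError if the executable is not installed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    candidates = sorted(
        p for p in in_dir.iterdir() if p.suffix.lower() in {".ps", ".pdf"}
    )
    for source in candidates:
        result = subprocess.run(
            build_command(source, executable), capture_output=True, check=False
        )
        if result.returncode != 0:
            # Leave failed files out rather than writing partial text.
            print(f"skipped {source.name}: exit code {result.returncode}")
            continue
        target = output_path_for(source, out_dir)
        target.write_bytes(result.stdout)
        written.append(target)
    return written
```

Calling convert_all(Path("papers"), Path("text")) would then leave one .txt file per input document. The equation and axis fragments mentioned above will still be in the output; this only automates the extraction step.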
-- Victor Kapustin Saint-Petersburg State Univ. Russia