Hi,
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will create
> page-centric versions, but this does not really distinguish text from
> format. I want to ignore (or be able to treat separately) such things
> as headers, footnotes, tables, figures, and equations. (Note that even
> Google retains the page-centric view.)
There was a thread on the Corpora list about the conversion of PDF files in 2001.
Here are the links:
http://torvald.aksis.uib.no/corpora/2001-2/0133.html
and a summary of the answers:
http://torvald.aksis.uib.no/corpora/2001-4/0257.html
However, I doubt any of these programs will solve your problem. All the
programs I have used break the text into pages. In some cases you can
write post-processors to identify footnotes and the like, but very
often they are format-dependent (i.e. they will work well only on
documents from the same source - e.g. journal articles from a single
publisher).
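
Just as a rough illustration of what such a post-processor might look like
(not one of the tools mentioned in the thread): the Python sketch below reads
pdftotext output, drops lines that repeat at the top or bottom of most pages
(typical running headers/footers) plus bare page numbers, and rejoins words
hyphenated across line breaks. The form-feed page separator and the repetition
threshold are assumptions you would have to tune for each document source.

import re
import sys
from collections import Counter

# Sketch of a post-processor for `pdftotext` output.
# Assumes pages are separated by form-feed characters (\f), which is
# what pdftotext emits by default. Header/footer detection is purely
# heuristic and will need adjusting per document source.

def strip_headers_footers(pages, max_candidates=2):
    """Drop lines that repeat at the top or bottom of most pages."""
    tops, bottoms = Counter(), Counter()
    for page in pages:
        lines = [l.strip() for l in page.splitlines() if l.strip()]
        if not lines:
            continue
        for l in lines[:max_candidates]:
            tops[l] += 1
        for l in lines[-max_candidates:]:
            bottoms[l] += 1
    # A line counts as a header/footer if it appears on at least half the pages.
    threshold = max(2, len(pages) // 2)
    repeated = {l for l, n in (tops + bottoms).items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            l for l in page.splitlines()
            if l.strip() not in repeated
            and not re.fullmatch(r"\d{1,4}", l.strip())  # bare page numbers
        ]
        cleaned.append("\n".join(kept))
    return cleaned

def join_pages(pages):
    """Merge pages into one stream, rejoining hyphenated line breaks."""
    text = "\n".join(pages)
    text = re.sub(r"-\n(\w)", r"\1", text)   # de-hyphenate across line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text

if __name__ == "__main__":
    pages = sys.stdin.read().split("\f")
    print(join_pages(strip_headers_footers(pages)))

You could run it as, for example:
pdftotext article.pdf - | python postprocess.py > article.txt
but again, anything beyond headers/footers (tables, figures, equations)
will need rules specific to the layout of your source documents.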
Regards,
Constantin
--
Constantin Orasan <C.Orasan@wlv.ac.uk>  http://www.wlv.ac.uk/~in6093/
Lecturer in Computational Linguistics
Research Group in Computational Linguistics
University of Wolverhampton