Hi,
We tested the MultiValent tool for extracting text from PDF files and found that it works quite well.
To identify figures, tables, etc., you need to add a post-processor that uses heuristic algorithms. We tried some algorithms for tables and figures and got reasonably good results.
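A minimal sketch of what such a heuristic post-processor might look like, assuming the extracted text arrives as a list of plain-text lines. It is illustrative only and is not the algorithm used in the tests mentioned above. The function names, the "two or more spaces" column-gap rule, the minimum row count, and the caption pattern are all assumptions and would need tuning for real documents.

```python
import re

# Assumption: table columns are separated by runs of two or more spaces.
COLUMN_GAP = re.compile(r"\s{2,}")
# Assumption: captions start with "Figure N", "Fig. N" or "Table N".
CAPTION = re.compile(r"^(Figure|Fig\.|Table)\s+\d+", re.IGNORECASE)


def looks_like_table_row(line, min_cols=3):
    """Return True if the line splits into at least min_cols cells."""
    cells = [c for c in COLUMN_GAP.split(line.strip()) if c]
    return len(cells) >= min_cols


def label_lines(lines, min_rows=2):
    """Label each line as 'text', 'caption' or 'table'."""
    labels = ["text"] * len(lines)

    # Pass 1: mark caption lines.
    for i, line in enumerate(lines):
        if CAPTION.match(line.strip()):
            labels[i] = "caption"

    # Pass 2: mark runs of at least min_rows consecutive row-like lines.
    i = 0
    while i < len(lines):
        j = i
        while (j < len(lines) and labels[j] == "text"
               and looks_like_table_row(lines[j])):
            j += 1
        if j - i >= min_rows:
            for k in range(i, j):
                labels[k] = "table"
        i = max(j, i + 1)
    return labels


if __name__ == "__main__":
    sample = [
        "Results are shown below.",
        "Table 1: Scores",
        "System   Precision   Recall",
        "A        0.81        0.77",
        "B        0.79        0.80",
        "Figure 2. Architecture",
        "The end.",
    ]
    for label, line in zip(label_lines(sample), sample):
        print(f"{label:8} {line}")
```

Lines labelled 'table' or 'caption' can then be dropped or routed to separate handling, leaving only the running text.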
Scott Piao
-------------------
Computing Department
Lancaster University
Lancaster LA1 4WA
UK
-----Original Message-----
From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On Behalf Of Ken Litkowski
Sent: 28 March 2006 16:35
To: corpora@hd.uib.no
Subject: [Corpora-List] PDF Conversion
Is anyone aware of free software that will process PDF documents into text streams? There is a PDF2HTML (with an XML option) that will create page-centric versions, but this does not really distinguish text from format. I want to ignore (or be able to treat separately) such things as headers, footnotes, tables, figures, and equations. (Note that even Google retains the page-centric view.)
Thanks,
Ken
--
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken@clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA
Home Page: http://www.clres.com