RE: [Corpora-List] PDF Conversion

From: Piao, Songlin (s.piao@lancaster.ac.uk)
Date: Wed Mar 29 2006 - 16:16:46 MET DST

Next message: Brett Reynolds: "Re: [Corpora-List] Graded reading materials"

Previous message: Doug Cooper: "Re: [Corpora-List] Graded reading materials"
Maybe in reply to: Ken Litkowski: "[Corpora-List] PDF Conversion"
Next in thread: Rayson, Paul: "RE: [Corpora-List] PDF Conversion"

Hi,

We tested the MultiValent tool for extracting text from PDF files and found that it works quite well.

To identify figures, tables, and similar elements, you need to add a post-processor that applies heuristic algorithms. We tried a few such algorithms for tables and figures and got reasonably good results.
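
To give a rough idea of what such a post-processor might look like, here is a minimal Python sketch. It is not the code we used, only an illustration under simple assumptions: the extracted text arrives as plain lines, table rows show up as three or more cells separated by runs of two or more spaces, and captions start with "Figure", "Fig." or "Table" followed by a number. All function names and the min_columns threshold are hypothetical.

```python
import re

# Assumption: captions begin with "Figure", "Fig." or "Table" plus a number.
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.|Table)\s+\d+", re.IGNORECASE)
# Assumption: table cells are separated by runs of two or more whitespace characters.
COLUMN_GAP_RE = re.compile(r"\s{2,}")


def looks_like_table_row(line, min_columns=3):
    """Return True if the line splits into at least min_columns cells."""
    cells = [c for c in COLUMN_GAP_RE.split(line.strip()) if c]
    return len(cells) >= min_columns


def classify_line(line):
    """Label a line as 'blank', 'caption', 'table' or 'text'."""
    if not line.strip():
        return "blank"
    if CAPTION_RE.match(line):
        return "caption"
    if looks_like_table_row(line):
        return "table"
    return "text"


def split_extracted_text(text):
    """Separate running text from caption and table lines.

    Returns (body, extras): body is a list of prose lines, extras is a
    list of (kind, line) pairs for lines set aside as captions or tables.
    """
    body, extras = [], []
    for line in text.splitlines():
        kind = classify_line(line)
        if kind in ("caption", "table"):
            extras.append((kind, line.strip()))
        elif kind == "text":
            body.append(line.strip())
    return body, extras
```

A heuristic like this will misfire on some input, for example a sentence that begins "Table 1 shows ...", so in practice the rules need tuning against the documents at hand.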

Scott Piao
-------------------
Computing Department
Lancaster University
Lancaster LA1 4WA
UK

-----Original Message-----
From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On Behalf Of Ken Litkowski
Sent: 28 March 2006 16:35
To: corpora@hd.uib.no
Subject: [Corpora-List] PDF Conversion

Is anyone aware of free software that will process PDF documents into text streams? There is a PDF2HTML (with an XML option) that will create page-centric versions, but this does not really distinguish text from format. I want to ignore (or be able to treat separately) such things as headers, footnotes, tables, figures, and equations. (Note that even Google retains the page-centric view.)

Thanks,
Ken

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken@clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com


This archive was generated by hypermail 2b29 : Wed Mar 29 2006 - 19:03:03 MET DST