RE: [Corpora-List] PDF Conversion

From: Piao, Songlin (s.piao@lancaster.ac.uk)
Date: Wed Mar 29 2006 - 16:16:46 MET DST

    Hi,

    We tested the MultiValent tool for extracting text from PDF files and found that it works quite well.

    To identify figures, tables, and similar elements, you need to add a post-processor based on heuristic algorithms. We tried a few such algorithms for tables and figures and got reasonably good results.
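
    To give a rough idea of what such a post-processor might look like, here is a minimal sketch in Python. It is not the code we ran and does not use any MultiValent API: it assumes the text has already been extracted into a list of lines with column spacing preserved, and the names (classify_line, split_body) and both heuristics (caption prefixes, three or more cells separated by wide gaps) are illustrative assumptions only.

```python
import re

# Assumed heuristic: captions start with "Figure N", "Fig. N" or "Table N".
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.|Table)\s+\d+", re.IGNORECASE)
# Assumed heuristic: runs of two or more whitespace characters separate table cells.
WIDE_GAP_RE = re.compile(r"\s{2,}")


def classify_line(line):
    """Return 'caption', 'table', or 'text' for one line of extracted text."""
    if CAPTION_RE.match(line):
        return "caption"
    # Split on wide gaps; three or more non-empty cells suggests a table row.
    cells = [c for c in WIDE_GAP_RE.split(line.strip()) if c]
    if len(cells) >= 3:
        return "table"
    return "text"


def split_body(lines):
    """Separate running text from lines that look like tables or captions."""
    body, other = [], []
    for line in lines:
        if classify_line(line) == "text":
            body.append(line)
        else:
            other.append(line)
    return body, other


if __name__ == "__main__":
    sample = [
        "We tested the tool on several files.",
        "Table 1: Results by corpus",
        "BNC    100    0.91",
        "Figure 2. Pipeline overview",
    ]
    body, other = split_body(sample)
    print("body:", body)
    print("other:", other)
```

    Rules this simple will misfire on justified text or on sentences that happen to begin with "Table 3 shows", so in practice the thresholds need tuning against the documents at hand.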

    Scott Piao
    -------------------
    Computing Department
    Lancaster University
    Lancaster LA1 4WA
    UK

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On Behalf Of Ken Litkowski
    Sent: 28 March 2006 16:35
    To: corpora@hd.uib.no
    Subject: [Corpora-List] PDF Conversion

    Is anyone aware of free software that will process PDF documents into text streams? There is a PDF2HTML (with an XML option) that will create page-centric versions, but this does not really distinguish text from format. I want to ignore (or be able to treat separately) such things as headers, footnotes, tables, figures, and equations. (Note that even Google retains the page-centric view.)

    Thanks,
            Ken

    -- 
    Ken Litkowski                     TEL.: 301-482-0237
    CL Research                       EMAIL: ken@clres.com
    9208 Gue Road
    Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com
    


