RE: [Corpora-List] PDF Conversion

From: Rayson, Paul (rayson@exchange.lancs.ac.uk)
Date: Wed Mar 29 2006 - 23:12:13 MET DST

    Hi Ken, all,

    Just to add to Scott's note about Multivalent, the website is:

    http://multivalent.sourceforge.net/

    We compared it to Adobe Acrobat v6 and v7 and found that Multivalent is
    much more accurate at extracting text and preserving text flow in
    two-column layouts (such as those in the ACL Anthology). Obviously this
    applies to text-based PDFs. For image-based PDFs (I'm not sure what
    percentage of the ACL Anthology these make up), OCR seems to be the only
    way to go, with, say, OmniPage Pro v14. Even with Multivalent and
    text-based PDFs, you still need post-processing to deal with ligatures
    (ffi, fi, fl, ff, ffl) and extended ASCII codes (>127) before you can pop
    the output into Unix/Linux-flavour tools. This is important for building
    word lists and finding new lexical items!
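
    As a rough illustration of that kind of post-processing, here is a
    minimal Python sketch. It assumes the extractor emits the ligatures as
    the Unicode code points U+FB00 to U+FB04 (some PDFs use font-specific
    encodings instead, which this would not catch), and the function name
    normalise is purely hypothetical. Note that the final step is lossy:
    anything that cannot be reduced to ASCII is dropped.

```python
import unicodedata

# Explicit map for the five Latin ligatures mentioned above (U+FB00..U+FB04).
# Assumption: the extracted text contains these Unicode code points.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def normalise(text: str) -> str:
    """Expand ligatures, then reduce the rest to plain ASCII where possible."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever still lies outside ASCII (codes > 127). This is lossy.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

    Without a step like this, a word such as "efficient" extracted with an
    "ffi" ligature would show up in a word list as a separate, spurious
    lexical item.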

    Regards,
    Paul.

    Dr. Paul Rayson
    Director of UCREL
    Computing Department, Infolab21, South Drive, Lancaster University,
    Lancaster, LA1 4WA, UK.
    Web: http://www.comp.lancs.ac.uk/computing/users/paul/
    Tel: +44 1524 510357 Fax: +44 1524 510492

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Piao, Songlin
    Sent: 29 March 2006 15:17
    To: Ken Litkowski; corpora@hd.uib.no
    Subject: RE: [Corpora-List] PDF Conversion

    Hi,

    We tested the Multivalent tool for extracting text from PDF files and
    found that it works pretty well.

    To identify figures, tables, etc., you need to add a post-processor
    using some heuristic algorithms. We tried some algorithms for tables and
    figures and got reasonably good results.
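
    Purely as an illustration of what such a heuristic might look like (this
    is not necessarily one of the algorithms referred to above), the Python
    sketch below flags a line as table-like when it splits into several
    cells separated by wide gaps. It assumes the extracted text keeps column
    gaps as runs of two or more spaces; the function names and the min_cols
    threshold are hypothetical.

```python
import re


def looks_tabular(line: str, min_cols: int = 3) -> bool:
    """Guess whether a line of extracted text is a table row.

    Assumption: column gaps survive extraction as runs of 2+ whitespace
    characters, while running prose uses single spaces.
    """
    cells = [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
    return len(cells) >= min_cols


def split_body_and_tables(lines):
    """Separate running text from lines that look like table rows."""
    body, tables = [], []
    for line in lines:
        (tables if looks_tabular(line) else body).append(line)
    return body, tables
```

    A real post-processor would likely combine several such cues (for
    example caption prefixes such as "Table 1" or "Figure 2") and look at
    neighbouring lines rather than judging each line in isolation.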

    Scott Piao
    -------------------
    Computing Department
    Lancaster University
    Lancaster LA1 4WA
    UK

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Ken Litkowski
    Sent: 28 March 2006 16:35
    To: corpora@hd.uib.no
    Subject: [Corpora-List] PDF Conversion

    Is anyone aware of free software that will process PDF documents into
    text streams? There is a PDF2HTML (with an XML option) that will create
    page-centric versions, but this does not really distinguish text from
    format. I want to ignore (or be able to treat separately) such things
    as headers, footnotes, tables, figures, and equations. (Note that even
    Google retains the page-centric view.)

    Thanks,
            Ken

    -- 
    Ken Litkowski                     TEL.: 301-482-0237
    CL Research                       EMAIL: ken@clres.com
    9208 Gue Road
    Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com
    


