Hi,
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will create
> page-centric versions, but this does not really distinguish text from
> format. I want to ignore (or be able to treat separately) such things
> as headers, footnotes, tables, figures, and equations. (Note that even
> Google retains the page-centric view.)
There was a thread on the Corpora list about the conversion of PDF files in 2001.
Here are the links:
http://torvald.aksis.uib.no/corpora/2001-2/0133.html
and a summary of the answers:
http://torvald.aksis.uib.no/corpora/2001-4/0257.html
However, I doubt any of these programs will solve your problem. All the
programs I have used break the text into pages. In some cases you can
write post-processors to identify footnotes and the like, but very
often they are format-dependent (i.e. they will work well only on
documents from the same source - e.g. journal articles from a single
publisher).
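
Just as a rough illustration of what such a post-processor might look like
(not one of the tools mentioned in the thread): the Python sketch below reads
pdftotext output, drops lines that repeat at the top or bottom of most pages
(typical running headers/footers) plus bare page numbers, and rejoins words
hyphenated across line breaks. The form-feed page separator and the repetition
threshold are assumptions you would have to tune for each document source.

import re
import sys
from collections import Counter

# Sketch of a post-processor for `pdftotext` output.
# Assumes pages are separated by form-feed characters (\f), which is
# what pdftotext emits by default. Header/footer detection is purely
# heuristic and will need adjusting per document source.

def strip_headers_footers(pages, max_candidates=2):
    """Drop lines that repeat at the top or bottom of most pages."""
    tops, bottoms = Counter(), Counter()
    for page in pages:
        lines = [l.strip() for l in page.splitlines() if l.strip()]
        if not lines:
            continue
        for l in lines[:max_candidates]:
            tops[l] += 1
        for l in lines[-max_candidates:]:
            bottoms[l] += 1
    # A line counts as a header/footer if it appears on at least half the pages.
    threshold = max(2, len(pages) // 2)
    repeated = {l for l, n in (tops + bottoms).items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            l for l in page.splitlines()
            if l.strip() not in repeated
            and not re.fullmatch(r"\d{1,4}", l.strip())  # bare page numbers
        ]
        cleaned.append("\n".join(kept))
    return cleaned

def join_pages(pages):
    """Merge pages into one stream, rejoining hyphenated line breaks."""
    text = "\n".join(pages)
    text = re.sub(r"-\n(\w)", r"\1", text)   # de-hyphenate across line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text

if __name__ == "__main__":
    pages = sys.stdin.read().split("\f")
    print(join_pages(strip_headers_footers(pages)))

You could run it as, for example:
pdftotext article.pdf - | python postprocess.py > article.txt
but again, anything beyond headers/footers (tables, figures, equations)
will need rules specific to the layout of your source documents.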
Regards,
Constantin
--
Constantin Orasan <C.Orasan@wlv.ac.uk>  http://www.wlv.ac.uk/~in6093/
Lecturer in Computational Linguistics
Research Group in Computational Linguistics
University of Wolverhampton