RE: Corpora: Converting PDF files

From: Mike O'Connell (Michael.Oconnell@Colorado.EDU)
Date: Fri Dec 28 2001 - 20:27:02 MET

    Have any of you tried the suite of PDF translation products offered by BCL
    computers? Check:

    http://www.bcl-computers.com/

    Regards,
       Michael O'Connell

    On Fri, 28 Dec 2001, Tolkin, Steve wrote:

    > Oh, I wish it were so easy!
    >
    > Summary:
    > I believe there are several problems that affect all the approaches.
    > 1. Ligatures e.g. fi, ff, ffi, Fi, etc. are emitted as special
    > control characters, e.g. the single character ^L.
    > 2. Words that had a hyphen introduced due to a line ending
    > are emitted in two pieces.
    >
    > Details:
    > 1. Just as an example here is the last part of page 7 of
    > http://www.cs.columbia.edu/~min/papers/cucs-002-01.pdf
    > that I created by copying with the text tool and then pasting into my
    > editor (emacs). Note that I have replaced the actual single
    > characters ^L and ^K by a two character pair so you would see them in
    > this email. The original file contained a single character ^L (aka
    > Control-l, C-l, octal 014, hexadecimal 0xc etc.) Note also that ^L is
    > used for two different purposes: for the ligature fi and to denote a
    > page break. ^K is used for "ff".
    >
    > <quote>
    > The relative di^Kerence between these features across headers within a
    > document seems to dictate their nesting depth. Header thus computes
    > its ^Lnal feature set based on the di^Kerences in the values of these
    > initial features in adjacent headers, shown in Table 3. This
    > corresponds to learning whether one header dominates, is dominated by,
    > or is on parity with an adjacent header. These pairwise features are
    > Header's output and are passed on to the Combiner ^Lnal machine
    > learning module.
    > 7
    > ^L
    > </quote>
    > Unfortunately the approach of having the file read by
    > Ghostview (and processed by Ghostscript) is even worse.
    > All the above errors appear, as well as another kind of error where it
    > cannot read the contents due to some font problem or other issue,
    > and so uses ### instead, e.g. the last sentence becomes:
    > <quote>
    > These pairwise features are ######'s output and are
    > passed on to the ######## ^Lnal machine learning module.
    > </quote>
    >
    > Unfortunately there are many more ligatures than this, e.g. fl,
    > including some with three letters: ffi, etc. They also
    > can occur anywhere in a word, e.g. specific became "speci^Lc".
    >
    > I seem to recall that the particular assignments used by Acrobat,
    > i.e. which control code is used for which ligature,
    > vary. (If anyone could provide more information about
    > this I would appreciate it.)
    >
    > Assuming you have a big dictionary this problem can be
    > partially remedied as follows:
    > Find all words containing a ligature and scan the text
    > looking for the assignment (i.e. on a per document level).
    > Then fix them using the inferred mapping.
    >
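The per-document inference described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not an existing tool: the function names, the candidate list LIGATURES, and the representation of the dictionary as a set of lowercase words are all hypothetical. Each control character votes for the ligature that turns the most damaged tokens into dictionary words, and a control character standing alone (such as ^L used as a page break) is left untouched.

```python
import re
from collections import Counter

# Candidate expansions to try for each control character (an assumption;
# extend the list if a document uses other ligatures).
LIGATURES = ["ffi", "ffl", "ff", "fi", "fl"]

# Control characters other than tab, newline and carriage return.
CTRL = r"\x00-\x08\x0b\x0c\x0e-\x1f"
TOKEN = re.compile("[A-Za-z" + CTRL + "]+")


def infer_ligature_map(text, dictionary):
    """Guess, for one document, which control character stands for which ligature."""
    votes = {}  # control character -> Counter of ligatures that produced real words
    for token in TOKEN.findall(text):
        controls = {ch for ch in token if ord(ch) < 32}
        has_letter = any(ch.isalpha() for ch in token)
        # Only tokens with letters and exactly one kind of control character vote.
        if len(controls) != 1 or not has_letter:
            continue
        ctrl = controls.pop()
        for lig in LIGATURES:
            if token.replace(ctrl, lig).lower() in dictionary:
                votes.setdefault(ctrl, Counter())[lig] += 1
    # Keep the ligature with the most supporting words for each control character.
    return {ctrl: counts.most_common(1)[0][0] for ctrl, counts in votes.items()}


def fix_ligatures(text, dictionary):
    """Apply the inferred mapping inside words; leave lone control characters alone."""
    mapping = infer_ligature_map(text, dictionary)

    def repair(match):
        token = match.group(0)
        if not any(ch.isalpha() for ch in token):
            return token  # e.g. a bare ^L marking a page break
        for ctrl, lig in mapping.items():
            token = token.replace(ctrl, lig)
        return token

    return TOKEN.sub(repair, text)
```

With a dictionary containing "difference", "final" and "specific", the sample text quoted earlier would map ^K to "ff" and ^L to "fi". Words missing from the dictionary contribute no votes, so the quality of the result depends on dictionary coverage.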
    > Aside: This is similar to the problem with ligatures in *.ps files
    > which the ps2text program tries to fix, e.g. here is an excerpt:
    > <quote>
    > #
    > # Process the filtered PostScript with $ps2txt_cmd and clean up its output.
    > # Substitute \ddd characters with correct combinations.
    > #
    > open(PS2TXT, "$ps2txt_cmd $dviflag < $tmpfile |") || die "Cannot run ps2txt";
    > while (<PS2TXT>) {
    > next if (/^\n/o);
    > chop;
    > if (/^.*\\.*$/o) {
    > s/\\214/fi/g;
    > s/\\256/fi/g;
    > s/\\257/fl/g;
    > s/\\320//g;
    > </quote>
    >
    > 2. When converting Adobe Acrobat *.pdf file to text
    > there are often many hyphenated words.
    > Here is an example from p. 11 of the same document above.
    > <quote>
    > To further analyze CLASP's performance,
    > we assess the features used by Ripper, since it implicitly does feature
    > selec-
    > tion when constructing its hypothesis.
    > </quote>
    >
    > In certain cases the frequency of hyphenated words is very high.
    > For example the U.S. IRS presents its publications
    > using 3 columns, and so there are many hyphenated words introduced.
    >
    > Assuming you have a big dictionary this problem can be
    > partially remedied as follows:
    > If removing the hyphen produces a word, and neither fragment
    > is a word then we simply store the word, e.g.
    > "ap-propriate" becomes "appropriate".
    > My coinage for this process: "dehyphenization".
    >
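The "dehyphenization" rule stated above could be sketched in Python along these lines. It is a minimal illustration, not a finished tool: the function name and the representation of the dictionary as a set of lowercase words are assumptions. A word split by a hyphen at a line end is rejoined only when the joined form is in the dictionary and neither fragment is a word on its own.

```python
import re

# A word fragment, a hyphen at the end of a line, then the next fragment.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-[ \t]*\n[ \t]*([A-Za-z]+)")


def dehyphenate(text, dictionary):
    """Rejoin line-end hyphenated words when the dictionary supports the join."""

    def known(word):
        return word.lower() in dictionary

    def join(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        # Join only if the whole is a word and neither fragment is one.
        if known(joined) and not known(head) and not known(tail):
            return joined
        return match.group(0)  # otherwise leave the text as it was

    return LINE_END_HYPHEN.sub(join, text)
```

Applied to the excerpt above, "selec-" followed by "tion" on the next line becomes "selection", while a genuine compound such as "well-" / "known" stays split because both fragments are dictionary words. Joining removes the line break, so lines may need re-wrapping afterwards.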
    > Requests for Additional information:
    >
    > If anyone has tools, e.g. in perl, to perform either of the
    > fix up workarounds above I would like to know about them.
    >
    > It may be that these problems can be minimized by
    > the use of some options when creating the *.pdf file.
    > If so I would like to learn about that. (But I believe
    > once the file is created you are stuck.)
    >
    > Google seems to have a decent *.pdf to *.html convertor
    > and I would be interested in any information about that.
    >
    >
    > Hopefully helpfully yours,
    > Steve
    > --
    > Steven Tolkin steve.tolkin@fmr.com 617-563-0516
    > Fidelity Investments 82 Devonshire St. V1D Boston MA 02109
    > There is nothing so practical as a good theory. Comments are by me,
    > not Fidelity Investments, its subsidiaries or affiliates.
    >
    > > -----Original Message-----
    > > From: ramesh@clg2.bham.ac.uk [mailto:ramesh@clg2.bham.ac.uk]
    > > Sent: Friday, December 28, 2001 9:55 AM
    > > To: corpora@hd.uib.no
    > > Subject: Corpora: Converting PDF files
    > >
    > >
    > >
    > > Dear All
    > >
    > > In May 2001, I asked:
    > > I'm working on a PC with Windows95.
    > > I have MSWord 2000, Acrobat Reader5, and GSview3.6.
    > > Can anyone tell me if it is possible to convert
    > > PDF files into ASCII or MSWord?
    > > And how....
    > >
    > > I received many helpful replies, and
    > > promised to post a summary, but forgot.
    > >
    > > A colleague has just asked me about the same problem,
    > > which reminded me that I did not post the summary.
    > >
    > > So here it is. Apologies to anyone I have
    > > forgotten.
    > >
    > > Best
    > > Ramesh Krishnamurthy
    > > Consultant: COBUILD, Collins Dictionaries.
    > > Hon. Res. Fellow: University of Birmingham.
    > > Hon. Res. Fellow: University of Wolverhampton.
    > >
    > >
    > > 1. Kevin McTait (UMIST):
    > > try the auto-email service at:
    > > http://www.pdfzone.com/services/access.html
    > >
    > > 2. Ha Le An (Wolverhampton Uni):
    > > The simplest way is to select all, copy from Acrobat Reader, and
    > > paste into Word, but there is no way to keep the formatting, images,
    > > tables, etc.
    > >
    > > 3. Fabio Tamburini (Bologna):
    > > Open the file with GhostView, then choose menu EDIT, then "Text
    > > Extract..." and an ASCII text file will be produced...
    > > Pay attention to the formatting of the new file! ;-)
    > > I have GSview3.3, but this feature should also be available in 3.6...
    > >
    > > 4. Mike Scott (Liverpool):
    > > Adobe Acrobat, the full version, not just the Reader,
    > > will export to various formats, haven't checked
    > > them all yet though.
    > >
    > > 5. Chris Tribble (Sri Lanka):
    > > I do this with the full Acrobat - I use version 4. This has a text
    > > selection tool. Once you've clicked on this you can use Ctrl-A to
    > > select all text in the document if you've selected View, Continuous.
    > > This text can then be pasted into a Notepad or Word document.
    > >
    > > 6. Acrobat has an export to Postscript option. Then you can use a
    > > `postscript-to-text' converter.
    > >
    > > 7. Everita Milconoka (Latvia):
    > > You may try to send your .pdf file to
    > > access-b@Adobe.COM
    > > and then in subject line you have to write either pdf2txt or pdf2htm,
    > > and after some minutes they will send you back the file in .txt or
    > > .htm format.
    > >
    > > 8. Steven Krauwer (Netherlands):
    > > Adobe offers on-line and email facilities for this
    > > at http://access.adobe.com:80/simple_form.html
    > >
    > > 9. Philip Resnik (Maryland):
    > > The solution was at
    > > http://www.research.compaq.com/SRC/virtualpaper/pstotext.html --
    > > it seems to work very nicely for pdf2txt conversion at least
    > > in the Unix version.
    > >
    > > 10. Simon G. J. Smith (Birmingham):
    > > MSWord -- www.adobe.com will do free conversions FROM Word (they get
    > > emailed back to you, and you can only do about 5 per email address),
    > > but I don't know about the other way round.
    > > To extract text from Acrobat (mine is 4.0) choose the text select
    > > tool (capital T with a little box). Then just cut and paste the text
    > > you want. This works one page at a time.
    > > From GhostView (if it can read your particular PDF; it sometimes
    > > doesn't work for me), do the whole thing at once by Edit|Text
    > > Extract. It's in the GSview help.
    > > You can convert whole pages to bitmaps with GSview, and I think in
    > > Acrobat you can select graphics from the PDF file (the Acrobat help
    > > says use the graphics select tool, but I can't find this tool). The
    > > bitmap file can then be viewed from Word.
    > >
    > > 14. Jerome Richalot (Lyon)
    > > Acrobat 5 apparently makes all the difference. You can download a
    > > plug-in from adobe.com called Access and add it to Acrobat to
    > > convert from pdf to rtf.
    > >
    >
    >



    This archive was generated by hypermail 2b29 : Fri Dec 28 2001 - 20:29:33 MET