Corpora: Converting PDF files

From: ramesh@clg2.bham.ac.uk
Date: Fri Dec 28 2001 - 15:54:34 MET

    Dear All

    In May 2001, I asked:
    I'm working on a PC with Windows 95.
    I have MS Word 2000, Acrobat Reader 5, and GSview 3.6.
    Can anyone tell me whether it is possible to convert
    PDF files into ASCII or MS Word?
    And if so, how?

    I received many helpful replies, and
    promised to post a summary, but forgot.

    A colleague has just asked me about the same problem,
    which reminded me that I did not post the summary.

    So here it is. Apologies to anyone I have
    forgotten.

    Best
    Ramesh Krishnamurthy
    Consultant: COBUILD, Collins Dictionaries.
    Hon. Res. Fellow: University of Birmingham.
    Hon. Res. Fellow: University of Wolverhampton.

    1. Kevin McTait (UMIST):
    try the auto-email service at:
    http://www.pdfzone.com/services/access.html

    2. Ha Le An (Wolverhampton Uni):
    The simplest way is to select all, copy from Acrobat Reader, and paste into
    Word, but this does not preserve the formatting, images, tables, etc.

    3. Fabio Tamburini (Bologna):
    Open the file with GhostView, then choose menu EDIT, then "Text
    Extract..." and an ASCII text file will be produced...
    Pay attention to the formatting of the new file! ;-)
    I have GSview 3.3, but this feature should also be available in 3.6...
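
The warning about formatting can be made concrete. The following is only a minimal sketch, in Python, of the kind of tidying that extracted text often needs before it goes into a corpus; the function name `tidy_extracted_text` and the particular clean-up rules are illustrative assumptions, not something GSview does for you.

```python
import re


def tidy_extracted_text(raw):
    """Tidy text extracted from a PDF (hypothetical helper, not a GSview feature)."""
    # Normalise line endings and turn form feeds (page breaks) into blank lines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "\n\n")
    # Rejoin words hyphenated across a line break: "conver-\nsion" -> "conversion".
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)
    # Drop trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of spaces and tabs left over from column layout.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Keep paragraph breaks, but allow no more than one blank line in a row.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Note that the de-hyphenation rule also joins genuine compounds that happen to break at a line end ("well-" / "known" becomes "wellknown"), so the output still needs checking by eye.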

    4. Mike Scott (Liverpool):
    Adobe Acrobat (the full version, not just the Reader)
    will export to various formats, though I haven't checked
    them all yet.

    5. Chris Tribble (Sri Lanka):
    I do this with the full Acrobat - I use version 4. This has a text
    selection tool. Once you've clicked on this, you can use Ctrl+A to select
    all the text in the document, provided you've selected View, Continuous.
    This text can then be pasted into a Notepad or Word document.

    6. Acrobat has an export-to-PostScript option. You can then run a
    PostScript-to-text converter on the result.

    7. Everita Milconoka (Latvia):
    You may try sending your .pdf file to
    access-b@Adobe.COM
    with either pdf2txt or pdf2htm in the subject line;
    after a few minutes they will send the file back in .txt or .htm
    format.
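
For anyone who wants to script such a request, here is a minimal Python sketch that only builds the message: the address and the two subject keywords come from the reply above, while the function name `build_conversion_request`, the body text, and the sender address are illustrative assumptions. Whether the service still answers is not something this sketch can check.

```python
from email.message import EmailMessage

# Address and subject keywords as given in the reply above.
SERVICE_ADDRESS = "access-b@Adobe.COM"
MODES = ("pdf2txt", "pdf2htm")


def build_conversion_request(pdf_bytes, filename, sender, mode="pdf2txt"):
    """Build (but do not send) a conversion request; hypothetical helper."""
    if mode not in MODES:
        raise ValueError("mode must be one of %r" % (MODES,))
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVICE_ADDRESS
    msg["Subject"] = mode  # the subject line selects the output format
    msg.set_content("PDF attached for conversion.")
    # Attach the PDF itself as application/pdf.
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename=filename)
    return msg
```

The resulting message could be handed to `smtplib.SMTP(...).send_message(msg)` with whatever mail server you normally use; the server details are left out here because they depend on your own setup.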

    8. Steven Krauwer (Netherlands):
    Adobe offers on-line and email facilities for this
    at http://access.adobe.com:80/simple_form.html

    9. Philip Resnik (Maryland):
    The solution was at
     http://www.research.compaq.com/SRC/virtualpaper/pstotext.html --
    it seems to work very nicely for pdf2txt conversion at least
    in the Unix version.
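
For batch work, pstotext can be driven from a script. The sketch below is a minimal Python wrapper, assuming that a `pstotext` executable (and the Ghostscript it relies on) is installed and on the PATH, and that it writes the extracted text to standard output when given a file name; the function name `extract_text` and the Latin-1 decoding are assumptions, so check them against your own installation.

```python
import shutil
import subprocess


def extract_text(source, tool="pstotext"):
    """Run a text extractor on `source` and return its output as a string.

    Hypothetical wrapper: assumes the tool prints the text to standard output.
    """
    exe = shutil.which(tool)
    if exe is None:
        raise FileNotFoundError("%s not found on PATH" % tool)
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run([exe, source], capture_output=True, check=True)
    # Assumption: output is Latin-1; adjust if your files decode badly.
    return result.stdout.decode("latin-1")
```

Typical use would be `text = extract_text("paper.pdf")` inside a loop over the files to convert.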

    10. Simon G. J. Smith (Birmingham):
    MS Word -- www.adobe.com will do free conversions FROM Word (they get emailed
    back to you, and you can only do about 5 per email address), but I don't know
    about the other way round.
    To extract text from Acrobat (mine is 4.0), choose the text select tool
    (capital T with a little box). Then just cut and paste the text you want.
    This works one page at a time.
    From GSview (if it can read your particular PDF; sometimes it doesn't work
    for me), do the whole thing at once with Edit|Text Extract. It's in the
    GSview help.
    You can convert whole pages to bitmaps with GSview, and I think in Acrobat you
    can select graphics from the PDF file (the Acrobat help says use the graphics
    select tool, but I can't find this tool). The bitmap file can then be viewed
    from Word.
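
The same two jobs (text extraction and page bitmaps) can also be done without the GSview menus by calling Ghostscript directly. This is a sketch under assumptions: it relies on the `txtwrite` and `png16m` output devices found in recent Ghostscript releases, which is not what the GSview 3.x versions discussed here provide, and the function names and default file names are made up for illustration. On Windows the console executable is usually `gswin32c` or `gswin64c` rather than `gs`.

```python
import subprocess


def gs_text_command(pdf_path, txt_path, gs_exe="gs"):
    """Command line for plain-text extraction (assumes the txtwrite device)."""
    return [gs_exe, "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=txtwrite", "-sOutputFile=" + txt_path, pdf_path]


def gs_bitmap_command(pdf_path, pattern="page-%03d.png", dpi=150, gs_exe="gs"):
    """Command line for one PNG per page; %03d becomes the page number."""
    return [gs_exe, "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=png16m", "-r%d" % dpi,
            "-sOutputFile=" + pattern, pdf_path]


def run(command):
    """Run a Ghostscript command, raising CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

For example, `run(gs_text_command("paper.pdf", "paper.txt"))` would write the extracted text to `paper.txt`, provided Ghostscript can read the file at all.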

    14. Jerome Richalot (Lyon):
    Acrobat 5 apparently makes all the difference. You can
    download a plug-in called Access from adobe.com and add it to Acrobat to
    convert from PDF to RTF.


