Re: [Corpora-List] pdfs/ OCR question

From: John F. Sowa (sowa@bestweb.net)
Date: Tue Dec 12 2006 - 04:31:25 MET


    That depends on how the PDF was created:

    > interesting to know that pdf files store text info separately!

    Some PDF files are generated by scanning each page of a book or
    article into an image format (GIF or TIFF, for example). Such a
    PDF file contains no character strings internally, and some kind
    of OCR is necessary to convert the images into character strings.
    The OCR process can make mistakes: it might convert an image of
    "the" into the character string "die".

    But if the PDF file was generated from text in some textual
    format, such as HTML, LaTeX, TXT, ODT, or DOC, the PDF file
    preserves the original text strings internally. If you copy and
    paste text from a PDF of that kind into an editor for some other
    kind of text, such as OpenOffice or MS Word, you will get a copy
    of the original character string, but some or all of the
    formatting info may be lost. That process would never convert
    "the" into "die".
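
    As a rough illustration of the distinction above, here is a
    minimal Python sketch that guesses which kind of PDF you have.
    It rests on an assumption that is not part of the original
    discussion: that a text-based PDF names a /Font resource in its
    raw bytes, while an image-only scan does not. The function name
    and the two byte strings are hypothetical, and the byte strings
    are fragments, not valid PDF files. The check can fail in both
    directions: newer PDFs may compress the objects that mention
    fonts, and a scan that already carries an OCR text layer will
    contain fonts.

```python
def looks_text_based(pdf_bytes: bytes) -> bool:
    """Crude heuristic: True if the raw bytes mention a /Font resource.

    Assumption: text-based PDFs declare fonts in plain sight, and
    image-only scans declare none. Compressed object streams can hide
    font dictionaries, so False does not prove the file is a scan.
    """
    return b"/Font" in pdf_bytes


# Hypothetical fragments for illustration only (not complete PDF files).
scanned_like = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /XObject /Subtype /Image >>\nendobj\n"
)
text_like = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\nendobj\n"
)

if __name__ == "__main__":
    print(looks_text_based(scanned_like))
    print(looks_text_based(text_like))
```

    For a real file, the bytes could be read with
    open(path, "rb").read(), where path is whatever file you want
    to inspect. A False result only suggests that OCR may be needed.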

    There are some caveats, however. Some PDF files may have
    special characters for ligatures, such as fi, fl, ff, etc.
    Even though the ligatures are represented in character strings,
    a copy & paste from such files to another editor may convert
    the ligature to an unrecognized character. (Some OCR systems
    also have difficulty with ligatures because the letters "f"
    and "i" or "l" are too close together for easy recognition.)
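
    If the pasted text does arrive with the ligatures intact as
    single Unicode characters, they can be expanded back into plain
    letters. The following Python sketch assumes exactly that case:
    the ligatures are the Unicode presentation forms U+FB00 to
    U+FB04. The function name is hypothetical, and the sketch does
    nothing for text in which the ligature has already been replaced
    by an unrecognized character.

```python
# Map each Unicode Latin ligature code point to its plain-letter spelling.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.translate expects a table keyed by code point, built by str.maketrans.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature characters with the letters they stand for."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    pasted = "the \ufb01nal e\ufb00ort"
    print(expand_ligatures(pasted))
```

    An explicit table is used here, not a general Unicode
    normalization, so that no other characters in the text are
    changed.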

    John Sowa


