Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Date: Tue Aug 09 2005 - 20:09:46 MET DST


    The other tool for this purpose which no-one has (so far) mentioned is
    tidy -- http://tidy.sourceforge.net

    It will take almost any HTML and turn it into something usable very
    fast; it's also very robust, and there is a choice of APIs for
    integrating it into your own production system.
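
    As a rough illustration of the simplest kind of integration, the
    sketch below pipes a page through the tidy command-line program from
    Python rather than using one of the library APIs. It assumes a tidy
    executable is on the PATH; the helper names tidy_command and
    tidy_html are made up for this example, and the option set is only
    one plausible choice.

```python
import shutil
import subprocess


def tidy_command(executable="tidy"):
    """Build an argument list for a quiet HTML -> XHTML conversion.

    The helper name and the chosen options are illustrative assumptions.
    """
    return [
        executable,
        "-q",                      # suppress non-essential output
        "-asxhtml",                # convert the input to well-formed XHTML
        "-utf8",                   # treat input and output as UTF-8
        "--force-output", "yes",   # still write output if errors are found
    ]


def tidy_html(html, executable="tidy"):
    """Return the cleaned markup, or None if tidy is not installed."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 for warnings and 2 for errors, so a non-zero
    # status does not by itself mean that no output was produced.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    cleaned = tidy_html("<html><body><p>Unclosed paragraph<b>bold</body>")
    print(cleaned if cleaned is not None else "tidy is not installed")
```

    The cleaned XHTML can then be handed to any XML-aware tool to pull
    out the text of interest.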

    Lou

    On 9 Aug 2005, at 18:43, Rob Malouf wrote:

    > Hi,
    >
    > For this task I use Python and BeautifulSoup:
    >
    > http://www.crummy.com/software/BeautifulSoup/
    >
    > It's an extremely flexible and robust DOM-ish parser, very well-suited
    > for extracting bits of text out of web pages.
    >
    > --
    > Rob Malouf <rmalouf@mail.sdsu.edu>
    > Department of Linguistics and Oriental Languages
    > San Diego State University
    >
     From the Macmini at Burnard Towers



    This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 22:16:50 MET DST