Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Date: Tue Aug 09 2005 - 20:09:46 MET DST

Next message: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Previous message: Rob Malouf: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Rob Malouf: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Mike Maxwell: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

The other tool for this purpose which no-one has (so far) mentioned is
tidy -- http://tidy.sourceforge.net

It will take almost any HTML and turn it into something usable very
fast; it's also very robust, and there is a choice of APIs for
integrating it into your own production system.
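
As a rough illustration of the clean-up step described above, the sketch below pipes a page through the tidy command-line program from Python. It assumes a `tidy` executable is installed and on the PATH; the helper name `tidy_to_xhtml` is hypothetical and is not part of tidy or any of its APIs. If tidy is not found, the helper simply returns None.

```python
import shutil
import subprocess


def tidy_to_xhtml(html, tidy_cmd="tidy"):
    """Return tidy's XHTML rendering of `html`, or None if tidy is unavailable.

    `tidy_to_xhtml` is a hypothetical helper written for this sketch.
    """
    exe = shutil.which(tidy_cmd)
    if exe is None:
        # Assumption not met: no tidy executable on the PATH.
        return None

    proc = subprocess.run(
        [
            exe,
            "-q",                      # suppress non-essential output
            "-asxhtml",                # convert the input to XHTML
            "-utf8",                   # read and write UTF-8
            "--show-warnings", "no",   # do not report warnings
            "--force-output", "yes",   # emit output even if errors were found
        ],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy conventionally exits with 0 (clean), 1 (warnings) or 2 (errors),
    # so the exit status alone is not treated as a failure here.
    cleaned = proc.stdout.decode("utf-8", errors="replace")
    return cleaned or None


if __name__ == "__main__":
    print(tidy_to_xhtml("<p>Unclosed <b>markup"))
```

Calling the command-line tool is only one way to do this; the APIs mentioned above would avoid starting a new process for every page.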

Lou

On 9 Aug 2005, at 18:43, Rob Malouf wrote:

> Hi,
>
> For this task I use Python and BeautifulSoup:
>
> http://www.crummy.com/software/BeautifulSoup/
>
> It's an extremely flexible and robust DOM-ish parser, very well-suited
> for extracting bits of text out of web pages.
>
> --
> Rob Malouf <rmalouf@mail.sdsu.edu>
> Department of Linguistics and Oriental Languages
> San Diego State University
>
>
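
To make the idea in the quoted message concrete, here is a minimal sketch of pulling paragraph text out of a page while ignoring navigation and scripts. It deliberately uses Python's standard-library `html.parser` as a stand-in, not BeautifulSoup, so it runs without extra installs; the class and function names are hypothetical, and the assumption that editorial content lives in `<p>` elements will not hold for every site.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of <p> elements, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = None   # text pieces of the paragraph being read, if any
        self._skip = 0     # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and not self._skip:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Join the pieces and normalise runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None

    def close(self):
        super().close()
        self._flush()      # keep a final paragraph that was never closed


def extract_paragraphs(html):
    """Return the paragraph texts found in `html`, in document order."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    page = '<div id="nav"><a href="/">Home</a></div><p>First <b>story</b> paragraph.</p>'
    print(extract_paragraphs(page))
```

A DOM-style parser such as the one recommended above makes this kind of selection shorter to write and copes better with badly nested markup; the event-based version here only shows the shape of the task.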
From the Macmini at Burnard Towers


This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 22:16:50 MET DST