Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Rob Malouf (rmalouf@mail.sdsu.edu)
Date: Tue Aug 09 2005 - 19:43:06 MET DST

Next message: Lou Burnard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Previous message: Ken Litkowski: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Helge Thomas Hellerud: "[Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Lou Burnard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Lou Burnard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Hi,

For this task I use Python and BeautifulSoup.

It's an extremely flexible and robust DOM-ish parser, very well-suited
for extracting bits of text out of web pages.
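
As a rough illustration of what that looks like, here is a minimal sketch. It assumes the current `bs4` package rather than whichever release was available when this was posted, and the sample page, the `content` id, and the `extract_editorial_text` helper are all made up for the example. Real pages will need their own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the layout and the "content" id are assumptions
# for illustration, not taken from any real site.
SAMPLE_HTML = """
<html><head><title>Example</title><script>var x = 1;</script></head>
<body>
<div id="nav"><a href="/">Home</a> | <a href="/news">News</a></div>
<div id="content">
<h1>Headline</h1>
<p>First paragraph of the article.</p>
<p>Second paragraph, with a <a href="/x">link</a> inside.</p>
</div>
<div id="footer">Copyright notice</div>
</body></html>
"""


def extract_editorial_text(html, container_id="content"):
    """Return the paragraph texts found inside one container element.

    Hypothetical helper: it assumes the editorial content sits in a single
    element whose id is `container_id`.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements so their contents never leak into text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    container = soup.find(id=container_id)
    if container is None:
        return []

    # Join the text fragments of each paragraph with single spaces.
    return [p.get_text(" ", strip=True) for p in container.find_all("p")]


if __name__ == "__main__":
    for paragraph in extract_editorial_text(SAMPLE_HTML):
        print(paragraph)
```

The navigation and footer blocks are skipped only because they sit outside the chosen container, so the selector is the part that has to be adapted per site.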

-- 
Rob Malouf <rmalouf@mail.sdsu.edu>
Department of Linguistics and Oriental Languages
San Diego State University

This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 20:02:13 MET DST