Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Rob Malouf (rmalouf@mail.sdsu.edu)
Date: Tue Aug 09 2005 - 19:43:06 MET DST

    Hi,

    For this task I use Python and BeautifulSoup:

    http://www.crummy.com/software/BeautifulSoup/

    It's an extremely flexible and robust DOM-ish parser, very well suited
    to extracting bits of text out of web pages.
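
    As a rough illustration only, here is a minimal sketch of the idea.
    It assumes the current bs4 package (Beautiful Soup 4) rather than the
    2005 release, and it assumes the editorial text sits in <p> elements
    outside of navigation, header, and footer blocks. The tag list, the
    function name, and the sample page are all made up for the example;
    real sites will need their own rules.

```python
from bs4 import BeautifulSoup

# Assumption: these tags hold non-editorial material. Adjust per site.
NON_EDITORIAL_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]


def extract_editorial_text(html):
    """Return the text of the <p> elements left after dropping boilerplate tags."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")

    # Remove whole subtrees that are assumed to be page furniture.
    for tag in soup.find_all(NON_EDITORIAL_TAGS):
        tag.decompose()

    # Collect paragraph text, collapsing runs of whitespace inside each paragraph.
    paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
    return "\n\n".join(text for text in paragraphs if text)


# Hypothetical page used only to exercise the function.
SAMPLE_HTML = """
<html><head><title>Example</title><style>p { color: black; }</style></head>
<body>
  <nav><p>Home | About | Contact</p></nav>
  <div id="story">
    <p>First paragraph of the <b>article</b>.</p>
    <p>Second paragraph.</p>
  </div>
  <footer><p>Copyright notice</p></footer>
  <script>var tracking = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_editorial_text(SAMPLE_HTML))
```

    Stripping by tag name is the crudest possible heuristic. Pages that
    mark up their menus with plain <div> elements would need rules based
    on class or id attributes instead.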

    -- 
    Rob Malouf <rmalouf@mail.sdsu.edu>
    Department of Linguistics and Oriental Languages
    San Diego State University
    


