Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Wed Aug 10 2005 - 04:02:38 MET DST

Next message: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Previous message: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Lou Burnard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Tom Emerson: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Martin Thomas: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Marco Baroni: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Lou Burnard wrote:
> The other tool for this purpose which no-one has (so far) mentioned is
> tidy -- http://tidy.sourceforge.net
>
> It will take almost any html and turn it into something usable very
> fast; it's also very robust and there is a choice of APIs for
> integrating it into your own production system

I think the original question was how to deal with the boilerplate text
that often appears at the top and bottom of html files, so it doesn't get
included in the text one extracts from a web page. (If that wasn't the
original question, it's mine :-).) By "boilerplate", I mean things like
copyright notices, "Enroll in our big extravaganza", "Download our super
font", menu items, and other such trash.

I dealt with that in some work I did by using regexes tailored to the sort
of trash that each web site used. But the regexes had to be written per
site, they were fragile when a site changed its boilerplate (as someone
else pointed out), and you could in fact run out of stack space in Python
(and presumably other interpreters), so you had to be careful how you
designed your regexes. All in all, not a very good solution.
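
For illustration only, a per-site regex approach of the kind described
above might look roughly like the sketch below. The site name, the
patterns, and the function name strip_boilerplate are all hypothetical;
real patterns would have to be written by hand after inspecting each
site. The patterns are kept line-bounded with negated character classes
rather than open-ended wildcards, which is one way to limit the work the
regex engine has to do.

```python
import re

# Hypothetical per-site patterns (assumptions, not taken from any real site).
SITE_PATTERNS = {
    "example-news.test": [
        re.compile(r"Copyright \d{4}[^\n]*"),
        re.compile(r"Download our super font[^\n]*"),
    ],
}

def strip_boilerplate(site, text):
    """Remove every match of the site's hand-written patterns from text."""
    for pattern in SITE_PATTERNS.get(site, []):
        text = pattern.sub("", text)
    return text

page = "Copyright 2005 Foo Inc.\nReal story here.\n"
cleaned = strip_boilerplate("example-news.test", page)
```

The fragility is visible here: if the site rewords its copyright line,
the first pattern silently stops matching.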

I should look back and see if I can just skip to the first <p> tag, but
again, I doubt whether that will work for all sites: some of them put the
main text into tables, IIRC.
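
A minimal sketch of the "skip to the first <p> tag" idea, using Python's
standard html.parser module. The class and function names are made up
for this example. It also shows the weakness mentioned above: a page
that keeps its main text in a table and has no <p> tag yields nothing.

```python
from html.parser import HTMLParser

class FirstParagraphOnward(HTMLParser):
    """Collect text data only after the first <p> start tag is seen."""

    def __init__(self):
        super().__init__()
        self.seen_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.seen_p = True

    def handle_data(self, data):
        # Ignore everything before the first <p>, and whitespace-only data.
        if self.seen_p and data.strip():
            self.chunks.append(data.strip())

def text_from_first_p(html):
    parser = FirstParagraphOnward()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Note that this drops only leading boilerplate; anything after the first
<p>, including a footer, is still kept.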

Possibly I could do some sort of language ID (since all of the texts I
wanted were non-English). But then again, some of the menu items were
non-English. Or given that this stuff is boilerplate, and tends to change
slowly at any one web site, maybe I could train a recognizer for the
boilerplate (as opposed to a recognizer for the text). Has anyone tried
that? (One piece that sometimes occurs inside the boilerplate, and which
changes rapidly, is the date. Again, I used a regex "solution".)
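
One very simple stand-in for a boilerplate recognizer (not a trained
model, and not something reported in this thread) is to treat any line
that recurs on most pages from the same site as boilerplate, after
replacing dates with a placeholder so that the changing date does not
hide the repetition. The date format, the 0.8 threshold, and all names
below are assumptions made for the sketch.

```python
import re
from collections import Counter

# Assumed date format such as "10 Aug 2005"; real sites would need their own.
DATE_RE = re.compile(r"\b\d{1,2} [A-Z][a-z]{2} \d{4}\b")

def normalize(line):
    """Strip whitespace and replace dates with a fixed placeholder."""
    return DATE_RE.sub("<DATE>", line.strip())

def learn_boilerplate(pages, min_fraction=0.8):
    """Return normalized lines that occur in at least min_fraction of pages."""
    counts = Counter()
    for page in pages:
        # Use a set so a line is counted once per page.
        counts.update({normalize(l) for l in page.splitlines() if l.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}

def remove_boilerplate(page, boilerplate):
    """Keep only non-empty lines whose normalized form is not boilerplate."""
    return "\n".join(
        l for l in page.splitlines()
        if l.strip() and normalize(l) not in boilerplate
    )

pages = [
    "Home | News\nUpdated 9 Aug 2005\nStory A",
    "Home | News\nUpdated 10 Aug 2005\nStory B",
    "Home | News\nUpdated 11 Aug 2005\nStory C",
]
boilerplate = learn_boilerplate(pages)
```

This works on already-extracted text lines and needs several pages per
site before it can learn anything.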

I haven't tried the "look for the place where you start to get a higher
text-to-tag ratio" method that was also mentioned.
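
One possible reading of that method, sketched per line of HTML source:
compute the number of text characters per tag on each line and report
the first line where the ratio reaches some threshold. The threshold of
20 and the function names are arbitrary assumptions, and the original
suggestion may well have meant something more refined.

```python
import re

# Crude tag matcher; good enough for a sketch, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratio(line):
    """Text characters on the line divided by the number of tags (at least 1)."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(len(tags), 1)

def first_dense_line(html, threshold=20.0):
    """Index of the first line whose ratio reaches threshold, or None."""
    for i, line in enumerate(html.splitlines()):
        if text_to_tag_ratio(line) >= threshold:
            return i
    return None

sample = (
    '<a href="/">Home</a> <a href="/n">News</a>\n'
    "<p>This is a long paragraph of running text in the article body.</p>\n"
)
start = first_dense_line(sample)
```

A per-line ratio depends on how the site breaks its HTML into lines, so
a windowed or per-block variant would probably be more robust.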

It looks to me like 'tidy' is intended to handle incorrectly structured
html. Can it be used to extract text, and in particular to throw away
header and footer boilerplate?

-- 
	Mike Maxwell
	Linguistic Data Consortium
	maxwell@ldc.upenn.edu


This archive was generated by hypermail 2b29 : Wed Aug 10 2005 - 05:43:30 MET DST