Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Niels Ott (niels@drni.de)
Date: Fri Aug 19 2005 - 12:21:10 MET DST

Next message: n.chipere@reading.ac.uk: "Re: [Corpora-List] grapheme-to-phoneme mapping"

Previous message: Simon King: "Re: [Corpora-List] grapheme-to-phoneme mapping"
In reply to: Alexander Schutz: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Min-Yen Kan: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Alexander et al,

this is a late reply... We are currently working on a project that aims
to extract corpora from the web, and of course we came across this
topic. Boilerplate removal is something we worked on recently.

Alexander Schutz wrote:
> The boilerplate removal tool worked quite well for me when I tested
> it and I've heard some good things from other people about it, too.
> check out this link and follow BTE
> http://www.smi.ucd.ie/hyppia/

This was the first approach we implemented/took over into our "toolbox".
It turned out that the BTE code is on the right track but has several
problems.

Apart from being slow, the algorithm misses boilerplate in the middle
of a page.
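
To illustrate why boilerplate in the middle survives, here is a minimal
sketch in Python. It is not the BTE code itself. It assumes the idea as I
understand it from the BTE description: pick the single contiguous token
span that keeps the most words while leaving the most tags outside. The
names tokenize, best_span and extract are made up for this example, and
the tokenizer regex is deliberately simplistic.

```python
import re

# Hypothetical, deliberately simple tokenizer: a token is either a tag or a word.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Return a list of (token, is_tag) pairs."""
    return [(t, t.startswith("<")) for t in TOKEN_RE.findall(html)]


def best_span(tokens):
    """Return (i, j) of the contiguous span with the best word/tag balance.

    Maximising "tags outside the span + words inside the span" is the same
    as maximising the sum of +1 per word and -1 per tag inside the span,
    so a single linear pass (maximum-sum subarray) is enough.
    """
    best_sum, best = 0, (0, -1)
    cur_sum, cur_start = 0, 0
    for k, (_, is_tag) in enumerate(tokens):
        if cur_sum <= 0:
            # A span with a non-positive balance is never worth extending.
            cur_sum, cur_start = 0, k
        cur_sum += -1 if is_tag else 1
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, k)
    return best


def extract(html):
    """Return the words inside the best span, joined by single spaces."""
    tokens = tokenize(html)
    i, j = best_span(tokens)
    return " ".join(t for t, is_tag in tokens[i:j + 1] if not is_tag)


page = (
    "<html><body><div><a>Home</a><a>News</a></div>"
    "<p>one two three four five six</p>"
    "<div><a>Ad</a></div>"
    "<p>seven eight nine ten eleven twelve</p>"
    "<div><a>Imprint</a></div></body></html>"
)
print(extract(page))
```

On this toy page the navigation at the top and the imprint at the bottom
are cut off, but the "Ad" link between the two paragraphs stays in the
output, because only one contiguous span is selected.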

Additionally, the original tag recognition code does not find all tags.
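
A small Python sketch of how regex-based tag recognition can go wrong.
The pattern below is a stand-in I made up for illustration, not the
expression BTE actually uses. It is compared with the tokenizer from
Python's standard html.parser module; TagCollector, naive_tags and
parser_tags are hypothetical helper names.

```python
import re
from html.parser import HTMLParser

# Stand-in for a simple tag regex: "<", anything but ">", then ">".
NAIVE_TAG_RE = re.compile(r"<[^>]+>")


class TagCollector(HTMLParser):
    """Collect the names of start and end tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def naive_tags(html):
    """Return the raw substrings the simple regex takes for tags."""
    return NAIVE_TAG_RE.findall(html)


def parser_tags(html):
    """Return the tag names found by html.parser."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# A ">" inside a quoted attribute value is enough to confuse the regex.
sample = '<a title="a > b" href="x">link</a>'
print(naive_tags(sample))
print(parser_tags(sample))
```

The regex ends the opening tag at the ">" inside the title attribute, so
the rest of the tag leaks into the text. The parser reads the whole
opening tag as one unit.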

If you want to use BTE, you should be into programming to an extent that
allows you to repair/modify the regular expressions involved. You
should also check your output over and over again. HTML writers tend to
produce things you won't dream of in your worst nightmares. ;-)

Greetingens from Tübingen/Germany,

Niels

-- 
http://www.drni.de/niels/


This archive was generated by hypermail 2b29 : Fri Aug 19 2005 - 12:38:52 MET DST