Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Vlado Keselj (vlado@cs.dal.ca)
Date: Wed Aug 10 2005 - 15:20:27 MET DST

Next message: Valia Kordoni: "[Corpora-List] European Masters Program in Language and Communication Technologies (LCT)"

Previous message: Marco Baroni: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Marco Baroni: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Vlado Keselj: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Martin Thomas: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Vlado Keselj: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

This is becoming a *really* long thread, but still I am tempted to add
my $.02.

I use a Perl script that grabs a web page, does some pre-processing,
reports the new pieces using the diff command, and then does some
post-processing.
The algorithm is as follows:
1. get webpage (for this one can use wget, lynx, or some other way)
2. pre-processing (usually one wants to remove tags, but not necessarily;
e.g. lynx -dump, Tidy, or clean_html)
3. if there is a previous version of the page then
4. | diff this page against the old one, capturing the new stuff
5. save this page as the old version
6. if there was a diff then the webpage is only the new stuff
7. post-processing
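
For readers who want something concrete, steps 3-6 above might be sketched
roughly as follows. This is not the Perl script mentioned in this message; it
is a minimal Python illustration using the standard difflib module in place
of the diff command. The function names (added_lines, watch) and the stored
file path are hypothetical, and step 6 is read here as "whenever a previous
version exists, only the new lines are kept", which is an assumption.
Fetching (step 1), pre-processing (step 2), and post-processing (step 7) are
left to the caller.

```python
import difflib
from pathlib import Path


def added_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines of new_text that were inserted or changed."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute lines on the new side.
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


def watch(page_text: str, old_path: Path) -> str:
    """Steps 3-6: diff against the saved version, then save this one."""
    had_old = old_path.exists()
    if had_old:
        # Steps 3-4: compare with the previous version, keep new material.
        added = added_lines(old_path.read_text(encoding="utf-8"), page_text)
    # Step 5: the current page becomes the "old" version for the next run.
    old_path.write_text(page_text, encoding="utf-8")
    # Step 6 (assumed reading): on the first run the whole page is returned;
    # afterwards only the new lines are, possibly none at all.
    return "\n".join(added) if had_old else page_text
```

On the first call for a given path the whole page comes back; on later calls
only the lines that diff would mark as added or changed are returned, so
unchanged navigation and footers drop out.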

Step 2 may become very interesting. Diff is very good, but still it
depends on physical lines which are not always defined in an ideal way, so
you may want to "reshape" them in step 2.
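
One possible way to "reshape" lines, given only as an illustration: collapse
all whitespace and then put each sentence on its own line, so that a re-wrapped
paragraph does not show up as changed. The reshape function below is a
hypothetical helper, and its sentence splitter is deliberately naive (it
breaks after ".", "!", or "?" followed by whitespace, so abbreviations will be
split incorrectly).

```python
import re


def reshape(text: str) -> list[str]:
    """Collapse whitespace, then return one sentence per line."""
    # Join everything into a single line with single spaces.
    flat = " ".join(text.split())
    # Naive sentence boundary: whitespace preceded by ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]
```

Feeding "\n".join(reshape(text)) to diff makes the comparison depend on
sentences rather than on where the page happened to break its lines.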

If a page changes dramatically, one gets a burst of noise, but the
"extractor" self-stabilizes just wonderfully. I use it as a page-watch:
I run it as a cron job and mail any diffs.
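
The "mail any diffs" part could look something like the sketch below, again
in Python rather than Perl and purely as an assumption about one way to do
it. The names build_report and send_report, the subject line, and the use of
an SMTP server on localhost are all hypothetical choices for illustration.

```python
import smtplib
from email.message import EmailMessage
from typing import Optional


def build_report(url: str, new_text: str,
                 sender: str, recipient: str) -> Optional[EmailMessage]:
    """Build a mail with the new material, or None if nothing changed."""
    if not new_text.strip():
        return None  # no diff, so no mail
    msg = EmailMessage()
    msg["Subject"] = f"page-watch: new content at {url}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(new_text)
    return msg


def send_report(msg: EmailMessage, host: str = "localhost") -> None:
    """Hand the message to an SMTP server (assumed to run on `host`)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

A cron entry would then run the watcher script at whatever interval suits the
page, and send_report would only be called when build_report returns a
message.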

If anybody is interested I can send/post my Perl script (after some
clean-up).

--Vlado


This archive was generated by hypermail 2b29 : Wed Aug 10 2005 - 15:59:43 MET DST