Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Paul Clough (p.d.clough@sheffield.ac.uk)
Date: Wed Aug 10 2005 - 10:55:55 MET DST

In reply to: Martin Thomas: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Hi all,

Another useful reference is the VIPS (Vision-based Page Segmentation) work from Microsoft:

http://research.microsoft.com/research/pubs/view.aspx?tr_id=690

They segment pages based on visual layout and seem to get good
results. In my own work, I used lynx on UNIX with the -dump option, which
seemed to work okay (quick and dirty, though):

lynx -dump file.html > file.txt
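
Where lynx is not available, a similar quick-and-dirty dump can be sketched
in Python using only the standard library. This is a rough approximation
and not what lynx does internally: the names TextExtractor and html_to_text
and the list of block-level tags are assumptions made for illustration. Like
lynx -dump, it keeps all visible text, so navigation and other non-editorial
material still comes through.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    raw = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(html_to_text(f.read()))
```

Saved under a name of your choosing (say html2text.py, hypothetical), it
would be run much like the lynx command: python html2text.py file.html > file.txt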

Cheers,

Paul.

-------------------------------------------
Dr. Paul Clough
Dept. Information Studies
University of Sheffield

+44 (0)114 2222664
-------------------------------------------

