Hi all,
Another useful reference is the VIPS work from Microsoft:
http://research.microsoft.com/research/pubs/view.aspx?tr_id=690
They segment pages based on visual layout and seem to get good
results. In my own work, I used lynx on UNIX with the -dump option, which
seemed to work okay (quick and dirty though):
lynx -dump file.html > file.txt
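
For a whole directory of pages, the same command can be wrapped in a small loop. This is only a minimal sketch, assuming lynx is on the PATH; the function name dump_html_dir and the directory names in the usage line are hypothetical, not part of lynx.

```shell
#!/bin/sh
# Convert every .html file in a directory to plain text with "lynx -dump".
# dump_html_dir is a hypothetical helper name used for illustration only.
dump_html_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for html in "$src_dir"/*.html; do
        # Skip the unexpanded pattern when the directory has no .html files.
        [ -e "$html" ] || continue
        base=$(basename "$html" .html)
        # Same call as above: render the page and write the text to a file.
        lynx -dump "$html" > "$out_dir/$base.txt"
    done
}
```

Usage would be something like: dump_html_dir ./pages ./text
By default the dump ends with a numbered list of the links found in the page; adding -nolist to the lynx call should suppress that if only the body text is wanted.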
Cheers,
Paul.
-------------------------------------------
Dr. Paul Clough
Dept. Information Studies
University of Sheffield
+44 (0)114 2222664
-------------------------------------------