Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Paul Clough (p.d.clough@sheffield.ac.uk)
Date: Wed Aug 10 2005 - 10:55:55 MET DST

    Hi all,

    Another useful reference is the VIPS work from Microsoft:

    http://research.microsoft.com/research/pubs/view.aspx?tr_id=690

    They segment pages based on visual layout and seem to get good results.
    In my own work, I used lynx on UNIX with the -dump option, which seemed to
    work okay (quick and dirty, though):

    lynx -dump file.html > file.txt
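
    For more than a handful of files, the same command can be wrapped in a
    small script. The following is only a rough sketch, assuming lynx is
    installed and on the PATH; the function names, the default output name,
    and the optional use of -nolist (which drops the numbered link list that
    lynx appends to a dump) are illustrative choices, not part of the
    original workflow:

```python
import shutil
import subprocess
from pathlib import Path


def lynx_dump_command(html_path, no_link_list=False):
    """Build the argument list for `lynx -dump <file>`.

    no_link_list=True adds -nolist, which omits the numbered list of
    links that lynx otherwise appends to the dumped text.
    """
    cmd = ["lynx", "-dump"]
    if no_link_list:
        cmd.append("-nolist")
    cmd.append(str(html_path))
    return cmd


def dump_to_text(html_path, txt_path=None, no_link_list=False):
    """Run lynx on one HTML file and write the rendered text to txt_path.

    If txt_path is not given, the output goes next to the input with a
    .txt suffix (file.html -> file.txt). Returns the output path.
    """
    html_path = Path(html_path)
    txt_path = Path(txt_path) if txt_path else html_path.with_suffix(".txt")

    # Fail early with a clear message if lynx is not available.
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx not found on PATH")

    result = subprocess.run(
        lynx_dump_command(html_path, no_link_list),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if lynx exits non-zero
    )
    txt_path.write_text(result.stdout)
    return txt_path
```

    Calling dump_to_text("file.html") should behave like the one-liner above.
    Note that this still keeps navigation menus and other non-editorial text;
    it only strips the markup.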

    Cheers,

    Paul.

    -------------------------------------------
    Dr. Paul Clough
    Dept. Information Studies
    University of Sheffield

    +44 (0)114 2222664
    -------------------------------------------


