Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Vlado Keselj (vlado@cs.dal.ca)
Date: Wed Aug 10 2005 - 15:20:27 MET DST

    This is becoming a *really* long thread, but still I am tempted to add
    my $.02.

    I use a Perl script which grabs a web page, pre-processes it, reports
    the new pieces using the diff command, and then does some post-processing.
    The algorithm is as follows:
    1. get webpage (for this one can use wget, lynx, or some other way)
    2. pre-processing (usually one wants to remove tags, but not necessarily;
                   e.g. lynx -dump, Tidy, or clean_html)
    3. if there is a previous version of the page then
    4. | diff this version against the old one, capturing the new stuff
    5. save this page as the old version
    6. if there was a diff then the webpage is reduced to only the new stuff
    7. post-processing
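
    The steps above can be sketched roughly as follows. This is a minimal
    Python illustration, not the Perl script mentioned in this message. The
    names preprocess, added_lines and watch are hypothetical, step 1
    (fetching) and step 7 are left to the caller, and returning an empty
    list when nothing changed is an assumption, since the list above leaves
    that case open.

```python
import difflib
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def preprocess(html):
    """Step 2: drop tags and return the non-empty, stripped text lines."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return [line.strip() for line in text.splitlines() if line.strip()]


def added_lines(old, new):
    """Step 4: lines of `new` that a line-based diff marks as inserted or changed."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    fresh = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            fresh.extend(new[j1:j2])
    return fresh


def watch(html, state_file):
    """Steps 2-6: return the new lines, or the whole page on the first run."""
    state = Path(state_file)
    current = preprocess(html)
    result = current
    if state.exists():                                    # step 3
        previous = state.read_text(encoding="utf-8").splitlines()
        result = added_lines(previous, current)           # steps 4 and 6
        # Assumption: no diff means an empty result, i.e. nothing to report.
    state.write_text("\n".join(current), encoding="utf-8")  # step 5
    return result
```

    On the first run the whole cleaned page comes back; on later runs only
    the lines that the diff marks as new do.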

    Step 2 may become very interesting. Diff is very good, but still it
    depends on physical lines which are not always defined in an ideal way, so
    you may want to "reshape" them in step 2.
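
    One possible way to "reshape" lines, as a rough Python sketch: join the
    hard-wrapped text and emit one sentence per line, so that re-wrapping a
    paragraph does not show up as a change. The function name reshape and
    the sentence-boundary rule are assumptions for illustration only, and
    the rule is a crude heuristic that will mis-split abbreviations.

```python
import re


def reshape(text):
    """Join hard-wrapped lines, then return one sentence per list item."""
    # Collapse all runs of whitespace, including newlines, to single spaces.
    flat = " ".join(text.split())
    # Split after '.', '!' or '?' when followed by whitespace (rough heuristic).
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [part for part in parts if part]
```

    With this, the same paragraph wrapped at two different widths yields
    identical lines, so diff reports only sentences whose text changed.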

    If a page changes dramatically, one gets a burst of noise, but the
    "extractor" then self-stabilizes just wonderfully. I use it as a
    page-watch: I run it as a cron job and mail any diffs.

    If anybody is interested I can send/post my Perl script (after some
    clean-up).

    --Vlado


