Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Niels Ott (niels@drni.de)
Date: Fri Aug 19 2005 - 12:21:10 MET DST

    Alexander et al,

    this is a late reply... We are currently working on a project that aims
    to extract corpora from the web, and of course we came across this
    topic. Boilerplate removal is something we worked on recently.

    Alexander Schutz wrote:
    > The boilerplate removal tool worked quite well for me when I tested
    > it and I've heard some good things from other people about it, too.
    > check out this link and follow BTE
    > http://www.smi.ucd.ie/hyppia/

    This was the first approach we implemented/took over into our "toolbox".
    It turned out that the code is on the right track but has several
    problems.

    Apart from being slow, the algorithm misses boilerplate in the middle
    of a page.
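
To illustrate why boilerplate in the middle of a page slips through, here is a minimal sketch. It assumes that BTE keeps one contiguous token span, chosen so that tags fall mostly outside it and words mostly inside. The tokenizer regex, the function names and the linear scan are assumptions made for this illustration and are not taken from the original BTE code.

```python
import re

# Naive tokenizer (an assumption for this sketch): a token is either a
# complete tag or a run of non-space characters that is not a tag.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Return the token list and a parallel list of bits (1 = tag, 0 = word)."""
    tokens = TOKEN_RE.findall(html)
    bits = [1 if tok.startswith("<") else 0 for tok in tokens]
    return tokens, bits


def best_span(bits):
    """Return (i, j) of the contiguous span with the most words and fewest tags.

    Maximising "tags outside + words inside" is the same as maximising the
    sum of +1 per word and -1 per tag inside the span, so a single
    maximum-subarray scan is enough. Returns (0, -1) if no span scores > 0.
    """
    best_sum, best_i, best_j = 0, 0, -1
    cur_sum, cur_i = 0, 0
    for k, bit in enumerate(bits):
        if cur_sum <= 0:
            # A non-positive running sum cannot help: start a new span here.
            cur_sum, cur_i = 0, k
        cur_sum += 1 if bit == 0 else -1
        if cur_sum > best_sum:
            best_sum, best_i, best_j = cur_sum, cur_i, k
    return best_i, best_j


def extract(html):
    """Return the words of the single best span, joined by spaces."""
    tokens, bits = tokenize(html)
    i, j = best_span(bits)
    return " ".join(tok for tok in tokens[i:j + 1] if not tok.startswith("<"))


page = (
    "<html><body><div><a>nav</a></div>"
    "<p>one two three four</p><a>ad</a><p>five six seven eight</p>"
    "<div><a>foot</a></div></body></html>"
)
print(extract(page))
```

On this toy page the navigation and footer are dropped, but the "ad" link between the two paragraphs is kept, because only one span is ever selected.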

    Additionally, the original tag recognition code does not find all tags.
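
As a rough sketch of how tag recognition by regular expression can go wrong, the example below compares a naive pattern with the tokenizer in Python's standard `html.parser` module. The pattern `NAIVE_TAG_RE` is a hypothetical stand-in and not the expression used in BTE; the class and function names are likewise made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical naive pattern: everything from "<" to the next ">".
NAIVE_TAG_RE = re.compile(r"<[^>]*>")


class TagCollector(HTMLParser):
    """Collect tag names and text words in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        self.words.extend(data.split())


def collect(html):
    """Parse html and return (tags, words) as seen by html.parser."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.words


# A ">" inside a quoted attribute value ends the naive match too early.
sample = '<p><a title="a > b">link</a> text</p>'

print(NAIVE_TAG_RE.sub("", sample))  # attribute debris leaks into the text
print(collect(sample))               # the parser keeps the attribute inside the tag
```

A real parser will not catch everything either, so the output still needs checking, but it removes one whole class of hand-written pattern errors.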

    If you want to use BTE, you should be comfortable enough with programming
    to repair/modify the regular expressions involved. You should also check
    your output over and over again. HTML writers tend to produce things you
    won't dream of in your worst nightmares. ;-)

    Greetings from Tübingen/Germany,

      Niels

    -- 
    http://www.drni.de/niels/
    


