RE: [Corpora-List] Extracting only editorial content from a HTML page

From: peetm (peet.morris@comlab.ox.ac.uk)
Date: Tue Aug 09 2005 - 12:48:06 MET DST

  • Next message: Hal Daume III: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

    I started looking at this problem a couple of years ago (I've since changed
    tack, so I'm no longer working on it).

    However, the approach I used was roughly as follows.

    I first used regular expressions, but soon gave up on them - it's amazing
    how well [some] browsers cope with badly formatted HTML (which can throw
    your regexps off).

    So, in the end, I used an object model to load and walk the page (I used
    Microsoft's implementation of the DOM
    (http://www.webreference.com/js/column40/)) - essentially, any webpage is
    parsed and loaded into this, and then represented by a number of software
    objects that one can walk, manipulate, etc. The main advantage of this
    approach is that the DOM essentially reformats the source HTML so that it is
    consistent (adding elements as needed to make it 'good').

    For example, if the source contained this

    1. <p><b>this is some text</p></b>

    Or this

    2. <p><b>this is some text

    Or this

    3. <p><b>this is some text</b>

    The object model I used 'rendered' it as

    <p>
            <b>
                    This is some text
            </b>
    </p>

    So, it 'fixed' the bad tag ordering in '1', added the </b></p> in '2', and
    the </p> in '3' - very clever parsing! BTW, some DOMs do this better than
    others, of course - it's one of the reasons why some browsers display
    certain pages better than others do: does their DOM 'fix' the HTML?
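I used Microsoft's DOM, but the repair idea itself is easy to illustrate. The following is my own minimal sketch (not the original code), using Python's standard-library html.parser: it tracks the stack of open tags, closes any tags opened inside an element when that element ends (fixing '1'), drops stray end tags, and closes whatever is still open at end of input (fixing '2' and '3'):

```python
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Rebuilds well-formed markup from sloppy HTML by tracking open tags."""
    def __init__(self):
        super().__init__()
        self.out = []    # well-formed output fragments
        self.stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # close anything opened inside the tag being closed
            # (repairs bad ordering like <p><b>...</p></b>)
            while self.stack[-1] != tag:
                self.out.append(f"</{self.stack.pop()}>")
            self.out.append(f"</{self.stack.pop()}>")
        # stray end tags with no matching start tag are simply dropped

    def handle_data(self, data):
        if data.strip():
            self.out.append(data.strip())

    def close(self):
        super().close()
        # add any missing end tags at end of input
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")

def normalize(html):
    n = Normalizer()
    n.feed(html)
    n.close()
    return "".join(n.out)

for src in ("<p><b>this is some text</p></b>",   # bad ordering
            "<p><b>this is some text",           # no end tags
            "<p><b>this is some text</b>"):      # missing </p>
    print(normalize(src))
```

All three inputs come out as the same well-formed fragment, which is the behaviour the DOM gave me for free.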

    The object model also allows one to easily ignore tags (tags are simply
    node types in the model) - or to select just (say) the paragraph sections
    of a page.

    I did the latter, and then threw out any paragraph that contained only a
    single sentence, or other junk (like images).
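A rough sketch of that selection-and-filtering step, again in stdlib Python rather than the DOM I actually used (the two-sentence threshold and the image check are just stand-ins for my original heuristics):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text of each <p>, noting whether it contained an <img>."""
    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of (text, contains_image)
        self.in_p = False
        self.buf = []
        self.has_img = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()     # tolerate <p> blocks with no closing </p>
            self.in_p = True
        elif tag == "img" and self.in_p:
            self.has_img = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)

    def _flush(self):
        if self.in_p:
            text = " ".join("".join(self.buf).split())
            if text:
                self.paragraphs.append((text, self.has_img))
        self.in_p = False
        self.buf = []
        self.has_img = False

    def close(self):
        super().close()
        self._flush()

def editorial_paragraphs(html, min_sentences=2):
    """Keep paragraphs that look editorial: no images, enough sentences."""
    p = ParagraphExtractor()
    p.feed(html)
    p.close()
    keep = []
    for text, has_img in p.paragraphs:
        # crude sentence count: split on terminal punctuation
        sentences = [s for s in
                     text.replace("!", ".").replace("?", ".").split(".")
                     if s.strip()]
        if not has_img and len(sentences) >= min_sentences:
            keep.append(text)
    return keep

doc = ("<p>Home | News | Sport</p>"
       "<p>First sentence of the article. Second sentence of the article.</p>"
       "<p>A caption. With an image. <img src='x.png'></p>")
print(editorial_paragraphs(doc))
```

Menu-like one-liners and image captions get dropped; the multi-sentence article text survives.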

    It worked pretty well, although it was a little slower than it might have
    been using regexps.

    peetm

    email: peet.morris@clg.ox.ac.uk

    addr: Computational Linguistics Group
          University of Oxford
          The Clarendon Institute
          Walton Street
          Oxford
          OX1 2HG

    =======================================

    Important: This email is intended for the use of the individual addressee(s)
    named above and may contain information that is confidential, privileged or
    unsuitable for overly sensitive persons with low self-esteem, no sense of
    humour or irrational religious beliefs.
    If you are not the intended recipient, then social etiquette demands that
    you fully appropriate the message without trace of the former sender and
    triumphantly claim it as your own. Leaving a former sender's signature on a
    "forwarded" email is very bad form and, while being only a technical breach
    of the Olympic ideal, does in fact constitute an irritating social faux pas.
    Further, sending this email to a colleague does not appear to breach the
    provisions of the Copyright Amendment (Digital Agenda) Act 2000 of the
    Commonwealth, because chances are none of the thoughts contained in this
    email are in any sense original...
    Finally, if you have received this email in error, shred it immediately,
    then add it to some nutmeg, egg whites and caster sugar. Whisk until stiff
    peaks form, then place it in a warm oven for 40 minutes. Remove promptly and
    let it stand for 2 hours before adding the decorative kiwi fruit and cream.
    Then notify me immediately by return email and eat the original message.

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Helge Thomas Hellerud
    Sent: 09 August 2005 10:43
    To: corpora@uib.no
    Subject: [Corpora-List] Extracting only editorial content from a HTML page

    Hello,

    I want to extract the article text of an HTML page (for instance, the text
    of a news article). But an HTML page contains a lot of "noise", like menus
    and ads. So I want to ask if anyone knows a way to eliminate unwanted
    elements like menus and ads, and extract only the editorial article text?

    Of course, I can use a regex to look for patterns in the HTML code (by
    defining a starting point and an ending point), but that solution would be
    a hack that breaks as soon as the pattern in the HTML page changes. So do
    you know how to extract the content without resorting to such a hack?

    Thanks in advance.

    Helge Thomas Hellerud



    This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 12:53:13 MET DST