[Corpora-List] SUMMARY: Extracting only editorial content from a HTML page

From: Helge Thomas Hellerud (helgetho@stud.ntnu.no)
Date: Wed Aug 10 2005 - 22:38:28 MET DST


    Hello,

    Thanks to everyone who answered my question. The response has been
    enormous. Some answers relate to my description of using regexes to look
    for patterns, followed by HTML cleaning. Here is a summary of the other
    approaches (some replies were also sent directly to the list):

    - Aidan Finn's BTE module: http://www.smi.ucd.ie/hyppia/.

    - A Java-based sample that needs to be modified:
    http://javaalmanac.com/egs/javax.swing.text.html/GetText.html

    - An object model to load/walk the page (I used Microsoft's implementation
    of the DOM (http://www.webreference.com/js/column40/)) - essentially, any
    webpage is parsed and loaded into this and then represented by a number of
    software objects that one can walk, manipulate, etc. The main advantage
    of this approach is that the DOM essentially reformats the source HTML so
    that it is consistent (adding elements as needed to make it well-formed).
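
    For illustration, a minimal Python sketch of the same load-and-walk idea,
    using lxml as a stand-in for the DOM implementation mentioned above (the
    tag list is purely illustrative and would be tuned per site):

        # Sketch of the "load and walk an object model" idea, using lxml
        # as a stand-in for the DOM implementation described above.
        import lxml.html

        def editorial_text(html):
            # lxml's HTML parser repairs inconsistent markup while building
            # the tree, much like the DOM behaviour described above.
            doc = lxml.html.fromstring(html)
            # Walk the tree and drop subtrees that are clearly not editorial.
            for el in list(doc.iter()):
                if el.tag in ("script", "style", "header", "footer", "nav"):
                    el.drop_tree()
            return doc.text_content()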

    - If you have access to several articles from the same source, you can
    delete everything that is identical (or very similar) across articles,
    working from the top and from the bottom.
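
    A rough sketch of this head/tail trimming, assuming each page has already
    been reduced to a list of text lines (matching "very similar" rather than
    identical lines would need a fuzzier comparison):

        # Sketch: trim the head and tail material shared by every article
        # from the same source (assumes each page is already a list of
        # text lines; "very similar" matching would need a fuzzier test).
        def trim_shared_head_and_tail(pages):
            def shared_prefix_len(seqs):
                n = 0
                for items in zip(*seqs):
                    if len(set(items)) > 1:
                        break
                    n += 1
                return n

            head = shared_prefix_len(pages)
            tail = shared_prefix_len([list(reversed(p)) for p in pages])
            return [p[head:len(p) - tail] for p in pages]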

    - I have used UNIX lynx (with the -dump option) to extract plain text from
    HTML pages, which gets rid of most of the unwanted text you mentioned. I
    have also been looking at some research from Microsoft that uses the DOM
    to segment web pages based on their visual appearance. They are able to
    spot regular patterns on web pages such as ads, menus, etc.
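
    For reference, shelling out to lynx from Python is a one-liner (this
    assumes lynx is installed; -dump is the option mentioned above):

        # Sketch: call lynx to get a plain-text rendering of a page
        # (assumes lynx is installed; -dump is the option mentioned above).
        import subprocess

        def lynx_dump(url_or_path):
            result = subprocess.run(
                ["lynx", "-dump", url_or_path],
                capture_output=True, text=True, check=True,
            )
            return result.stdout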

    - I looked at this a while ago; the solution I came up with is not perfect,
    but it seems to do a pretty good job, at least with news-like web pages. The
    key idea is to look for a subsequence of the text with the highest (# of
    words) to (# of HTML tags) ratio. You can do this fairly easily if you
    first tokenize the web page into sequences of HTML tags and sequences of
    non-tag text, and then use simple dynamic programming to find the longest
    contiguous sequence that is "mostly" words. The only thing this misses is
    web pages that look like "first paragraph of article <some ads> rest of
    article"; in these cases, the first paragraph is often lost. You could
    probably fix this heuristically, but I didn't.
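
    One way to realise this (a Python sketch of my reading of the description,
    not the respondent's actual code; the tag penalty is an arbitrary
    assumption): score each token, then take the contiguous span with the
    maximum total score, which is a standard maximum-subarray pass.

        # Sketch of the word/tag-ratio idea: tokenise into tags and words,
        # score words +1 and tags -tag_penalty, and find the contiguous
        # span with the maximum score (maximum-subarray / Kadane's method).
        import re

        TOKEN = re.compile(r"<[^>]*>|[^<\s]+")

        def main_text(html, tag_penalty=3.0):
            tokens = TOKEN.findall(html)
            scores = [(-tag_penalty if t.startswith("<") else 1.0)
                      for t in tokens]

            best_sum = 0.0
            best_start = best_end = 0
            cur_sum, cur_start = 0.0, 0
            for i, s in enumerate(scores):
                if cur_sum <= 0:
                    cur_sum, cur_start = s, i
                else:
                    cur_sum += s
                if cur_sum > best_sum:
                    best_sum, best_start, best_end = cur_sum, cur_start, i + 1

            words = [t for t in tokens[best_start:best_end]
                     if not t.startswith("<")]
            return " ".join(words)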

    - Since I was scraping text from a limited number of Albanian-language
    websites, it was easy for me to search for repeated text in every page
    coming from the same site. The repeated text was removed, so I kept only
    one copy of the pages containing everything. One of the sites I was
    spidering changed its format three times in two months, which generated
    quite a bit of noise. The only way to get rid of it was to go back to
    regexes, and I ended up using only regexes in the end. As for the "sudden"
    format changes, you could use the absence of text repetition as a signal
    that something has changed and then modify the regexes accordingly.
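
    As an illustration of the site-specific regex route, a small Python
    sketch; the marker string is a hypothetical placeholder and real patterns
    have to be read off each site's actual markup:

        # Sketch of the site-specific regex route. The pattern below is a
        # hypothetical placeholder; real markers must be read off each
        # site's markup and revised whenever the site changes format.
        import re

        ARTICLE = re.compile(r'<div class="article-body">(.*?)</div>',
                             re.DOTALL)

        def extract_article(html):
            match = ARTICLE.search(html)
            return match.group(1) if match else None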

    - We have developed a tool in Java, available through SourceForge, to help
    people with this task and others where some fragment of a web page needs to
    be identified and/or extracted. We have experimented with tagging and
    extracting the main text, navigation links, title, headers, etc. from news
    stories on various sites on the web. Our software, PARCELS, also partially
    handles sites that use XHTML/CSS (e.g. <DIV> tags) to place text.

    You can find PARCELS on sourceforge at http://parcels.sourceforge.net

    It may be overkill for a simple problem, but if you need to extract the same
    type of information from multiple websites with different formats, this
    toolkit may be of help.

    - My approach is based on the HTML tags, rather than the more elaborate DOMs
    and REs (as suggested in other responses to this message). The problem in
    basic HTML is that <p>'s don't have to be closed. But, you can assume that
    if you've got an opening <p>, then any prior one is now closed. So, now
    you've got a stretch of material and you can examine it for any other tags,
    which almost always have a closing tag, and remove those tags, and perhaps
    what's in them. This will get rid of links, <img> elements, etc. This is
    the starting point for your algorithm, and you then refine it from there.
    (One main problem with a <p> is that it may be embedded in a table, so you
    have to decide what you want to do with tabular material.)

    Clearly, basic HTML is the most difficult; XHTML wouldn't have as many
    problems. And then you start getting into all sorts of other web pages.
    Unless you have the resources (both time and money) to devote to a more
    elaborate solution, you can do surprisingly well with this approach.
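
    A rough sketch of this tag-driven approach, using Python's standard
    html.parser (the description above is language-neutral, so the details
    here are one possible reading): an opening <p> implicitly closes the
    previous paragraph, and text inside tags removed wholesale (links,
    scripts, tables) is skipped; <img> needs no handling since it contributes
    no text.

        # Sketch of the tag-driven paragraph extractor described above.
        from html.parser import HTMLParser

        SKIP = {"a", "script", "style", "table"}  # tabular material: your call

        class ParagraphExtractor(HTMLParser):
            def __init__(self):
                super().__init__()
                self.paragraphs = []
                self.current = None   # text pieces of the open <p>, or None
                self.skip_depth = 0   # how many SKIP elements we are inside

            def handle_starttag(self, tag, attrs):
                if tag == "p":
                    self._flush()     # a new <p> closes any prior one
                    self.current = []
                elif tag in SKIP:
                    self.skip_depth += 1

            def handle_endtag(self, tag):
                if tag == "p":
                    self._flush()
                elif tag in SKIP and self.skip_depth:
                    self.skip_depth -= 1

            def handle_data(self, data):
                if self.current is not None and not self.skip_depth:
                    self.current.append(data)

            def _flush(self):
                if self.current is not None:
                    text = "".join(self.current).strip()
                    if text:
                        self.paragraphs.append(text)
                    self.current = None

            def close(self):
                self._flush()         # catch a trailing unclosed <p>
                super().close()

        def paragraphs(html):
            parser = ParagraphExtractor()
            parser.feed(html)
            parser.close()
            return parser.paragraphs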

    - For this task I use Python and BeautifulSoup:
    http://www.crummy.com/software/BeautifulSoup/. It's an extremely flexible
    and robust DOM-ish parser, very well-suited for extracting bits of text out
    of web pages.
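
    A minimal example with the current bs4 package (the API has changed a
    little since 2005): strip script/style elements and take the visible text.

        # Minimal BeautifulSoup example: drop script/style, keep visible text.
        from bs4 import BeautifulSoup

        def visible_text(html):
            soup = BeautifulSoup(html, "html.parser")
            for el in soup(["script", "style"]):
                el.decompose()
            return soup.get_text(separator="\n", strip=True)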

    - The other tool for this purpose which no one has (so far) mentioned is
    tidy -- http://tidy.sourceforge.net. It will take almost any HTML and turn
    it into something usable very quickly; it is also very robust, and there is
    a choice of APIs for integrating it into your own production system.
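
    A small sketch of running tidy as a filter from Python (this assumes the
    tidy command-line tool is installed; -q suppresses the report and -asxhtml
    converts the output to XHTML):

        # Sketch: run tidy as a filter to normalise messy HTML before any
        # further processing. tidy exits 1 for warnings and 2 for errors,
        # so check=True is not used; diagnostics end up on result.stderr.
        import subprocess

        def tidy_html(html):
            result = subprocess.run(
                ["tidy", "-q", "-asxhtml"],
                input=html, capture_output=True, text=True,
            )
            return result.stdout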

    - First I extract the text from the HTML (I use lynx -dump for this).
    Next I count the number of times every line in the collection of files
    occurs. Then I (manually) scan through the generated list and set a more or
    less arbitrary threshold for filtering out the stuff I don't want, e.g. any
    line that occurs more than 10 times (keeping an eye out for lines which may
    have a high frequency for some other reason).

    This edited list is then used as a filter - all lines which feature in it
    are deleted from the collection of files.

    Despite its dirtiness, this might have certain advantages. It seems to work
    quite robustly and is very quick (at least, for modest corpora of
    ~1 million words). It allows you to remove things like "More >>" links
    which often occur at the end of paras, rather than in header/footer or
    navigation panels. Moreover, you are able to keep information about the
    frequency of boilerplate and furniture elements, while filtering them out of
    the main corpus.

    On the down side, it requires tailoring to each website from which you wish
    to collect data - which in our specific case happens not to be a problem.
    Some revision would be necessary if the corpus were to be updated with new
    material from a previously collected site. It is also likely that some
    things are cut which you might want to keep (e.g.
    frequent subheadings, which occur on many pages, whilst not coming under
    header/footer/navigation panel categories). Similarly, some unwanted text
    gets through.

    On the whole it seems to work well enough for us, though.
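
    A sketch of that line-frequency filter in Python (the threshold of 10 is
    the more or less arbitrary value mentioned above, and the candidate list
    should still be reviewed by hand):

        # Sketch of the line-frequency filter: count how often each line
        # occurs across the lynx-dumped files, treat lines above a threshold
        # as boilerplate (after a manual check), and strip them everywhere.
        import collections

        def frequent_lines(texts, threshold=10):
            counts = collections.Counter(
                line for text in texts for line in text.splitlines()
            )
            # Review this set by hand before applying it: some frequent
            # lines (e.g. common subheadings) may be worth keeping.
            return {line for line, n in counts.items() if n > threshold}

        def strip_boilerplate(text, boilerplate):
            kept = [l for l in text.splitlines() if l not in boilerplate]
            return "\n".join(kept)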

    - Another useful reference is the VIPS work from Microsoft:
    http://research.microsoft.com/research/pubs/view.aspx?tr_id=690

    Helge Thomas Hellerud


