Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Martin Thomas (martint@comp.leeds.ac.uk)
Date: Wed Aug 10 2005 - 10:09:53 MET DST


    Hi all,

    I have been playing with a very crude approach to this problem
    (removal of boilerplate and other furniture such as
    header/footer/navigation panels), which I don't think anyone has
    mentioned here yet...

    First I extract the text from the HTML (I use lynx -dump for this).
    Next I count how many times each line occurs across the collection of
    files. Then I (manually) scan through the generated list and set a
    more or less arbitrary threshold for filtering out the stuff I don't
    want, e.g. any line that occurs more than 10 times (keeping an eye out
    for lines which may have a high frequency for some other reason).
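    For concreteness, a rough Python sketch of this counting step is below.
    The dumps/ directory, the output file name and the threshold of 10 are
    just illustrative assumptions, not part of our actual setup:

        import glob
        from collections import Counter

        counts = Counter()
        # assumes each page's lynx -dump output is saved as a .txt file in dumps/
        for path in sorted(glob.glob("dumps/*.txt")):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    counts[line.rstrip("\n")] += 1

        # write candidate boilerplate lines (above an arbitrary threshold)
        # for manual inspection before they are used as a filter
        THRESHOLD = 10
        with open("candidate_boilerplate.txt", "w", encoding="utf-8") as out:
            for line, n in counts.most_common():
                if n > THRESHOLD:
                    out.write(f"{n}\t{line}\n")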

    This edited list is then used as a filter - all lines which feature in
    it are deleted from the collection of files.
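    A similarly rough sketch of the filtering step, again with made-up file
    names, assuming the list written above has been manually edited in place:

        import glob
        import os

        boilerplate = set()
        with open("candidate_boilerplate.txt", encoding="utf-8") as f:
            for raw in f:
                # each entry is "<frequency>\t<line>" from the counting step
                boilerplate.add(raw.rstrip("\n").split("\t", 1)[-1])

        # write filtered copies of the dumps, with boilerplate lines deleted
        os.makedirs("clean", exist_ok=True)
        for path in glob.glob("dumps/*.txt"):
            with open(path, encoding="utf-8") as f:
                kept = [line for line in f if line.rstrip("\n") not in boilerplate]
            out_path = os.path.join("clean", os.path.basename(path))
            with open(out_path, "w", encoding="utf-8") as out:
                out.writelines(kept)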

    Despite its dirtiness, this might have certain advantages. It seems to
    work quite robustly and is very quick (at least, for modest corpora of
    ~1 million words). It allows you to remove things like "More >>" links,
    which often occur at the end of paragraphs rather than in header/footer
    or navigation panels. Moreover, you are able to keep information about the
    frequency of boilerplate and furniture elements, while filtering them
    out of the main corpus.

    On the down side, it requires tailoring to each website from which you
    wish to collect data - which in our specific case happens not to be a
    problem. Some revision would be necessary if the corpus were to be
    updated with new material from a previously collected site. It is also
    likely that some things are cut which you might want to keep (e.g.
    frequent subheadings, which occur on many pages but do not fall under
    the header/footer/navigation panel categories). Similarly, some unwanted
    text gets through.

    On the whole it seems to work well enough for us, though.

    Best,
    Martin Thomas

    Centre for Translation Studies
    University of Leeds

    PS - I hope this message doesn't appear twice - sorry if it does - I
    originally sent it from a non-member email account.


