Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Aug 10 2005 - 11:23:52 MET DST

    If the original question (or at least Mike's new question) was how to
    remove boilerplate (as opposed to cleaning HTML and representing HTML
    document structure), we got reasonably good results with the "difference
    between token count and tag count" method that has already been mentioned,
    and that is implemented in Aidan Finn's BTE modules.

    1) remove javascript and comments

    2) transform all html tags into a special symbol, e.g. XXTAG

    3) tokenize everything else (a rough tokenization method -- e.g., split on
    whitespace -- is good enough for the purpose) and map each token to a
    special symbol, e.g. TOKEN, so that you get a representation of the
    document as:

    ...
    XXTAG XXTAG TOKEN TOKEN TOKEN XXTAG XXTAG TOKEN TOKEN TOKEN
    TOKEN TOKEN XXTAG TOKEN TOKEN XXTAG TOKEN TOKEN XXTAG
    XXTAG XXTAG TOKEN TOKEN
    ...

    4) for each possible beginning and ending point, compute
    N(TOKEN) - N(XXTAG)
    and choose the stretch where this quantity is highest.

    5) print only the TOKENs from the chosen stretch.

    Typically, the "low tag density" area found in this way does not contain
    boilerplate.
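
    To make the steps above concrete, here is a minimal brute-force sketch
    in Python (the function name and the rough regexes are just
    illustrative; this is not Finn's code, nor the Perl module mentioned
    below):

    import re

    def extract_body_text(html):
        # step 1: drop scripts and comments (and <style>, for good
        # measure); rough regexes are good enough for a sketch
        html = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', html)
        html = re.sub(r'(?s)<!--.*?-->', ' ', html)

        # steps 2-3: flatten the page into a sequence of items, where
        # is_tag=True plays the role of XXTAG and is_tag=False of TOKEN
        items = []
        for piece in re.split(r'(<[^>]*>)', html):
            if piece.startswith('<') and piece.endswith('>'):
                items.append((True, piece))
            else:
                for tok in piece.split():   # rough whitespace tokenization
                    items.append((False, tok))

        # step 4: brute force over all start/end points, maximizing
        # N(TOKEN) - N(XXTAG) inside the stretch
        best_score, best_span = float('-inf'), (0, -1)
        for i in range(len(items)):
            score = 0
            for j in range(i, len(items)):
                score += -1 if items[j][0] else 1
                if score > best_score:
                    best_score, best_span = score, (i, j)

        # step 5: keep only the TOKENs from the chosen stretch
        i, j = best_span
        return ' '.join(text for is_tag, text in items[i:j + 1] if not is_tag)

    The double loop makes this quadratic in the number of XXTAG/TOKEN
    items; the implementation notes below show how to cut down the work
    considerably.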

    Notes on the implementation:

    - it's probably safe to start looking for the best stretch from the first
    TOKEN after </head>

    - you can make the implementation a lot more efficient by observing that
    the best stretch will always begin at the beginning of a TOKEN sequence
    and end at the end of a TOKEN sequence; thus: 1) there is no need to
    even consider stretches that begin/end with an XXTAG; 2) there is no
    need to consider stretches that begin with a TOKEN preceded by another
    TOKEN or end with a TOKEN followed by another TOKEN. Indeed, the
    document can be represented in a compact way as an array of counts where
    the even-numbered positions -- treating 0 as even -- hold counts of
    TOKEN sequences (positive values) and the odd positions hold counts of
    XXTAG sequences (negative values); if you start counting from the first
    TOKEN after </head>, the first slot will always be occupied by a TOKEN.
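
    A Python sketch of that run-length idea (again only illustrative; the
    function name and details are not taken from Finn's code or from the
    Perl module mentioned below): it takes the items sequence from the
    sketch above, cut so that it starts at the first TOKEN after </head>,
    and returns the item indices of the best stretch:

    def best_stretch_runs(items):
        # items: (is_tag, text) pairs starting at the first TOKEN after
        # </head>, so the run at even position 0 is a TOKEN run and the
        # runs alternate in sign from there on
        runs, starts = [], []
        for pos, (is_tag, _) in enumerate(items):
            delta = -1 if is_tag else 1
            if runs and (runs[-1] > 0) == (delta > 0):
                runs[-1] += delta            # extend the current run
            else:
                runs.append(delta)           # open a run of the other kind
                starts.append(pos)
        if not runs:
            return None

        # prefix sums over the signed run counts: the stretch covering
        # runs[i..j] scores prefix[j + 1] - prefix[i]
        prefix = [0]
        for r in runs:
            prefix.append(prefix[-1] + r)

        # only even positions (TOKEN runs) can begin or end the best stretch
        best_score, best = float('-inf'), (0, 0)
        for i in range(0, len(runs), 2):
            for j in range(i, len(runs), 2):
                score = prefix[j + 1] - prefix[i]
                if score > best_score:
                    best_score, best = score, (i, j)

        # map run indices back to item indices (runs[j] > 0 for a TOKEN run)
        i, j = best
        return starts[i], starts[j] + runs[j] - 1

    The search is still quadratic, but in the number of runs rather than in
    the number of individual TOKENs and XXTAGs, which is where the speed-up
    comes from.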

    - I wrote a module in Perl that includes large chunks of Finn's BTE code
    but uses the previous observations to speed up the computation; at least
    in my experiments, it is much faster than Finn's original
    implementation. If anybody is interested, I can send it to them (I plan
    to make it public eventually, but at the moment it is "scantily
    documented", to say the least...)

    - we have only experimented with German, Italian and English (and my
    script returns latin1 output); languages that do not use whitespace to
    separate tokens would require a more intelligent tokenization strategy,
    and I have no idea whether the method would still work on agglutinative
    languages, where the number of "tokens" is likely to be lower than for
    the synthetic languages we worked on.

    Notes on the results:

    - the method is good if you are collecting a lot of documents from many
    different sites, since it does not depend on site-specific
    characteristics; of course, if you are collecting documents from only a
    few sites, you are likely to get better results by studying their
    structure and extracting the interesting content with appropriate
    regular expressions.

    - If you are collecting a very large corpus, the method has an advantage
    over looking for sentences/paragraphs that are repeated across documents
    in that you only need to analyze one document at a time. This means that
    the demand on RAM is more or less constant, no matter how large the corpus
    is, and that if you have X machines available you can split the corpus
    into X parts and process them separately in parallel (of course, methods
    requiring global statistics could also be parallelized, but that would
    require more work).

    - The method tends to have better precision than recall: you get a
    section of a page that is relatively clean, but it may be just a
    fragment of the clean part of the page, which could be larger than what
    you got.

    - Because of the previous observation, the method works well if you are
    sampling from many pages, and it's fine to have "snapshots" of all the
    pages. If, instead, you are interested in the structure of single pages,
    the method is probably too destructive for your purposes.

    - Depending on implementation details (in particular, which stretch you
    choose when more than one has the highest value for the quantity above),
    you can get funny phenomena at the beginning and end of the extracted
    stretch (e.g., an incomplete sentence at the beginning, and a bit of
    boilerplate at the end).

    Because of these factors, I would recommend the BTE approach (with a
    faster re-implementation than the original one) to those who are
    interested in sampling from many pages in a scalable way, are not
    interested in the structure of single pages, and can live with a certain
    amount of noise. In short, if you are building a large web corpus from a
    large variety of sources, it could be the right way to go; if you are
    building a smaller corpus from more controlled sources, you can probably
    get better results with a more ad hoc method.

    Regards,

    Marco


