Re: [Corpora-List] jumk java

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Mon Jun 27 2005 - 10:33:28 MET DST

    Do you mean javascript?

    I use vilistextum:

    http://bhaak.dyndns.org/vilistextum/

    and it seems to do a good job at removing javascript and html code.
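
    For anyone who wants to see what this kind of stripping involves, here
    is a minimal sketch using only Python's standard html.parser module.
    It is not how vilistextum works internally; the names TextExtractor
    and strip_markup are made up for illustration, and a real tool handles
    many more cases (broken markup, encodings, layout).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    """Return the text of an HTML string without tags or JavaScript."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

    Fragments are joined with a space so that text from adjacent block
    elements does not run together; the cost is that a word split by
    inline markup would come out as two tokens.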

    Also, BTE (part of the Hyppia project):

    http://smi.ucd.ie/hyppia/

    recommended to me on this list, tries to guess what the "interesting"
    content of a page is, and removes everything else (thus, not only html and
    javascript, but any text it believes to be boilerplate). If your goal is
    precision rather than recall (i.e., it's ok to occasionally throw away
    good content as long as what you keep is consistently good content), it
    does an excellent job. It's a bit slow, though.
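
    To make the precision/recall trade-off concrete, one could compare a
    tool's output with a hand-cleaned version of the same page. The sketch
    below does this at the token level; the function name and the
    whitespace tokenization are assumptions for illustration, not part of
    BTE or of any evaluation reported here.

```python
from collections import Counter


def token_precision_recall(extracted, gold):
    """Token-level precision and recall of extracted text against gold text.

    Both arguments are plain strings; tokens are whitespace-separated.
    """
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    # Multiset intersection: tokens that appear in both, with multiplicity.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall
```

    A cleaner that keeps only "the cat sat" from a page whose good content
    is "the cat sat on the mat" scores precision 1.0 and recall 0.5:
    everything it kept is good, but half of the good tokens were lost.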

    Regards,

    Marco


