[Corpora-List] near duplicate detection

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Thu Jun 02 2005 - 14:28:56 MET DST

    Dear Linda,

    There was a thread about near-duplicate detection on the list in late
    December/early January -- perhaps there is also something useful for your
    problem there.

    In particular, Marc Kupietz made his tool for near dup detection
    available:

    http://torvald.aksis.uib.no/corpora/2004-3/0374.html

    We also have a tool that we hope to be able to make available in a week
    or so (it requires MySQL, and I'm not sure it would run on any platform
    other than Linux...)

    Best regards,

    Marco


