At 1/5/2005 05:59 AM, you wrote:
>Once the index construction is complete the lookup of
>(near) duplicates of a single document certainly takes almost no time.
>What actually takes 2 hours for 1.000.000 documents is the construction
>of the index and the computation of a complete similarity matrix (the
>output is certainly constrained by some minimum overlap ratio...) for
>all documents.
Sorry! I thought you meant that it took 2 hours to find the documents
similar to a single one once the index had been created. Indeed, creating
the initial index can take several hours. Once it is created, computing
similarities for a single document should be pretty fast.
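
To illustrate the distinction, here is a minimal Python sketch of the two
steps: a one-time index build, and a per-document lookup that only touches
the index entries of the query document. It is not the method either of us
actually uses. The word-shingle representation, the Jaccard overlap measure,
the function names (shingles, build_index, similar_to) and the parameters
k and min_overlap are all assumptions made for the example.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(docs, k=3):
    """One-time, expensive step: map each shingle to the ids of the documents containing it."""
    doc_shingles = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    index = defaultdict(set)
    for doc_id, doc_set in doc_shingles.items():
        for s in doc_set:
            index[s].add(doc_id)
    return index, doc_shingles


def similar_to(doc_id, index, doc_shingles, min_overlap=0.5):
    """Cheap step: return (id, ratio) pairs whose Jaccard overlap with doc_id is >= min_overlap."""
    query = doc_shingles[doc_id]
    shared = defaultdict(int)
    # Only documents sharing at least one shingle with the query are visited.
    for s in query:
        for other in index[s]:
            if other != doc_id:
                shared[other] += 1
    results = []
    for other, n in shared.items():
        union = len(query) + len(doc_shingles[other]) - n
        ratio = n / union
        if ratio >= min_overlap:
            results.append((other, ratio))
    return sorted(results, key=lambda r: -r[1])
```

Calling build_index once over the whole collection corresponds to the slow
part. Calling similar_to for one document corresponds to the fast lookup.
Running similar_to for every document would give the complete similarity
matrix, constrained by the minimum overlap ratio.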
Normand Peladeau
Provalis Research
www.simstat.com
This archive was generated by hypermail 2b29 : Wed Jan 05 2005 - 13:58:13 MET