Yes, sorry if that wasn't clear.
n and w are the same 'word', but n comes from the document and IDF(w) comes
from the large corpus.
Clive
----- Original Message -----
From: "Gaël Dias" <ddg@di.ubi.pt>
To: "Clive De Silva" <cd334@cam.ac.uk>
Cc: <chenwl@mail.neu.edu.cn>; <corpora@hd.uib.no>
Sent: Wednesday, July 07, 2004 4:12 PM
Subject: Re: [Corpora-List] How to word presentation for word clustering?
Be careful,
IDF is unique to a word and does not depend on the document,
so you have:
vector w = { tf(1)*IDF(w), tf(2)*IDF(w), ..., tf(n)*IDF(w) }
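
As a rough illustration of the formula above, here is a minimal Python sketch. It assumes tf(i) is the raw count of the word in document i and that IDF(w) is computed once from a separate reference corpus. The function names, the add-one smoothing and the toy data are assumptions made for this example only; they are not taken from anyone's actual setup.

```python
import math


def idf(word, corpus_docs):
    """IDF of `word` over a reference corpus given as a list of token lists."""
    # Document frequency: number of corpus documents containing the word.
    df = sum(1 for doc in corpus_docs if word in doc)
    # Add-one smoothing (an assumption) avoids division by zero for unseen words.
    return math.log((1 + len(corpus_docs)) / (1 + df))


def word_vector(word, documents, corpus_docs):
    """One component per document: tf(i) * IDF(w)."""
    # IDF(w) is computed once; it does not change from document to document.
    w_idf = idf(word, corpus_docs)
    return [doc.count(word) * w_idf for doc in documents]


# Toy data, purely hypothetical.
documents = [["cat", "sat", "cat"], ["dog", "sat"], ["cat"]]
large_corpus = [["cat", "dog"], ["dog"], ["bird"], ["cat"]]

vec = word_vector("cat", documents, large_corpus)
```

Only tf varies across the components of vec; the IDF factor is the same in every one of them.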
Gaël.
Clive De Silva wrote:
> Dear Chen Wenliang,
>
> I am using TF*IDF values as my representation for words.
> vector w = { tf(1)*IDF(1), tf(2)*IDF(2), ..., tf(n)*IDF(n) } where the IDF is
> computed from a large corpus. This seems to give better results than just
> the raw frequency counts.
> The representations I investigated were: TF, TF*IDF and simple binary counts
> (1 if the word exists in the vector and 0 if it doesn't).
>
> Regards,
>
> Clive De Silva
> University of Cambridge
--
---------------------------------------------------------
Gaël Harry Dias, PhD            | Assistant Professor
Human Language Technology Group | [www.di.ubi.pt/~ddg]
Computer Science Department     | [ddg@di.ubi.pt]
Beira Interior University       | [Tel: +351 275 319 700]
6201-001 - Covilhã - PORTUGAL   | [Fax: +351 275 319 732]
---------------------------------------------------------