Hi,
> I have large (1 million by 1 million) term-term matrices. What SVD
> packages work with such massive datasets? I have tried Matlab and
> SVDPACKC without much success.
Both Matlab and SVDPACK(C), which reads the Harwell-Boeing format, can store
the matrix in sparse form. This means that the dimensionality (= number of
terms) does not really matter; what matters is the number of non-zero
entries. To solve your problem, you could either:
- adjust the constants in the SVDPACKC source code that set the maximum
dimensionality and the maximum number of non-zero entries, recompile, and run
the SVD on a machine with lots of memory.
Ted Pedersen's SenseClusters software uses SVDPACKC and its documentation
gives good advice regarding the values that you need to tweak.
or
- reduce the number of terms and/or the number of non-zero entries. A sensible
first step would be to throw away terms that occur fewer than 5 times in your
corpus. If the matrix is still too big, also throw away all entries that are
below a certain threshold (e.g. all entries whose value is only 1).
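
A minimal sketch of the second option, in plain Python. It assumes the
term-term matrix is held as a dictionary of non-zero cells and that corpus
frequencies are available per term. The function name, the parameter names
and the toy data are made up for illustration; only the cutoff of 5 and the
"drop the 1s" threshold come from the suggestion above.

```python
def prune_cooccurrences(entries, term_counts, min_term_count=5, min_value=2):
    """Return a pruned copy of a sparse term-term matrix.

    entries:     dict mapping (row_term, col_term) -> co-occurrence value;
                 only non-zero cells are stored (assumed representation).
    term_counts: dict mapping term -> frequency in the corpus (assumed input).
    """
    # Step 1: keep only terms that occur at least min_term_count times.
    kept_terms = {t for t, c in term_counts.items() if c >= min_term_count}

    pruned = {}
    for (row, col), value in entries.items():
        # A cell survives only if both of its terms survive ...
        if row not in kept_terms or col not in kept_terms:
            continue
        # Step 2: ... and its value reaches the threshold (2 drops the 1s).
        if value >= min_value:
            pruned[(row, col)] = value
    return pruned


# Toy data, invented for this example.
term_counts = {"bank": 12, "river": 7, "money": 9, "zyzzyva": 1}
entries = {
    ("bank", "river"): 3,
    ("bank", "money"): 5,
    ("river", "money"): 1,
    ("bank", "zyzzyva"): 1,
    ("zyzzyva", "river"): 2,
}

pruned = prune_cooccurrences(entries, term_counts)
print(len(entries), "->", len(pruned), "non-zero entries")
```

Passing min_value=1 applies only the term-frequency cutoff, which matches
trying the first step alone before thresholding the entries. The surviving
cells would then have to be written out in whatever sparse input format your
SVD package expects.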
Cheers,
Yannick
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352