Re: [Corpora-List] SVD on high-dimension data

From: Yannick Versley (versley@sfs.uni-tuebingen.de)
Date: Tue Mar 06 2007 - 16:10:30 MET


    Hi,

    > I have large (1 million by 1 million) term-term matrices. What SVD
    > packages work with such massive datasets? I have tried Matlab and
    > SVDPACKC without much success.
    Both Matlab and the Harwell-Boeing format used by SVDPACK(C) store matrices
    in sparse form, which means that the dimensionality (= number of terms) does
    not really matter; what matters is the number of non-zero entries. To solve
    your problem, you could either:
    - adjust the constants in the SVDPACKC source code that set the maximum
    dimensionality and the maximum number of non-zero entries, and run the SVD
    on a machine with lots of memory.
    Ted Pedersen's SenseClusters software uses SVDPACKC, and its documentation
    gives good advice on which values you need to tweak.
    or
    - try to reduce the number of terms and/or the number of non-zero entries.
    A sensible first step would be to throw away terms that occur fewer than
    5 times in your corpus and, if the matrix is still too big, to throw away
    all entries that fall below a certain threshold (e.g. all entries whose
    value is only 1).
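
    To make the two points above concrete, here is a minimal Python sketch
    (not part of SVDPACKC or Matlab). It assumes the co-occurrence counts are
    held as a dictionary of non-zero cells. The function names, the toy data
    and the figure of 16 bytes per stored entry (one 8-byte value plus two
    4-byte indices) are illustrative assumptions, and the real
    Harwell-Boeing layout is somewhat more compact than that.

```python
def dense_bytes(n_terms, bytes_per_value=8):
    """Size of a full n_terms x n_terms matrix of 8-byte values."""
    return n_terms * n_terms * bytes_per_value


def sparse_bytes(nnz, bytes_per_entry=16):
    """Rough size of sparse storage: cost grows with the non-zero count only.

    Assumption: 16 bytes per non-zero (8-byte value + two 4-byte indices).
    """
    return nnz * bytes_per_entry


def prune(cooc, term_freq, min_term_freq=5, min_count=2):
    """Drop cells involving rare terms, then cells with a count below min_count.

    cooc      -- {(row_term, col_term): count}, non-zero cells only
    term_freq -- {term: corpus frequency}
    """
    keep = {t for t, f in term_freq.items() if f >= min_term_freq}
    return {
        (r, c): v
        for (r, c), v in cooc.items()
        if r in keep and c in keep and v >= min_count
    }


# Hypothetical toy data, for illustration only.
term_freq = {"bank": 12, "money": 9, "river": 6, "zloty": 2}
cooc = {
    ("bank", "money"): 7,
    ("bank", "river"): 3,
    ("money", "river"): 1,   # dropped: count of only 1
    ("bank", "zloty"): 2,    # dropped: "zloty" occurs fewer than 5 times
    ("zloty", "money"): 1,   # dropped for both reasons
}
pruned = prune(cooc, term_freq)
print(len(cooc), "->", len(pruned), "non-zero entries")
```

    With these assumptions, a dense 1 million by 1 million matrix would need
    dense_bytes(1_000_000) bytes, i.e. 8 TB, whereas a sparse matrix with a
    hypothetical 50 million non-zero entries would need about 800 MB, so
    pruning the non-zeros is what actually shrinks the problem.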

    Cheers,
    Yannick

    -- 
    Yannick Versley
    Seminar für Sprachwissenschaft, Abt. Computerlinguistik
    Wilhelmstr. 19, 72074 Tübingen
    Tel.: (07071) 29 77352
    


