Re: [Corpora-List] SVD on high-dimension data

From: Dominic Widdows (widdows@maya.com)
Date: Tue Mar 06 2007 - 17:21:02 MET

    Dear Jamie, David,

    I'm delighted to hear about your success using Infomap, thanks!

    However, I feel I should chime in with a couple of words of warning.
    Infomap works by selecting a comparatively small number of "content
    bearing words" as column labels. These are normally chosen based upon
    frequency, e.g., we have typically used the 1000 most frequent non-
    stop words as column labels. This is a far cry from your 1 million by
    1 million matrix. If Infomap were configured to treat all these terms
    as column labels, it would try to malloc a 1 million by 1 million
    matrix, which (if your matrix entry type is a 4-byte float) comes to
    something like 4 terabytes of RAM! That's before you've even tried to
    do anything computationally intensive with the matrix. By the time
    you have a computer with that much memory, I practically guarantee
    that 1 million terms will be considered a small dataset, so I believe
    that the scalability of software like Infomap is always going to be
    limited unless we make some radical changes to the way the software
    works. I'm hoping to do this at some point, but in the meantime, if
    you want to use Infomap, your number of columns is limited.
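
    Just to spell out the arithmetic, here is a quick illustrative
    sketch in Python (nothing to do with Infomap's own code, which is
    written in C):

        # Back-of-envelope memory cost of a dense 1M x 1M matrix.
        n = 1000000          # rows and columns (vocabulary size)
        entry_bytes = 4      # one 4-byte float per cell
        print(n * n * entry_bytes / 1e12, "terabytes")  # -> 4.0 terabytes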

    We should probably use sparse matrices to count the cooccurrences in
    the first place, but even if we could get this far, we'd run into
    scaling issues with SVD computation at some point. I'm not sure which
    weak link would break first - SVDPACKC does take advantage of some
    sparseness in the matrix format but it certainly involves a huge
    amount of number crunching for large matrices.
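
    To make the sparse-matrix idea concrete, here is a rough sketch of
    what I have in mind (in Python with SciPy, purely illustrative;
    the corpus, vocabulary, window size, and rank below are all
    placeholders, not anything Infomap actually does):

        # Sketch: accumulate cooccurrence counts sparsely, then take a
        # truncated SVD so the dense n-by-n matrix is never built.
        from collections import Counter
        from scipy.sparse import coo_matrix
        from scipy.sparse.linalg import svds

        def cooccurrence_svd(sentences, vocab, window=2, k=100):
            index = {w: i for i, w in enumerate(vocab)}
            counts = Counter()
            for sent in sentences:
                ids = [index[w] for w in sent if w in index]
                for i, wi in enumerate(ids):
                    # Pair each token with the preceding `window` tokens.
                    for wj in ids[max(0, i - window):i]:
                        counts[(wi, wj)] += 1
                        counts[(wj, wi)] += 1
            rows, cols = zip(*counts.keys())
            mat = coo_matrix((list(counts.values()), (rows, cols)),
                             shape=(len(vocab), len(vocab))).tocsr()
            # svds computes only the top-k singular triplets of a sparse
            # matrix; k must be smaller than len(vocab).
            u, s, vt = svds(mat.asfptype(), k=k)
            return u * s   # k-dimensional term vectors

    Even then, the iterative solver has to do a great many sparse
    matrix-vector products, so the number crunching doesn't go away.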

    Best wishes,
    Dominic

    On Mar 6, 2007, at 10:38 AM, David Reitter wrote:

    > Jamie,
    >
    > On 6 Mar 2007, at 14:59, Jamie Smith wrote:
    >
    >> I have large (1 million by 1 million) term-term matrices. What SVD
    >> packages work with such massive datasets? I have tried Matlab and
    >> SVDPACKC without much success.
    >
    > Have a look at Infomap,
    >
    > http://infomap-nlp.sourceforge.net/
    > http://infomap.stanford.edu/
    >
    > we've used it successfully on the Aquaint and DUC2005 data (100+
    > million words).
    >
    >
    > --
    > David Reitter
    > ICCS/HCRC, Informatics, University of Edinburgh
    > http://www.david-reitter.com
    >


