Re: [Corpora-List] Request for help concerning a LSA problem

From: Nitin Madnani (nmadnani@gmail.com)
Date: Fri May 05 2006 - 22:33:35 MET DST


    I recommend that you look at TMG (Text-to-Matrix Generator) for
    Matlab. Matlab has excellent support for sparse arrays and TMG uses
    them natively. I can send you the code that I used in my own project,
    if you need it.

    Nitin

    On 5/5/06, Christopher Manning <manning@cs.stanford.edu> wrote:
    > Cecilie Desiree Widsteen wrote:
    > > Hello all,
    > >
    > > I'm currently trying to implement Latent Semantic Analysis, as part of
    > > an automatic classification system. I'm programming in Java, and using
    > > the Jama Matrix package for the matrix stuff. I have stumbled over some
    > > strange problems, and would be grateful if anyone on this list could
    > > offer some help.
    > > My problem is: I have implemented a class which takes care of building a
    > > matrix representation of a corpus, and performs SVD over the
    > > term-by-document matrix. Most of the operations are done by the Jama
    > > class "Matrix". This works fine, except for the fact that when I ran
    > > the program over various small test corpora (like, for instance, the one
    > > from Chapter 15 in Manning and Schütze's book Foundations of Statistical
    > > NLP) most of the right and left singular vectors contained the correct
    > > values but with the wrong (reversed) sign?! E.g. a vector that should have
    > > the values [-0.75,-0.28,-0.20, ...] is assigned the values [0.75,0.28,
    > > ...]. Unfortunately, I have limited experience with linear algebra and
    > > the like, so now I find myself completely at a loss in debugging this...
    >
    > This isn't a problem!!! This is the content of fn. 2 on p.561 of
    > anything-other-than-early printings of FSNLP:
    >
    > For any given SVD solution, you can get additional non-identical ones
    > by flipping signs in corresponding left and right singular vectors of
    > $T$ and $D$, and, if there are two or more identical singular values,
    > then the subspace determined by the corresponding singular vectors is
    > unique, but can be described by any appropriate orthonormal basis
    > vectors. But, apart from these cases, SVD is unique.
    >
    > The minuses cancel out when the three matrices are multiplied back
    > together, and so don't affect the solution.
    >
    > But, beyond that, I think you will find that you will have trouble doing
    > anything 'large scale' (i.e., text collections with vocabularies of 20,000
    > words or things like that) using Jama, because it only supports dense SVD
    > calculations (that is, using 20,000x20,000 matrices, which require a lot of
    > RAM). For text applications, it's usual to use something that supports
    > doing SVD on sparse matrices, like the classic SVDpack, Matlab, or, if
    > you're using Java, you might try MTJ:
    >
    > http://rs.cipr.uib.no/mtj/
    >
    > Chris.
    >
    >
    >
    >
    > > As far as I can understand, this means that my vectors are pointing in
    > > the opposite direction from the one they should, but why this is escapes
    > > my understanding :)
    > > Any help, hints, tricks and the like are extremely welcome! I can also
    > > send over the source code on request.
    > >
    > > Regards,
    > > --
    > > Cecilie D. Widsteen
    > > Department of Linguistics
    > > University of Oslo
    > >
    > >
    >
    >
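
The sign-flip point in the quoted reply can be checked numerically. What follows is only a minimal sketch in plain Java, with no Jama calls: the class name `SvdSignFlipDemo`, its helper methods, and the 2x2 toy factors are all hypothetical and chosen by hand for illustration. It rebuilds the matrix from its factors before and after negating one matching pair of left and right singular vectors; the two results should agree.

```java
// Hypothetical demo class (not part of Jama, MTJ, or FSNLP).
class SvdSignFlipDemo {

    // Rebuilds A = U * diag(s) * V^T from factors stored as plain arrays.
    // Columns of u and v hold the left and right singular vectors.
    static double[][] reconstruct(double[][] u, double[] s, double[][] v) {
        int rows = u.length, cols = v.length, rank = s.length;
        double[][] a = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < rank; k++) {
                    a[i][j] += u[i][k] * s[k] * v[j][k];
                }
            }
        }
        return a;
    }

    // Returns a copy of the matrix with column k negated.
    static double[][] flipColumn(double[][] m, int k) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            out[i] = m[i].clone();
            out[i][k] = -out[i][k];
        }
        return out;
    }

    // Largest absolute entry-wise difference between two same-sized matrices.
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Toy factors (assumed values): both u and v have orthonormal columns.
        double[][] u = {{0.6, -0.8}, {0.8, 0.6}};
        double[] s = {3.0, 1.0};
        double[][] v = {{0.8, -0.6}, {0.6, 0.8}};

        double[][] original = reconstruct(u, s, v);
        // Negate the first left AND the first right singular vector together.
        double[][] paired = reconstruct(flipColumn(u, 0), s, flipColumn(v, 0));
        // Negate only the left one: this is no longer an SVD of the same matrix.
        double[][] oneSided = reconstruct(flipColumn(u, 0), s, v);

        System.out.println("paired flip, max difference:    " + maxAbsDiff(original, paired));
        System.out.println("one-sided flip, max difference: " + maxAbsDiff(original, oneSided));
    }
}
```

When comparing program output against a textbook, this suggests comparing the reconstructed product (or the vectors up to a common sign per pair) rather than the raw entries of the singular vectors.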

    --
    Got Blog?
    http://greenideas.blogspot.com
    


