Re: [Corpora-List] Request for help concerning a LSA problem

From: Nitin Madnani (nmadnani@gmail.com)
Date: Fri May 05 2006 - 22:33:35 MET DST

Next message: Mohand-Said Hacid: "[Corpora-List] CFP - OTM 2006 Federated Conferences"

Previous message: Ananiadou, Sophia: "[Corpora-List] TEXT MINING RESEARCH POSITION AT THE UNIVERSITY OF MANCHESTER"
In reply to: Christopher Manning: "Re: [Corpora-List] Request for help concerning a LSA problem"
Next in thread: Cecilie Desiree Widsteen: "Re: [Corpora-List] Request for help concerning a LSA problem"

I recommend that you look at TMG (Text-to-Matrix Generator) for
Matlab. Matlab has excellent support for sparse arrays and TMG uses
them natively. I can send you the code I used in my own project if
you need it.

Nitin
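
To put rough numbers on why sparse storage matters for a term-by-document
matrix, here is a back-of-the-envelope sketch. The 20,000 x 20,000 size and
the dense case come from the discussion in this thread; the figure of 100
distinct terms per document, the compressed-column layout, and the function
names are illustrative assumptions, not measurements of TMG, Matlab, or Jama.

```python
def dense_bytes(rows, cols, bytes_per_value=8):
    """Memory for a dense matrix of 8-byte doubles (one value per cell)."""
    return rows * cols * bytes_per_value


def csc_bytes(cols, nonzeros, bytes_per_value=8, bytes_per_index=4):
    """Memory for a compressed-sparse-column layout (assumed layout):
    one value and one row index per nonzero, plus cols + 1 column pointers."""
    return nonzeros * (bytes_per_value + bytes_per_index) + (cols + 1) * bytes_per_index


ROWS = COLS = 20_000          # size mentioned in the thread
TERMS_PER_DOC = 100           # hypothetical average, for illustration only
NONZEROS = COLS * TERMS_PER_DOC

dense = dense_bytes(ROWS, COLS)
sparse = csc_bytes(COLS, NONZEROS)

print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e6:.2f} MB")
```

Under these assumptions the dense matrix needs gigabytes while the sparse
one needs tens of megabytes, and that is before the SVD routine allocates
any working space of its own.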

On 5/5/06, Christopher Manning <manning@cs.stanford.edu> wrote:
> Cecilie Desiree Widsteen wrote:
> > Hello all,
> >
> > I'm currently trying to implement Latent Semantic Analysis, as part of
> > an automatic classification system. I'm programming in Java, and using
> > the Jama Matrix package for the matrix stuff. I have stumbled over some
> > strange problems, and would be grateful if anyone on this list could
> > offer some help.
> > My problem is: I have implemented a class which takes care of building a
> > matrix representation of a corpus, and performs SVD over the
> > term-by-document matrix. Most of the operations are done by the Jama
> > class "Matrix". This works fine, except for the fact that when I ran
> > the program over various small test corpora (like, for instance, the one
> > from Chapter 15 in Schütze and Manning's book Foundations of Statistical
> > NLP) most of the right and left singular vectors contained the correct
> > values but with the wrong/reversed sign?! E.g. a vector that should have the
> > values [-0.75,-0.28,-0.20, ...] is assigned the values [0.75,0.28,
> > ...]. Unfortunately, I have limited experience with linear algebra and
> > the like, so now I find myself completely at a loss in debugging this...
>
> This isn't a problem!!! This is the content of fn. 2 on p.561 of
> anything-other-than-early printings of FSNLP:
>
> For any given SVD solution, you can get additional non-identical
> ones by flipping signs in corresponding left and right singular
> vectors of T and D, and, if there are two or more identical
> singular values, then the subspace determined by the corresponding
> singular vectors is unique, but can be described by any appropriate
> orthonormal basis vectors. But, apart from these cases, SVD is
> unique.
>
> The minuses cancel out and so don't affect the solution.
>
> But, beyond that, I think you will find that you will have trouble doing
> anything 'large scale' (i.e., text collections with vocabularies of 20,000
> words or things like that) using Jama, because it only supports dense SVD
> calculations (that is, using 20,000x20,000 matrices, which require a lot of
> RAM). For text applications, it's usual to use something that supports
> doing SVD on sparse matrices, like the classic SVDpack, Matlab, or, if
> you're using Java, you might try MTJ:
>
> http://rs.cipr.uib.no/mtj/
>
> Chris.
>
>
>
>
> > As far as I can understand, this means that my vectors are pointing in
> > the opposite direction from the one they should, but why this is escapes
> > my understanding :)
> > Any help, hints, tricks and the like are extremely welcome! I can also
> > send over the source code on request.
> >
> > Regards,
> > --
> > Cecilie D. Widsteen
> > Department of Linguistics
> > University of Oslo
> >
> >
>
>
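
The sign point in the quoted footnote can be checked numerically. The
sketch below builds a small matrix from a known decomposition U * diag(S) * V^T,
negates one matching pair of left and right singular vectors, and confirms
that the product is unchanged, whereas negating only the left vector does
change it. It is a minimal illustration in plain Python, not Jama code; the
2x2 rotation matrices, the singular values 3 and 1, and all function names
are made up for the example. The normalize_signs helper shows one possible
convention for comparing output against published values (make the
largest-magnitude entry of each left vector positive); it is an assumption
here, not something FSNLP or Jama prescribes.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reconstruct(u, s, v):
    """Return U * diag(s) * V^T."""
    diag = [[s[i] if i == j else 0.0 for j in range(len(s))]
            for i in range(len(s))]
    return matmul(matmul(u, diag), transpose(v))


def flip_column(m, k):
    """Return a copy of m with column k negated."""
    return [[-x if j == k else x for j, x in enumerate(row)] for row in m]


def normalize_signs(u, v):
    """Flip each (u_k, v_k) pair so the largest-magnitude entry of u_k is
    positive. One possible convention, assumed for illustration."""
    for k in range(len(u[0])):
        column = [row[k] for row in u]
        if max(column, key=abs) < 0:
            u, v = flip_column(u, k), flip_column(v, k)
    return u, v


def close(a, b, tol=1e-12):
    """Element-wise comparison of two matrices within a tolerance."""
    return all(abs(x - y) <= tol
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rotation(theta):
    """2x2 rotation matrix; its columns are orthonormal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Hypothetical decomposition: orthonormal U and V, singular values 3 and 1.
U, S, V = rotation(0.3), [3.0, 1.0], rotation(1.1)
A = reconstruct(U, S, V)

# Negate the first left AND the first right singular vector.
U_flipped, V_flipped = flip_column(U, 0), flip_column(V, 0)
A_both = reconstruct(U_flipped, S, V_flipped)

# Negate only the left one: this is no longer a decomposition of A.
A_one = reconstruct(U_flipped, S, V)

# Apply the sign convention to the flipped pair.
U_norm, V_norm = normalize_signs(U_flipped, V_flipped)

print("both flipped, same matrix:", close(A, A_both))
print("one flipped, same matrix: ", close(A, A_one))
print("normalized matches original U:", close(U_norm, U))
```

So if a library returns [0.75, 0.28, ...] where a book prints
[-0.75, -0.28, ...], the thing to check is that the matching vector on the
other side is negated as well, or simply that the reconstructed matrix
agrees with the original.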

--
Got Blog?
http://greenideas.blogspot.com


This archive was generated by hypermail 2b29 : Fri May 05 2006 - 22:32:50 MET DST