[Corpora-List] SenseClusters v0.95 released (now supports LSA)

From: ted pedersen (tpederse@d.umn.edu)
Date: Sat Aug 26 2006 - 20:08:04 MET DST

We are pleased to announce the release of SenseClusters version 0.95.

SenseClusters is a freely available package that allows you to cluster
similar contexts, or to identify clusters of related words. It is fully
unsupervised, and can automatically discover the optimal number of
clusters in your text.

As of version 0.95, we fully support Latent Semantic Analysis for
context and word clustering, and we continue to improve the native
SenseClusters methods, which include the ability to cluster first and
second order representations of context.

SenseClusters can be downloaded from:

http://senseclusters.sourceforge.net/

You can also try out SenseClusters via our web interface:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

In both native and LSA modes, SenseClusters relies on lexical features
(such as unigrams, bigrams, and co-occurrences) that can be identified
in raw text. Tokenization is very flexible and is defined via
Perl regular expressions, so it is possible to work with many
languages besides English, and you can easily use tokenization
schemes other than white-space separated words, such as character-based
tokens (for example, two-letter sequences).
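
To illustrate how a regular expression can define what counts as a token,
here is a minimal sketch in Python. It is not SenseClusters code and does
not use the SenseClusters token file format; the names WORD_TOKEN,
TWO_LETTER_TOKEN, tokenize, and bigrams are hypothetical and chosen only
for this example.

```python
import re

# Hypothetical token definitions, assumed for illustration only.
# WORD_TOKEN matches runs of word characters; TWO_LETTER_TOKEN matches
# non-overlapping two-character sequences.
WORD_TOKEN = re.compile(r"\w+")
TWO_LETTER_TOKEN = re.compile(r"\w\w")


def tokenize(text, pattern):
    """Return every non-overlapping match of pattern, left to right."""
    return pattern.findall(text)


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


words = tokenize("the bank of the river", WORD_TOKEN)
pairs = tokenize("river", TWO_LETTER_TOKEN)
word_bigrams = bigrams(words)
```

Swapping the pattern changes the feature inventory without touching the
rest of the pipeline, which is the point of defining tokens by regular
expression.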

The native SenseClusters methods support traditional first order context
clustering, where you identify a feature set, and then determine which of
those features occur in the contexts you are clustering. The native
methods also support second order context clustering, where each word
is represented by a vector of the words with which it co-occurs.
All the words in a context to be clustered are replaced by their
associated vectors, and these vectors are averaged together to represent
that context. Note that you can also cluster the word vectors to identify
sets of related words.
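
The two representations described above can be sketched roughly as follows
in Python. This is an illustration under simplifying assumptions (binary
first order features, raw co-occurrence counts within a context, plain
averaging), not the SenseClusters implementation; all function names are
hypothetical.

```python
from collections import defaultdict


def first_order(context, features):
    """Binary vector: which of the chosen features occur in the context."""
    present = set(context)
    return [1 if f in present else 0 for f in features]


def cooccurrence_vectors(training_contexts, features):
    """Map each word to counts of the features it shares a context with."""
    index = {f: i for i, f in enumerate(features)}
    vectors = defaultdict(lambda: [0] * len(features))
    for context in training_contexts:
        for word in context:
            for other in context:
                if other != word and other in index:
                    vectors[word][index[other]] += 1
    return dict(vectors)


def second_order(context, word_vectors, dim):
    """Average the vectors of the context's words that have one."""
    found = [word_vectors[w] for w in context if w in word_vectors]
    if not found:
        return [0.0] * dim
    return [sum(col) / len(found) for col in zip(*found)]
```

The dictionary returned by cooccurrence_vectors holds the word vectors
themselves, so the same data could be passed to a clustering routine to
group related words.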

Latent Semantic Analysis differs from the native SenseClusters methods in
that each feature is represented by a vector that shows the contexts in
which that feature occurs. Then, all the features in a context to be
clustered are replaced by their associated vectors, and these are
averaged together to represent the context. Note that you can also
cluster the feature vectors directly to identify sets of related features.
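
The feature-by-context representation can be sketched in the same way.
The Python below is a simplified illustration with hypothetical names, not
the SenseClusters implementation. It only builds the vectors and averages
them; LSA as usually described also reduces the dimensionality of the
matrix with Singular Value Decomposition, and that step is left out here.

```python
def feature_by_context(training_contexts, features):
    """Map each feature to a vector with one count per training context."""
    return {f: [ctx.count(f) for ctx in training_contexts] for f in features}


def represent(context, feature_vectors, dim):
    """Average the context vectors of the features found in the context."""
    found = [feature_vectors[t] for t in context if t in feature_vectors]
    if not found:
        return [0.0] * dim
    return [sum(col) / len(found) for col in zip(*found)]
```

Here dim is the number of training contexts, since each feature vector has
one entry per context in which the feature might occur.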

This release represents a major step forward in the functionality of
SenseClusters. Much of the work in providing LSA support was carried out by
Mahesh Joshi this spring and summer. As always over the last two
years, Anagha Kulkarni played a large role in this release and has
provided a wide range of improvements in automatic cluster stopping and
other areas.

Please give this a try, and let us know if you have any comments or
questions! If you aren't certain whether your problem can be approached
using SenseClusters, please let us know what you would like to do and
maybe we can help you get started.

Cordially,
Ted, Anagha, and Mahesh

====================================================================

ChangeLog:
http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.95.txt

Installation Instructions:
http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.95-INSTALL.txt

Related Publications (includes links to data you can use):
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

