[Corpora-List] SenseClusters v0.95 released (now supports LSA)

From: ted pedersen (tpederse@d.umn.edu)
Date: Sat Aug 26 2006 - 20:08:04 MET DST

    We are pleased to announce the release of SenseClusters version 0.95.

    SenseClusters is a freely available package that allows you to cluster
    similar contexts, or to identify clusters of related words. It is fully
    unsupervised, and can automatically discover the optimal number of
    clusters in your text.

    As of version 0.95, we now fully support Latent Semantic Analysis for
    context and word clustering, and we continue to improve the native
    SenseClusters methods, which include the ability to cluster first and
    second order representations of context.

    SenseClusters can be downloaded from:

            http://senseclusters.sourceforge.net/

    You can also try out SenseClusters via our web interface:

            http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

    In both native and LSA modes, SenseClusters relies on lexical features
    (such as unigrams, bigrams, and co-occurrences) that can be identified
    in raw text. Tokenization is flexible and is defined via Perl regular
    expressions, so you can work with many languages besides English, and
    with tokenization schemes other than white-space separated words, such
    as character-based tokens (2-letter sequences, for example).
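
    To make the idea of regex-defined tokenization concrete, here is a
    minimal sketch. It is written in Python rather than Perl and is not the
    SenseClusters interface; the function names and the two patterns are
    assumptions chosen only for illustration.

```python
import re

# Hypothetical token definitions, for illustration only.
WORD = r"\w+"        # runs of word characters (roughly white-space separated words)
TWO_CHARS = r"\w\w"  # non-overlapping 2-letter sequences


def tokenize(text, token_pattern):
    """Return every non-overlapping match of token_pattern in the lowercased text."""
    return re.findall(token_pattern, text.lower())


def bigrams(tokens):
    """Pair each token with the token that immediately follows it."""
    return list(zip(tokens, tokens[1:]))


words = tokenize("The bank raised rates", WORD)
pairs = tokenize("bank", TWO_CHARS)
```

    Here `words` holds four word tokens and `pairs` holds the two character
    tokens "ba" and "nk". Changing only the pattern changes what counts as
    a token, and features such as bigrams are then built on top of
    whichever tokens were chosen.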

    The native SenseClusters methods support traditional first order context
    clustering, where you identify a feature set, and then determine which of
    those features occur in the contexts you are clustering. The native
    methods also support second order context clustering, where each word
    is represented by a vector of the words with which it co-occurs.
    All the words in a context to be clustered are replaced by their
    associated vectors, and these vectors are averaged together to represent
    that context. Note that you can also cluster the word vectors to identify
    sets of related words.
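
    The two representations described above can be sketched roughly as
    follows. This is a toy illustration in Python, not SenseClusters code:
    the function names, the binary first order vector, and the use of raw
    co-occurrence counts within a context are all assumptions.

```python
def first_order(context, features):
    """First order: mark which of the chosen features occur in the context."""
    present = set(context)
    return [1 if f in present else 0 for f in features]


def cooccurrence_vectors(training_contexts, vocab):
    """Represent each word by counts of the vocab words it shares a context with."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for ctx in training_contexts:
        words = [w for w in ctx if w in index]
        for i, w in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[w][index[other]] += 1
    return vectors


def second_order(context, vectors, dim):
    """Second order: average the word vectors of the words in the context."""
    rows = [vectors[w] for w in context if w in vectors]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


vocab = ["bank", "loan", "money", "river", "water"]
training = [["bank", "money", "loan"], ["bank", "river", "water"]]
word_vectors = cooccurrence_vectors(training, vocab)

fo = first_order(["money", "loan"], vocab)                        # [0, 1, 1, 0, 0]
so = second_order(["money", "loan"], word_vectors, len(vocab))    # [1.0, 0.5, 0.5, 0.0, 0.0]
```

    Note that the second order vector has a non-zero entry for "bank" even
    though "bank" does not occur in the context itself: the context picks
    up information from the words that its own words co-occur with.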

    Latent Semantic Analysis differs from the native SenseClusters methods in
    that each feature is represented by a vector that shows the contexts in
    which that feature occurs. Then, all the features in a context to be
    clustered are replaced by their associated vectors, and these are
    averaged together to represent the context. Note that you can also
    cluster the feature vectors directly to identify sets of related features.
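
    For comparison, the LSA-style representation described above can be
    sketched in the same toy fashion. Again this is Python for
    illustration, not SenseClusters code, and the names are hypothetical.
    LSA normally also reduces the feature-by-context matrix with Singular
    Value Decomposition; that step is left out here to keep the sketch
    self-contained.

```python
def feature_by_context(contexts, features):
    """Represent each feature by how often it occurs in each context."""
    return {f: [ctx.count(f) for ctx in contexts] for f in features}


def lsa_context_vector(context, feature_vectors, dim):
    """Average the feature vectors of the features found in the context."""
    rows = [feature_vectors[t] for t in context if t in feature_vectors]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


features = ["bank", "loan", "money", "river", "water"]
contexts = [
    ["bank", "money", "loan"],
    ["bank", "river", "water"],
    ["money", "loan", "loan"],
]
feature_vectors = feature_by_context(contexts, features)

ctx_vec = lsa_context_vector(["money", "loan"], feature_vectors, len(contexts))
# ctx_vec == [1.0, 0.0, 1.5]
```

    The difference from the native second order method is in the
    dimensions of the vectors: here each dimension is a context, whereas in
    the native method each dimension is a co-occurring word.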

    This release represents a major step forward in the functionality of
    SenseClusters. Much of the work in providing LSA support was carried out
    by Mahesh Joshi this spring and summer. As always over the last two
    years, Anagha Kulkarni played a large role in this release, and has
    provided a wide range of improvements in automatic cluster stopping and
    other areas.

    Please give this a try, and let us know if you have any comments or
    questions! If you aren't certain whether your problem can be approached
    using SenseClusters, please let us know what you would like to do and
    maybe we can help you get started.

    Cordially,
    Ted, Anagha, and Mahesh

    ====================================================================

    ChangeLog:
    http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.95.txt

    Installation Instructions:
    http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.95-INSTALL.txt

    Related Publications (includes links to data you can use):
    http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

    --
    Ted Pedersen
    http://www.d.umn.edu/~tpederse
    


