[Corpora-List] New release: jTokeniser 1.2

From: Andy Roberts (andyr@comp.leeds.ac.uk)
Date: Thu Aug 04 2005 - 23:51:15 MET DST


Hi all,

I recall someone recently looking for sentence segmentation software,
so I thought I'd give a quick advertisement for jTokeniser...

I've just released jTokeniser 1.2. jTokeniser is an open-source Java
library that provides a simple framework for a variety of tokenisers.
There are currently six at your disposal:

  * WhiteSpaceTokeniser - this splits a string on all occurrences of
    whitespace, which includes spaces, newlines, tabs and linefeeds.

  * StringTokeniser - this is basically the same as Java's
    java.util.StringTokenizer with some extra methods (and extends from
    Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser;
    however, you can specify a set of characters that are to be used as
    word delimiters.

  * RegexTokeniser - this tokeniser is much more flexible as you can use
    regular expressions to define what a token is. So, "\\w+" means
    that whenever it matches one or more word characters (letters,
    digits or underscores), it will consider that a word. By default, it
    uses a regular expression equivalent to a whitespace tokeniser.

  * RegexSeparatorTokeniser - this can be thought of as an advanced
    StringTokeniser. Whereas StringTokeniser is limited to defining
    delimiters as a set of individual characters, RegexSeparatorTokeniser
    can utilise regular expressions for a richer and more flexible
    approach.

  * BreakIteratorTokeniser - one of the most sophisticated of the lot,
    although it should only be used on natural language strings to
    isolate words. It comes with built-in rules about how to find words,
    knowing how to disregard punctuation, etc.

  * SentenceTokeniser - this also uses a BreakIterator like the above,
    but tuned towards finding sentence boundaries. The "tokens" in this
    tokeniser are in fact individual sentences.
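
To make the differences between these approaches concrete, here is a
minimal sketch that uses only standard JDK classes
(java.util.StringTokenizer, java.util.regex and java.text.BreakIterator).
It is not jTokeniser's API: the class name TokeniseSketch and its method
names are hypothetical, chosen for illustration, and the library's own
classes may behave differently in detail. Filtering out BreakIterator
spans that contain no letter or digit is also an assumption made here to
drop punctuation and whitespace.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokeniseSketch {

    // Delimiter-set approach: every character in delims acts as a separator.
    static List<String> byDelimiters(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Token-pattern approach: the regex describes what a token looks like.
    static List<String> byTokenPattern(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Separator-pattern approach: the regex describes what lies between tokens.
    static List<String> bySeparatorPattern(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(regex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // BreakIterator approach: walk the boundaries the iterator reports and
    // keep only spans containing at least one letter or digit (assumption).
    static List<String> byBreakIterator(BreakIterator it, String text) {
        List<String> tokens = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.chars().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world! This is jTokeniser.";
        System.out.println(byDelimiters(text, " ,.!"));
        System.out.println(byTokenPattern(text, "\\w+"));
        System.out.println(bySeparatorPattern(text, "[\\s,.!]+"));
        System.out.println(byBreakIterator(BreakIterator.getWordInstance(Locale.ENGLISH), text));
        System.out.println(byBreakIterator(BreakIterator.getSentenceInstance(Locale.ENGLISH), text));
    }
}
```

The first three methods differ only in how the boundary is specified (a
set of characters, a pattern for the token, or a pattern for the
separator), while the last delegates the boundary rules to the JDK's
locale-aware BreakIterator, for either words or sentences.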

Now, this is just a library at the moment, so you need to be a Java
programmer to use these tokenisers. Fortunately, they all follow the
same simple framework; the docs and sample code will make it clearer.
I do intend to create a GUI front-end to this library in the future, so
that the tokenisers can be used in a stand-alone application by people
who are not Java programmers.

Full information is available at the jTokeniser homepage:
http://www.comp.leeds.ac.uk/andyr/software/jTokeniser/

Suggestions, comments and complaints welcome. :)

Regards,
Andy Roberts

-- 
Computer Vision and Language Research Group
School of Computing
University of Leeds,
Leeds, UK, LS2 9JT
http://www.comp.leeds.ac.uk/andyr

