[Corpora-List] New release: jTokeniser 1.2

From: Andy Roberts (andyr@comp.leeds.ac.uk)
Date: Thu Aug 04 2005 - 23:51:15 MET DST

    Hi all,

    Since I recall someone recently looking for sentence segmentation
    software, I thought I'd give a quick advertisement for jTokeniser...

    I've just released jTokeniser 1.2. jTokeniser is an open-source Java
    library that provides a simple framework for a variety of tokenisers.
    There are currently six at your disposal:

      * WhiteSpaceTokeniser - this splits a string on all occurrences of
        whitespace, which includes spaces, newlines, tabs and linefeeds.

      * StringTokeniser - this is basically the same as Java's
        java.util.StringTokenizer with some extra methods (and it extends
        Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser;
        however, you can specify a set of characters to be used as word
        delimiters.

      * RegexTokeniser - this tokeniser is much more flexible as you can use
        regular expressions to define what a token is. So, "\\w+" means
        that whenever it matches one or more word characters (letters,
        digits or underscores), it will consider that a word. By default, it
        uses a regular expression equivalent to a whitespace tokeniser.

      * RegexSeparatorTokeniser - this can be thought of as an advanced
        StringTokeniser. Whereas StringTokeniser is limited to defining
        delimiters as a set of individual characters, RegexSeparatorTokeniser
        can utilise regular expressions for a richer and more flexible
        approach.

      * BreakIteratorTokeniser - one of the most sophisticated of the lot,
        although it should only be used on natural language strings to
        isolate words. It comes with built-in rules about how to find words,
        knowing how to disregard punctuation, etc.

      * SentenceTokeniser - this also uses a BreakIterator like the above,
        but tuned towards finding sentence boundaries. The "tokens" in this
        tokeniser are in fact individual sentences.
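
    To make the distinctions above concrete, here is a rough sketch of the
    underlying techniques using only standard JDK classes (java.util.regex
    and java.text.BreakIterator). Note that this is NOT jTokeniser's API:
    the class name TokeniserSketch and all of its method names are
    hypothetical, chosen purely for illustration. It shows a regex that
    defines tokens, a regex that defines separators, and BreakIterator used
    for words and for sentences.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of jTokeniser.
public class TokeniserSketch {

    // Regex defines the TOKENS: every match of the pattern is a token.
    static List<String> regexTokens(String text, String tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenPattern).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Regex defines the SEPARATORS: tokens are the text between matches.
    static List<String> separatorTokens(String text, String separatorPattern) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(separatorPattern)) {
            if (!piece.isEmpty()) {   // skip the empty piece left by a leading separator
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Collect the non-blank segments between consecutive boundaries.
    private static List<String> segments(BreakIterator it, String text, boolean wordsOnly) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (s.isEmpty()) {
                continue;
            }
            // For words, drop segments that are only punctuation.
            if (wordsOnly && !Character.isLetterOrDigit(s.charAt(0))) {
                continue;
            }
            out.add(s);
        }
        return out;
    }

    // Word boundaries according to the JDK's built-in rules.
    static List<String> words(String text) {
        return segments(BreakIterator.getWordInstance(Locale.ENGLISH), text, true);
    }

    // Sentence boundaries according to the JDK's built-in rules.
    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text, false);
    }

    public static void main(String[] args) {
        System.out.println(regexTokens("Hello, world!", "\\w+"));
        System.out.println(separatorTokens("a, b;c", "[,;\\s]+"));
        System.out.println(words("Hello, world!"));
        System.out.println(sentences("This is one. This is two."));
    }
}
```

    The token-pattern and separator-pattern methods are two views of the
    same job: with "\\w+" you say what a word looks like, whereas with
    "[,;\\s]+" you say what lies between words.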

    Now, this is just a library at the moment, so you need to be a Java
    programmer to utilise these tokenisers. Fortunately, they all follow
    the same simple framework. The docs and sample code will make it
    clearer. I do intend to create a GUI front-end to this library in the
    future, so that the tokenisers can be used in a stand-alone
    application by users who are not Java programmers.

    Full information is available at the jTokeniser homepage:
    http://www.comp.leeds.ac.uk/andyr/software/jTokeniser/

    Suggestions, comments and complaints welcome. :)

    Regards,
    Andy Roberts

    -- 
    Computer Vision and Language Research Group
    School of Computing
    University of Leeds,
    Leeds, UK, LS2 9JT
    http://www.comp.leeds.ac.uk/andyr
    


