RE: [Corpora-List] sentence boundary detectors

From: Victor Kapustin (victor.kapustin@gmail.com)
Date: Sun Feb 18 2007 - 14:59:56 MET


    Armin,

    > I was wondering if you could point me to good sentence splitters
    > for the following languages: German, Russian

    For Russian:

    http://aot.ru/download/graphan.tar.gz (C++ source; a DLL is included in
    http://aot.ru/download/shortrml.zip).

    For most purposes I use a regexp (in JavaScript; conversion to Perl/Python
    is straightforward):

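    // Characters that may open a sentence before its first capital letter.
    var _DELIMS_OPEN_RAW_ = '(["</' ;
    // Backslash-escape each one for use inside a character class.
    var _DELIMS_OPEN_ = '\\'+_DELIMS_OPEN_RAW_.split('').join('\\') ;
    // Boundary: run of . ! ? plus whitespace, followed by a capital letter.
    var sentenceSplitter = new RegExp(
    '(?:\\.|\\!|\\?)+\\s+(?=['+_DELIMS_OPEN_+']?[А-ЯЁA-Z])' ) ;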
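
    The regexp above only describes a boundary. Passing it straight to
    split() consumes the terminating punctuation along with the whitespace.
    A minimal sketch of one way to keep the punctuation attached to each
    sentence follows. The function name splitSentences and the use of
    matchAll() are my own assumptions for illustration, not part of the
    original recipe; the pattern itself is unchanged apart from the added
    'g' flag, which matchAll() requires.

```javascript
// Opening delimiters that may precede the first capital letter of a sentence.
const DELIMS_OPEN_RAW = '(["</';
// Backslash-escape each delimiter so it is literal inside a character class.
const DELIMS_OPEN = DELIMS_OPEN_RAW.split('').map(c => '\\' + c).join('');

// Same boundary pattern as above; the 'g' flag is required by matchAll().
const boundary = new RegExp(
  '(?:\\.|\\!|\\?)+\\s+(?=[' + DELIMS_OPEN + ']?[А-ЯЁA-Z])', 'g');

// Hypothetical helper: returns the sentences with their end punctuation kept.
function splitSentences(text) {
  const sentences = [];
  let start = 0;
  for (const m of text.matchAll(boundary)) {
    // The boundary match covers the punctuation and the whitespace after it.
    const end = m.index + m[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  // Whatever follows the last boundary is the final sentence.
  const tail = text.slice(start).trim();
  if (tail) sentences.push(tail);
  return sentences;
}

console.log(splitSentences('Hello world. This is a test! "Quoted" start? yes. Привет.'));
// → [ 'Hello world.', 'This is a test!', '"Quoted" start? yes.', 'Привет.' ]
```

    Note that "? yes" is not treated as a boundary because the next word
    starts with a lowercase letter. The reverse also holds: any abbreviation
    ending in a period and followed by a capitalized word (for example
    "Dr. Smith") is split, so this is a heuristic rather than a full
    sentence boundary detector.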

    --
    Victor Kapustin
    


