[Corpora-List] SUMMARY: sentence boundary detectors

From: Armin Schmidt (armin.sch@gmail.com)
Date: Fri Mar 02 2007 - 22:40:43 MET


    Dear all,

    thank you for all the helpful responses. I was preparing several
    parallel corpora for a machine translation task between German,
    Russian, English, and Spanish. To get good results from sentence
    alignment, I was looking for a sentence splitter that would perform
    equally well on all the data sets and, where it did make errors, would
    make the same or similar ones in every language. Also, I didn't have
    any lists of abbreviations.

    I received a particularly nice response from Jan Strunk, who kindly
    provided a preliminary implementation of his system 'Punkt'
    (http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf).
    'Punkt' learns abbreviations and sentence boundaries in a
    language-independent, unsupervised manner.
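
    To give a feel for what "unsupervised" means here, the following Python
    sketch collects abbreviation candidates from raw text alone. It is a
    deliberately simplified heuristic and not the Punkt algorithm, which is
    described in the paper linked above; this sketch only checks that a
    short token type always occurs with a trailing period. The function
    name learn_abbreviations and the thresholds min_count and max_len are
    assumptions made for illustration.

```python
from collections import Counter


def learn_abbreviations(text, min_count=2, max_len=4):
    """Return lowercase token types that look like abbreviations.

    Simplified heuristic (an assumption for illustration, NOT Punkt):
    a type is a candidate if it is short, occurs at least `min_count`
    times with a trailing period, and never occurs without one.
    """
    with_period = Counter()
    without_period = Counter()
    for token in text.split():
        # Drop surrounding punctuation, but keep a final period.
        token = token.strip("()[]\"',;:!?")
        if not token:
            continue
        if token.endswith("."):
            core = token[:-1].lower()
            if core:
                with_period[core] += 1
        else:
            without_period[token.lower()] += 1
    return {
        t
        for t, n in with_period.items()
        if n >= min_count
        and without_period[t] == 0
        and len(t) <= max_len
        and t.isalpha()
    }
```

    Because it relies only on counts, the same function can be run on each
    language's corpus separately. Words that happen to occur only at the end
    of a sentence would be misclassified, so the thresholds would need
    tuning on real data.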

    Links to similar tools for one or several languages were of great help,
    too. They are:

    Russian:
    http://aot.ru/download/graphan.tar.gz (source in C++; a DLL is included in
    http://aot.ru/download/shortrml.zip).

    German, Russian, English:
    http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
    (fast, rule-based).

    English:
    http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
    (rule-based, Java)

    Language-independent:
    Tools of the SRI LM toolkit: http://www.speech.sri.com/projects/srilm/

    Needs to be provided with a set of abbreviations for a particular language:
    http://www.pojkfilmsklubben.org/mickel/code/python/SentenceSplitter.py
    (Python)

    For Perl, there are several modules available on http://www.cpan.org/
    which can be extended to languages other than those they ship with. E.g.
    for Russian, use Lingua::EN::Sentence and add an acronym list:
    add_acronyms('тел','т','г','млрд','млн','тыс','др','р','кг','л','см','пп','им','ст','муж','жен','ул','пр','кв','ч','п','д','с','стр');
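
    The same approach, a rule-based splitter that is handed an abbreviation
    list, can be sketched in Python. This is a minimal illustration under
    stated assumptions, not a port of SentenceSplitter.py or of the Perl
    module; the function name split_sentences and its abbreviations
    parameter are hypothetical.

```python
import re

# Candidate boundary: sentence-final punctuation followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]+\s+")
# The word immediately before a candidate boundary.
_LAST_WORD = re.compile(r"(\w+)$")


def split_sentences(text, abbreviations=()):
    """Split text at ., ! or ? unless a single period follows a known abbreviation."""
    known = {a.lower() for a in abbreviations}
    sentences = []
    start = 0
    for m in _BOUNDARY.finditer(text):
        if m.group().strip() == ".":
            word = _LAST_WORD.search(text[start:m.start()])
            if word and word.group(1).lower() in known:
                continue  # the period belongs to an abbreviation
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


russian = "Он живет на ул. Ленина. Дом стоит 5 млн. рублей."
print(split_sentences(russian, ["ул", "млн"]))
# ['Он живет на ул. Ленина.', 'Дом стоит 5 млн. рублей.']
```

    One limitation of this sketch: an abbreviation that really does end a
    sentence is never treated as a boundary.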

    Thanks again & best regards,
    Armin

    Armin Schmidt wrote:
    > Dear list,
    >
    > I was wondering if you could point me to good sentence splitters for the
    > following languages: German, Russian, Spanish, English. It would be
    > great if they were stand-alone programs or modules for Python (Perl
    > would be ok, too ... although I'm already aware of the respective
    > CPAN-modules for English and German).
    >
    > Since I do have corpora in all the above-mentioned languages I would
    > also be very interested in available implementations (not papers) of any
    > unsupervised learning methods for detecting sentence boundaries (or
    > rather abbreviations).
    >
    > Thanks,
    > Armin

    -- 
    http://diotavelli.net/people/armin/
    


