RE: [Corpora-List] sentence boundary detectors

From: Nino Simunic (nino.simunic@uni-due.de)
Date: Wed Feb 28 2007 - 12:09:47 MET

    Dear Armin,

    take a look at the "Punkt" system. It performs unsupervised multilingual
    sentence boundary detection, was tested on eleven languages, and
    achieved pretty good scores:

    Tibor Kiss, Jan Strunk. 2006. Unsupervised Multilingual Sentence Boundary
    Detection. In: Computational Linguistics 32 (4). Cambridge: MIT Press.
    485-525.
    PDF:
    http://www.linguistics.ruhr-uni-bochum.de/~kiss/publications/compling2005_KS27.01final.pdf

    Their current implementation is written in Perl, as far as I know.
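
    To give a rough idea of the unsupervised approach, here is a minimal
    Python sketch of abbreviation detection followed by splitting. It is not
    the Punkt algorithm from the paper above, just a crude frequency
    heuristic: a short token counts as an abbreviation if it shows up
    several times with a trailing period and never without one. The function
    names and the min_count and max_len thresholds are made up for
    illustration.

```python
from collections import Counter

# Characters stripped from both ends of a token before inspecting it.
_WRAPPERS = "\"'()[],;:"


def _core(token):
    """Return the token without surrounding quotes, brackets, or commas."""
    return token.strip(_WRAPPERS)


def learn_abbreviations(text, min_count=2, max_len=4):
    """Guess abbreviations from raw text (hypothetical heuristic, not Punkt).

    A word is accepted if it occurs at least min_count times with a trailing
    period, never occurs without one, and is at most max_len characters long.
    """
    with_period = Counter()
    without_period = Counter()
    for token in text.split():
        token = _core(token)
        if not token:
            continue
        if token.endswith("."):
            word = token[:-1].lower()
            if word:
                with_period[word] += 1
        else:
            without_period[token.lower()] += 1

    abbreviations = set()
    for word, count in with_period.items():
        # Allow internal periods ("e.g") but require letters otherwise.
        letters_only = word.replace(".", "").isalpha()
        if (count >= min_count and len(word) <= max_len
                and letters_only and without_period[word] == 0):
            abbreviations.add(word)
    return abbreviations


def split_sentences(text, abbreviations):
    """Split at '.', '!' or '?' unless the period belongs to an abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        core = _core(token)
        if core.endswith(("!", "?")):
            boundary = True
        elif core.endswith("."):
            boundary = core[:-1].lower() not in abbreviations
        else:
            boundary = False
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

    You would call learn_abbreviations() on a large sample of the target
    corpus and pass the resulting set to split_sentences(). On small samples
    the heuristic will misfire on short sentence-final words, which is the
    kind of case the statistical tests in the paper are meant to handle.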

    Bye,
    Nino

    http://www.uni-due.de/computerlinguistik/simunic.shtml

    >>-----Original Message-----
    >>From: owner-corpora@lists.uib.no
    >>[mailto:owner-corpora@lists.uib.no] On Behalf Of Armin Schmidt
    >>Sent: Tuesday, February 20, 2007 6:21 PM
    >>To: Joel Tetreault
    >>Cc: corpora@uib.no
    >>Subject: Re: [Corpora-List] sentence boundary detectors
    >>
    >>
    >>Joel,
    >>
    >>thanks. Unfortunately, many of the links on your page are
    >>indeed dead. But I'll post a summary of all the responses I
    >>got so far to the list, so you can update your link list, too.
    >>
    >>Of course, I searched the archives (and the web) before
    >>posting to the corpora list, but the responses to those earlier
    >>posts were of only limited use for my particular task. Also,
    >>I wanted to find out if, in the meantime, sentence splitters
    >>had been developed which could be trained on particular
    >>corpora in a language-independent manner (more on this in my
    >>summary).
    >>
    >>Cheers,
    >>Armin
    >>
    >>Joel Tetreault schrieb:
    >>>
    >>> hi Armin, if you scroll way down to the "Tools" section of my website,
    >>> and then scroll down to the "Sentence Splitters" subsection, you
    >>> should find links to several splitters.
    >>>
    >>> http://www.cs.rochester.edu/u/tetreaul/academic.html
    >>>
    >>> (Please excuse the fact that I threw all these links up on one page :) )
    >>>
    >>> Your question was posed to corpora-list 3 or 4 years ago, so all the
    >>> links above (including an updated link to Scott Piao's Java one) are
    >>> from other researchers emailing in with their suggestions. I just ran
    >>> through the links, and since it has been several years, a bunch are
    >>> dead. But if you google the names of the splitters or their authors,
    >>> you can probably find their new locations.
    >>>
    >>> I'd also check out the corpora-list archives:
    >>>
    >>> http://listserv.linguistlist.org/cgi-bin/wa?S1=corpora
    >>>
    >>> there might be some emails/links that I missed...
    >>>
    >>> Joel
    >>>
    >>>
    >>> On Mon, 19 Feb 2007, Scott Songlin Piao wrote:
    >>>
    >>>> Hi Armin,
    >>>>
    >>>> I put my English sentence splitter on the website:
    >>>> http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
    >>>>
    >>>> It is a rule-based Java program and is downloadable.
    >>>>
    >>>> Cheers
    >>>>
    >>>> Scott Piao
    >>>> ----------------------------
    >>>> Text Mining
    >>>> School of Computer Science
    >>>> The University of Manchester
    >>>> UK
    >>>>
    >>>>
    >>>>
    >>>>
    >>>> -----Original Message-----
    >>>> From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]
    >>>> On Behalf Of Armin Schmidt
    >>>> Sent: 17 February 2007 19:48
    >>>> To: corpora@uib.no
    >>>> Subject: [Corpora-List] sentence boundary detectors
    >>>>
    >>>> Dear list,
    >>>>
    >>>> I was wondering if you could point me to good sentence splitters for
    >>>> the following languages: German, Russian, Spanish, English. It would
    >>>> be great if they were stand-alone programs or modules for Python
    >>>> (Perl would be ok, too ... although I'm already aware of the
    >>>> respective CPAN-modules for English and German).
    >>>>
    >>>> Since I do have corpora in all the above mentioned languages I would
    >>>> also be very interested in available implementations (not papers) of
    >>>> any unsupervised learning methods for detecting sentence boundaries
    >>>> (or rather abbreviations).
    >>>>
    >>>> Thanks,
    >>>> Armin
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>
    >>
    >>--
    >>http://diotavelli.net/people/armin/
    >>
    >>



    This archive was generated by hypermail 2b29 : Wed Feb 28 2007 - 12:21:44 MET