Corpora: Sentence splitting (summary)

Tony Rose (tgr@cre.canon.co.uk)
Mon, 26 Oct 1998 11:50:12 +0000 (GMT)

Many thanks to all those who responded to my original query on
sentence splitting, particularly those who took the trouble to send
algorithms/regexes (regular expressions) and code samples.

Evidently, the problem of sentence splitting (like many others in NLP)
can be solved to varying degrees by a range of techniques, from the
simple and compact to the sophisticated and language/locale-dependent.
It's down to the individual to decide where the optimal trade-off lies
for their own application.

A couple of people asked me what our application was: the answer is
we need reliable sentence splitting routines to accurately index the
pages in Canon's web space. These pages are then made accessible
through the search engine located at: http://csweb.cre.canon.co.uk/

A few others asked if we plan to make any of our code or results
publicly available. The answer is yes, wherever possible! In fact,
we have already released a number of Perl modules for a variety
of purposes. You can read more about these at:

http://www.cre.canon.co.uk/perl/

Attached below is a summary of the (many) contributions.

Regards,
Tony Rose
_______________________________________________________________________
Dr TG Rose Speech and Language Group Canon Research Centre Europe Ltd
Occam Road, Surrey Research Park, Guildford, Surrey, UK GU2 5YJ
email: tgr@cre.canon.co.uk tel: +44 1483 448807 fax: +44 1483 448845
_______________________________________________________________________

1. From David S. Day:
We include a sentence tagger, called mini-sent, in the "full" Solaris
distribution of our corpus development tool, Alembic Workbench. You can
obtain this code by downloading the full version from our external web
site, www.mitre.org/technology/nlp. (Only the binary is distributed
right now, but if you wish, we could probably send you the source code.)
The C program was generated using flex, a pattern-matching language
preprocessor that compiles into C.
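As a baseline for comparison, a pattern-based splitter can be written in
a few lines. The Perl sketch below is my own illustration, not mini-sent
itself, and it misfires on exactly the abbreviation cases that fuller
tools like mini-sent are built to handle:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive splitter: break after . ! or ? when followed by whitespace
    # and a capital letter. Note the misfire on "Dr. Smith" below.
    sub naive_split {
        my ($text) = @_;
        return split /(?<=[.!?])\s+(?=[A-Z])/, $text;
    }

    # Prints "Dr." and "Smith was ill." as separate "sentences".
    print "$_\n" for naive_split("It works. Dr. Smith was ill.");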

Also, David Palmer, now of MITRE, has developed a neural network
algorithm for performing sentence tagging, and has written up these
results with his co-author, Marti Hearst, in a recent Computational
Linguistics journal article. In this and other articles he describes
the data sets that he used to train and test his algorithm, and compares
its performance to that of other systems on this task. You may
want to write to him for more information (palmer@mitre.org).

2. From Philip Resnik:
Two packages that you may find useful are Adwait Ratnaparkhi's
MXTERMINATOR Sentence Boundary Detector, available at

http://www.cis.upenn.edu/~adwait/statnlp.html

and David Palmer's SATZ, available at

http://galaxy.cs.berkeley.edu/src/satz/

3. From pvozila@lhs.com:
Adwait Ratnaparkhi has done work on identifying sentence boundaries using a
maximum entropy model. I can't recall offhand whether he made the code
available, but he may have, and the references may be useful as well. The paper is
available at:
http://www.cis.upenn.edu/~adwait/statnlp.html
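To give a flavour of the approach: a statistical model like this
classifies each candidate boundary using features of its context. The
Perl sketch below is my own illustration of the kind of features
involved (the word before the stop, the word after, and its
capitalisation), not Ratnaparkhi's actual feature set or code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    # For each full stop, collect simple contextual features of the
    # sort a statistical boundary classifier could be trained on.
    sub candidate_features {
        my ($text) = @_;
        my @candidates;
        while ($text =~ /(\S+)\.\s+(\S+)/g) {
            my ($left, $right) = ($1, $2);
            push @candidates, {
                word_before  => $left,
                word_after   => $right,
                after_upper  => ($right =~ /^[A-Z]/ ? 1 : 0),
                before_short => (length($left) <= 3  ? 1 : 0),
            };
        }
        return @candidates;
    }

    # One candidate per full stop: "Mr." (an abbreviation) and "ill."
    # (a true boundary); a trained model weighs such features.
    print Dumper(candidate_features("Mr. Smith was ill. He is fine."));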

4. From Lluís Padró:
We deal with this problem in the following way:

1. We separate all stops and punctuation marks

Ex: "Dr. Smith was ill yesterday."
is converted to:

Dr
.
Smith
was
ill
yesterday
.

2. Then, a module that recognizes multiword compounds
re-joins tokens that are likely to form a single unit:

Dr._Smith
was
ill
yesterday
.

3. The full stops (and question/exclamation marks, etc.) that remain
are considered sentence boundaries.

Obviously this method is rather heuristic and won't work in all cases.
Its performance will depend mainly on the accuracy of the multiword
recognizer. Nevertheless, simple heuristics such as joining all
consecutive capitalized words provide reasonable results. It can be
improved with lists of proper nouns (smith, john, ...), lists of
abbreviations (dr, km, cm, mrs, ...), or any other list you like.
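To make the heuristic concrete, here is a minimal Perl sketch of this
kind of pipeline. The abbreviation list is a toy one of my own, and this
is an illustration of the idea rather than Lluís Padró's actual code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy abbreviation list; a real system would use a much larger one.
    my %abbrev = map { $_ => 1 } qw(dr mr mrs km cm);

    sub split_sentences {
        my ($text) = @_;

        # Step 1: separate stops and punctuation marks into tokens.
        my @tokens = grep { defined && length }
                     split /([.!?])|\s+/, $text;

        # Step 2: re-join an abbreviation with its stop ("Dr" + "."
        # becomes "Dr.") so it no longer looks like a boundary.
        my @joined;
        while (@tokens) {
            my $tok = shift @tokens;
            if (@tokens && $tokens[0] eq '.' && $abbrev{lc $tok}) {
                $tok .= shift @tokens;
            }
            push @joined, $tok;
        }

        # Step 3: remaining stops and question/exclamation marks are
        # treated as sentence boundaries.
        my (@sentences, @current);
        for my $tok (@joined) {
            push @current, $tok;
            if ($tok =~ /\A[.!?]\z/) {
                push @sentences, join ' ', @current;
                @current = ();
            }
        }
        push @sentences, join ' ', @current if @current;
        return @sentences;
    }

    # Tokens stay space-separated, so each stop prints as its own token:
    #   Dr. Smith was ill yesterday .
    #   He is better now .
    print "$_\n"
        for split_sentences("Dr. Smith was ill yesterday. He is better now.");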