Corpora: Sentence splitting

Tony Rose (tgr@cre.canon.co.uk)
Fri, 16 Oct 1998 14:18:47 +0100 (BST)

Does anyone have any experience of developing simple algorithms or
regular expressions for detecting sentence boundaries in English text?
The naive solution would be simply to look for full stops (periods)
followed by whitespace, but this fails on strings such as "Dr. Smith".
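
To make the failure concrete, here is a minimal Python sketch of that naive approach. The function name naive_split and the sample text are illustrative assumptions, not taken from any existing package.

```python
import re

def naive_split(text):
    # Break the text at any run of whitespace that follows a full stop.
    return re.split(r"(?<=\.)\s+", text.strip())

sample = "Dr. Smith arrived late. The meeting had already started."
for piece in naive_split(sample):
    print(repr(piece))
```

The title "Dr." comes back as a sentence of its own, which is exactly the problem described above.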

Indeed, the problem is common to so many NLP applications that someone
must surely have worked on it and packaged the result as a reusable code
'module' to save others the trouble. Yet if you examine the code of a
great many NLP applications, you find that people typically develop
their own solution each time.

So, to start the ball rolling, here's a Perl regular expression
for detecting sentences, suggested by one of my colleagues:

/
  (
    .+?         # match (non-greedy) anything ...
    [.!?]       # ... followed by any one of !?.
    [")]?       # ... and optionally " or )
  )
  (?=           # with lookahead that it is followed by ...
    (?:         # either ...
      \s+       # some whitespace ...
      ["(]?     # maybe a " or ( ...
      [A-Z]     # and capital letter
    |           # or ...
      \s*$      # optional whitespace, followed by end of string
    )
  )
/gx;
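
For readers who want to try the expression outside Perl, the following is a rough Python port, offered as a sketch rather than a finished solution. The pattern is meant to mirror the one above. Everything else is an assumption added for illustration: the names regex_split and split_sentences, the small ABBREVIATIONS set, and the merge step that glues a piece ending in a known abbreviation onto the piece that follows it.

```python
import re

# Port of the Perl expression above; re.VERBOSE plays the role of /x.
SENTENCE = re.compile(r"""
    (
        .+?         # match (non-greedy) anything ...
        [.!?]       # ... followed by any one of . ! ?
        [")]?       # ... and optionally " or )
    )
    (?=             # with lookahead that it is followed by ...
        (?:
            \s+     # some whitespace ...
            ["(]?   # maybe a " or ( ...
            [A-Z]   # and a capital letter
          |
            \s*$    # optional whitespace, then end of string
        )
    )
""", re.VERBOSE)

# Illustrative only: a real list would need to be far longer.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def regex_split(text):
    # Equivalent of collecting every match of the /g pattern.
    return [m.group(1).strip() for m in SENTENCE.finditer(text)]

def split_sentences(text):
    # Assumed extra step: if a piece ends in a known abbreviation,
    # hold it back and join it to the next piece with a single space.
    sentences, carry = [], ""
    for piece in regex_split(text):
        candidate = f"{carry} {piece}".strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

sample = 'He said "Stop!" Then he left. Dr. Smith agreed.'
print(regex_split(sample))
print(split_sentences(sample))
```

On this sample the expression alone still breaks after "Dr.", because the title is followed by whitespace and a capital letter. The merge step repairs that case, at the cost of wrongly joining any sentence that genuinely ends in one of the listed abbreviations.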

Can anyone suggest a better algorithm/solution? It doesn't have to be
in Perl or any other particular language: pseudocode will do fine.
Also, does anyone know of any established test sets for evaluating
such algorithms? If people want to reply directly to me then I'll
summarise to the list.

(NB - I plan also to submit this question to a Perl mailing list, but
right now the experiences of the corpora community are of greater
interest to me.)

Thanks,
Tony Rose
_______________________________________________________________________
Dr TG Rose Speech and Language Group Canon Research Centre Europe Ltd
Occam Road, Surrey Research Park, Guildford, Surrey, UK GU2 5YJ
email: tgr@cre.canon.co.uk tel: +44 1483 448807 fax: +44 1483 448845
_______________________________________________________________________