Re: Corpora: Sentence splitting

Mike Scott (lexical@ndirect.co.uk)
Mon, 19 Oct 1998 11:24:40 +0100

Tony, hi
This is from the WordSmith Tools help file (section in Viewer Tool):
*********
When is a sentence not a sentence?
There is no perfect mechanical way of determining sentence-breaks. For example,
a heading may well have no final full stop but would normally not be considered
part of the sentence which follows it. And a sentence may often have no final
full stop, if what follows it is a list of items.
The algorithm used by Viewer is: a sentence ends if a full-stop, question-mark
or exclamation-mark (.?!) is immediately followed by one or more word
separators and if the next non-punctuation symbol is a capital letter A..Z or
an accented capital letter, a number or a currency symbol. The same routine is
used in WordList, though WordList attempts to distinguish between sentences and
headings, so numbers of sentences in the two Tools are not likely to match.
Consider this chunk from A Tale of Two Cities:
"Wo-ho!" said the coachman. "So, then! One more pull and you're at the top and
be damned to you, for I have had trouble enough to get you to it! - Joe!"
Viewer will mistakenly consider /- Joe!/ as a separate sentence, but handles
/"Wo-ho!" said the coachman./ as one: though the program would split it in two
if the word after /ho!/ had a capital letter (e.g. in /Wild Bill, the
coachman, said./)
Viewer cannot therefore be expected to handle all sentence boundaries exactly
as you would. (/I saw Mr. Smith./ would be considered two sentences; several
headings may be bundled together as one sentence.)
***************
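
The rule quoted above can be sketched in a few lines of Python. This is only an
illustration of the rule as the help text words it, not the WordSmith code
itself; the names split_sentences, TERMINATORS and CLOSERS are invented for the
example, and the set of closing quotes and brackets allowed straight after .?!
is an assumption.

```python
import unicodedata

TERMINATORS = ".?!"
# Assumed set of closing quotes/brackets allowed right after a terminator.
CLOSERS = "\"')]}\u201d\u2019"


def is_sentence_start(ch):
    """Capital letter (accented or not), digit, or currency symbol."""
    return ch.isupper() or ch.isdigit() or unicodedata.category(ch) == "Sc"


def is_punct_or_space(ch):
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def split_sentences(text):
    sentences = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] in TERMINATORS:
            j = i + 1
            # Skip closing quotes or brackets attached to the terminator.
            while j < n and text[j] in CLOSERS:
                j += 1
            # Require at least one word separator (whitespace here).
            k = j
            while k < n and text[k].isspace():
                k += 1
            if k > j:
                # Look past any further punctuation to the next real symbol.
                m = k
                while m < n and is_punct_or_space(text[m]):
                    m += 1
                if m < n and is_sentence_start(text[m]):
                    sentences.append(text[start:j].strip())
                    start = k
                    i = k
                    continue
        i += 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("I saw Mr. Smith."))  # ['I saw Mr.', 'Smith.']
```

Run on the Dickens chunk, this sketch makes the same mistakes the help text
describes: /- Joe!/ comes out as its own sentence, and /"Wo-ho!" said the
coachman./ stays in one piece because /said/ is lower case.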
I decided ages ago not to use language-specific algorithms, and hence have not
attempted to detect a common set of abbreviations such as Dr./Mrs. etc. It
would be possible to add a search for such strings to the code, of course.
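
A minimal sketch of what such a search might look like, in Python: a
post-processing pass that re-joins sentences wrongly split after a known
abbreviation. The ABBREVIATIONS set and the function name are hypothetical and
English-only, which is exactly the language-specific step I chose not to take.

```python
# Hypothetical, English-only abbreviation list; not part of WordSmith.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}


def merge_abbreviation_splits(sentences):
    """Re-join any sentence that ends in a listed abbreviation with the next one."""
    merged = []
    for sentence in sentences:
        last_words = merged[-1].split() if merged else []
        if last_words and last_words[-1] in ABBREVIATIONS:
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


print(merge_abbreviation_splits(["I saw Mr.", "Smith."]))  # ['I saw Mr. Smith.']
```

The obvious cost is that a sentence which really does end in one of these
abbreviations will be glued to the one after it.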
The punctuation allowed after .?! must include various kinds of brackets and
apostrophes.
You should also treat ... as different from . for sentence counting.
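
One simple way to tell the two apart, sketched in Python: treat a dot as a
plain full stop only if it has no dot on either side. The function name is
invented for the example, and what to do with an ellipsis once it has been
recognised is left open here.

```python
def is_lone_full_stop(text, i):
    """True if text[i] is a '.' that is not part of a run of dots."""
    if text[i] != ".":
        return False
    dot_before = i > 0 and text[i - 1] == "."
    dot_after = i + 1 < len(text) and text[i + 1] == "."
    return not (dot_before or dot_after)


sample = "Well... I saw him. Yes."
stops = [i for i in range(len(sample)) if is_lone_full_stop(sample, i)]
print(stops)  # [17, 22]
```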

All the best -- Mike

*************************************************
Mike Scott
AELSU, English
Univ. of Liverpool
Liverpool L69 3BX
Mike.Scott@liv.ac.uk
http://www.liv.ac.uk/~ms2928/homepage.html
WordSmith:
http://www.liv.ac.uk/~ms2928/wordsmith/index.htm (Liverpool)
http://www.ndirect.co.uk/~lexical/wordsmith/index.htm (London)
*************************************************