Re: Corpora: Syllable prediction errors

Bill Teahan (bill@it.lth.se)
Mon, 22 Feb 1999 11:54:59 +0100

Bruce L. Lambert wrote:

> I tested my syllable prediction program against 2063 generic drug names
> with known pronunciations. I achieved 93% accuracy using the code I posted
> yesterday. Below are the errors. It's mostly double vowel problems, "qu"
> problems, and a few assorted silent "e" problems. I'll try to work around
> these and I'll let you know how I do. Others might want to try their
> techniques on the "difficult" words below.
>
> Word              Pronunciation             Predicted Num Sylls
> ----              -----------------------   -------------------
> "Acacia"          "a kay(') sha"            4
> "Acetylcysteine"  "a se teel sis(') teen"   6
> "Albutoin"        "al byoo(') toyn"         4
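
For anyone wanting to check their own predictor against the quoted list, here is a minimal sketch in Python. It assumes that the syllables in the pronunciation column are separated by spaces and that "(')" only marks stress; the function names are hypothetical and this is not Bruce's code.

```python
# Rows copied from the quoted error list: (word, pronunciation, predicted count).
ERRORS = [
    ("Acacia", "a kay(') sha", 4),
    ("Acetylcysteine", "a se teel sis(') teen", 6),
    ("Albutoin", "al byoo(') toyn", 4),
]


def reference_syllables(pronunciation: str) -> int:
    """Count syllables, assuming one space-separated token per syllable."""
    return len(pronunciation.split())


def over_prediction(rows):
    """Map each word to (predicted count - reference count)."""
    return {
        word: predicted - reference_syllables(pron)
        for word, pron, predicted in rows
    }


if __name__ == "__main__":
    for word, diff in over_prediction(ERRORS).items():
        print(f"{word}: off by {diff:+d}")
```

Under that assumption, all three quoted rows come out one syllable too high.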

I'm currently working on some software that gets very good results on
the problem of word segmentation (e.g. 99% accuracy on both Chinese and
English text). This software should work just as well at predicting
syllable and sentence boundaries. However, what is needed is a large
corpus of training text with either syllable or sentence boundaries
explicitly marked. Does anyone know where I can obtain such data?
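
To make "explicitly marked" concrete, here is a rough Python sketch of the kind of input I have in mind. It assumes a hypothetical format in which syllables are separated by spaces, as in the pronunciation strings quoted above, with "(')" as a stress mark to be dropped. It only illustrates the data format and is not the segmentation software itself.

```python
def boundary_labels(marked: str):
    """Turn a space-delimited syllable string into (text, labels).

    labels[i] is 1 if a syllable boundary falls between text[i] and
    text[i + 1], and 0 otherwise, so len(labels) == len(text) - 1.
    """
    # Drop the stress marker; it is not part of the boundary information.
    syllables = [s.replace("(')", "") for s in marked.split()]
    text = "".join(syllables)
    labels = []
    for i, syl in enumerate(syllables):
        # No boundary between characters inside the same syllable.
        labels.extend([0] * (len(syl) - 1))
        # One boundary after every syllable except the last.
        if i < len(syllables) - 1:
            labels.append(1)
    return text, labels


if __name__ == "__main__":
    print(boundary_labels("a kay(') sha"))
```

A corpus in any format that can be reduced to such (text, labels) pairs would do, whether the marked units are syllables or sentences.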

Bill Teahan
Visiting researcher
Dept. of Information Technology
Lund University
Lund, Sweden