Re: Corpora: Capacity of a namespace

Bill Fisher (william.fisher@nist.gov)
Wed, 24 Feb 1999 10:50:50 -0500

On Feb 23, 5:46pm, Bruce L. Lambert wrote:
> Subject: Corpora: Capacity of a namespace
> Hi Folks,
>
> I wonder if anyone out there could help shed light on the following question:
>
> Given the 26-letter English alphabet and a word of a given length L, how
> many phonologically legal, pronounceable names can be constructed?
>
[...]

This looked kind of interesting, so I threw together a program
that calculates the number and fraction of letter strings of
a certain size that spell pronounceable words. It generates
each string, produces a single most likely pronunciation for
it by applying a high-accuracy set of TTP rules, then tests
each such pronunciation for pronounceability by checking
whether it is syllabifiable according to Dan Kahn's theory
(at the slowest speaking rate). Here are some results so far
(more are on the way, but this dumb overgenerate-and-filter
process takes time):

N LETS   N STRINGS   N PRON   FRAC PRON
     2         676      267      .39497
     3       17576     6810      .38746
     4      456976   145624      .31867

(I omitted 1-letter words because my rules are trained
to always come out with the letter name in such cases.)
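
For anyone who wants to try the same overgenerate-and-filter
idea, here is a minimal sketch of the loop. It is not the
program described above: the filter toy_is_pronounceable is a
hypothetical stand-in (it only checks for a vowel letter) and
would have to be replaced by real letter-to-sound rules plus a
syllabifiability test. The function names are made up for
illustration.

```python
from itertools import product
from string import ascii_lowercase


def toy_is_pronounceable(word):
    """Hypothetical stand-in filter: accept any string with a vowel letter.

    A real filter would map the spelling to a pronunciation and then
    test whether that pronunciation can be syllabified.
    """
    return any(ch in "aeiouy" for ch in word)


def count_pronounceable(n, is_pronounceable=toy_is_pronounceable):
    """Enumerate all 26**n letter strings and count those the filter accepts."""
    total = 0
    accepted = 0
    for letters in product(ascii_lowercase, repeat=n):
        total += 1
        if is_pronounceable("".join(letters)):
            accepted += 1
    return total, accepted, accepted / total


if __name__ == "__main__":
    for n in (2, 3):
        print(n, *count_pronounceable(n))
```

The toy filter is far too permissive, so its fractions say
nothing about English; the point is only the shape of the
loop, whose cost grows as 26**n.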

It looks like the fraction falls as the strings get longer,
and you can guess that for longer words about 25-30% will be
pronounceable.
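
The arithmetic behind the table, and what the 25-30% guess
would mean in absolute numbers, can be sketched like this.
The counts are the ones in the table; the 25-30% range is
only the guess stated above, not a measured result, and the
function names are hypothetical.

```python
ALPHABET = 26

# (number of letters, number of pronounceable strings) from the table
ROWS = [(2, 267), (3, 6810), (4, 145624)]


def fraction(n_letters, n_pron):
    """Fraction of all 26**n_letters strings that were pronounceable."""
    return n_pron / ALPHABET ** n_letters


def implied_range(n_letters, low=0.25, high=0.30):
    """Counts implied by assuming a low..high pronounceable fraction."""
    total = ALPHABET ** n_letters
    return int(total * low), int(total * high)


if __name__ == "__main__":
    for n, pron in ROWS:
        print(n, ALPHABET ** n, pron, f"{fraction(n, pron):.5f}")
    # Extrapolation to 5 letters under the guessed range (an assumption)
    print(implied_range(5))
```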

You asked for a *theoretical* limit, but I don't think
there is such an animal. For one thing, it depends on
the degree to which you think foreign loans license
pronunciations of funny-looking new spellings.

Another approach you could try is to make a grammar
of word spellings and use it to calculate more directly
how many there are. The closest published thing to
this that I recall seeing is an old formula in a paper
by Benjamin Whorf, reprinted in a book of his collected
works. You could also induce a grammar based on word
spellings in a dictionary, but the tricky thing here
would be for it to make the right predictions about
unseen words, of course.
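
To show what counting from a grammar could look like, here is
a rough sketch with a deliberately crude toy grammar of
spellings. It is not Whorf's formula or anything induced from
a dictionary. The assumptions are mine: six vowel letters
(a e i o u y), twenty consonant letters, at least one vowel
per word, and no run of more than two vowels or two
consonants. Because the constraints amount to a deterministic
automaton over letters, each string is counted exactly once.

```python
from functools import lru_cache

N_VOWELS = 6    # a e i o u y (toy assumption)
N_CONS = 20     # the remaining letters
MAX_RUN = 2     # longest allowed run of vowels or of consonants (assumption)


def count_words(length, max_run=MAX_RUN):
    """Count letter strings of the given length allowed by the toy grammar."""

    @lru_cache(maxsize=None)
    def ways(remaining, last, run, has_vowel):
        # last is "V", "C", or "" at the start; run is the current run length
        if remaining == 0:
            return 1 if has_vowel else 0
        total = 0
        if last != "V" or run < max_run:
            new_run = run + 1 if last == "V" else 1
            total += N_VOWELS * ways(remaining - 1, "V", new_run, True)
        if last != "C" or run < max_run:
            new_run = run + 1 if last == "C" else 1
            total += N_CONS * ways(remaining - 1, "C", new_run, has_vowel)
        return total

    return ways(length, "", 0, False)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, count_words(n), count_words(n) / 26 ** n)
```

Unlike enumeration, this runs in time linear in the word
length, so long words are no problem. The hard part is still
the grammar itself: a realistic one would need far more
detail than run lengths of vowels and consonants.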

- Bill F.

-- 
Bill Fisher