Re: Corpora: Capacity of a namespace

Robert Luk (csrluk@comp.polyu.edu.hk)
Thu, 25 Feb 1999 10:17:49 +0800

> >Given the 26-letter English alphabet and a word of a given length L, how
> >many phonologically legal, pronounceable names can be constructed?

> Well, Bruce, unless I miss my guess, you'll have to do the work
> yourself, but it shouldn't be too arduous. You just have to follow
> these steps (many of them will require you to make some decisions, such
> as the length and nature of possible vowel sequences, and you'll no
> doubt find out some interesting stuff along the way). (By the way,
> check out some intro phonology texts -- eg, Giegerich -- for some ideas
> on some of these questions).
>
> 1) characterize [or even list] all permissible intervocalic consonant
> sequences;
> 2) characterize [or list] all permissible interconsonantal vowel
> sequences;
> 3) characterize all permissible
>    a) word-initial consonant sequences; and
>    b) word-final consonant sequences [these may be rather different
>       than the answer to (1), or even than some subdivision of it]
> 4) characterize all permissible
>    a) word-initial vowel sequences; and
>    b) word-final vowel sequences [again, likely to be somewhat
>       different from (2)].

Is "tsetse fly" (or tzetze) an English word? The name appears in the Collins
dictionary and perhaps in others such as the OALD. Yet the initial consonant sequence "ts" is
not one we usually find in English words, according to an English professor (Gimson?).
What counts as permissible may be a bit fuzzy. Anyway, as long as such cases don't occur too often,
they won't affect the counts much. But then, should medical terms of Greek-like or Latin-like
origin be considered English words (e.g. pseudopodia)? These will come up in large numbers.
On another point, the phonotactics is well specified in the phonemic domain, but it may not
be in the orthography.

Best,

Robert Luk
Dept. of Computing
Hong Kong Polytechnic University

> Then, just start with (3a), followed by (2), followed by (1),
> ..., but ending with (3b) or (4b), according to the case at hand, for
> one set of possible words. Likewise, start with (4a), followed by (1),
> followed by (2), by (1), ... but ending with (3b) or (4b), ...
>
> You get the idea. Not superelegant, but a simple program should
> make it work, and you just add a counter, and in the morning when the
> program is done, you just check the number on the counter. You really
> don't want to look at the results, I'm sure, because they'd look pretty
> bad (_I_ certainly wouldn't buy a medicine with most of the names that
> would be produced).
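
The procedure described above (open with an initial sequence, alternate
consonant and vowel sequences, close with a final sequence) can be sketched
roughly as follows. This is only a sketch under stated assumptions, not Jim's
program: the function name count_names, the inventory names, and the toy
inventories are all hypothetical, and the toy data is not a description of
English. It also swaps the overnight enumerate-and-count run for a
length-based recurrence, which gives the same total only if consonant letters
and vowel letters are disjoint sets, so that every string splits into
sequences in exactly one way.

```python
from collections import Counter
from functools import lru_cache


def count_names(length, onset, coda, medial_c, medial_v, initial_v, final_v):
    """Count distinct strings of exactly `length` letters that alternate
    consonant and vowel sequences drawn from the given inventories.

    onset / coda         -> steps (3a) / (3b): word-initial / word-final C sequences
    medial_c / medial_v  -> steps (1) / (2): intervocalic C / interconsonantal V
    initial_v / final_v  -> steps (4a) / (4b): word-initial / word-final V sequences
    Assumption: consonant and vowel letters are disjoint, so each string
    has a unique segmentation and counting paths equals counting strings.
    """
    inventories = {"onset": onset, "coda": coda, "medial_c": medial_c,
                   "medial_v": medial_v, "initial_v": initial_v,
                   "final_v": final_v}
    # For counting, only the number of sequences of each length matters.
    # Empty sequences are dropped so the recurrence always makes progress.
    hist = {name: Counter(len(s) for s in set(seqs) if s)
            for name, seqs in inventories.items()}

    @lru_cache(maxsize=None)
    def finish(remaining, prev):
        """Ways to fill `remaining` letters when the previous segment was
        a consonant sequence (prev == "C") or a vowel sequence ("V")."""
        if prev == "C":
            final, medial, nxt = hist["final_v"], hist["medial_v"], "V"
        else:
            final, medial, nxt = hist["coda"], hist["medial_c"], "C"
        total = final[remaining]        # close the word with a final sequence
        for k, n in medial.items():     # or continue with a medial sequence
            if k < remaining:
                total += n * finish(remaining - k, nxt)
        return total

    # One-segment words: a vowel sequence legal both initially and finally.
    total = len({s for s in set(initial_v) & set(final_v) if len(s) == length})
    for k, n in hist["onset"].items():       # words starting with (3a)
        if k < length:
            total += n * finish(length - k, "C")
    for k, n in hist["initial_v"].items():   # words starting with (4a)
        if k < length:
            total += n * finish(length - k, "V")
    return total


# Hypothetical toy inventories, for illustration only.
toy = dict(onset={"b", "t"}, coda={"t"}, medial_c={"b", "t"},
           medial_v={"a"}, initial_v={"a"}, final_v={"a", "o"})

print(count_names(3, **toy))  # 6: bat, tat, aba, abo, ata, ato
```

With real inventories the fuzzy cases raised above (an onset such as "ts",
or Greek-like clusters) become a matter of adding or removing entries and
comparing the totals.
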
>
> Jim
>
> James L. Fidelholtz e-mail: jfidel@siu.buap.mx
> Maestri'a en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y Humanidades
> Beneme'rita Universidad Auto'noma de Puebla, ME'XICO