Corpora: homophones in English

Doug Cooper (doug@th.net)
Fri, 27 Nov 1998 14:43:42 +0700

A question has come up on the Southeast Asian languages list
(sealang-l) regarding processes that form homophones. The
argument has been made that because SEA words tend to lose
syllables, there will be an 'unusual' number of one-syllable
overlaps.

In quick and dirty terms, some 10-12% of Thai dictionary headword
spellings have two or more entries (these are presumed to reflect
distinct etymologies), and about 12-14% of headword pronunciations
have two or more entries. Restricted to one-syllable words, the
figures are about 13% (duplicated orthography) and 16%
(duplicated sounds).
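
For anyone who wants to run the same count on an English word
list, here is a rough sketch of the arithmetic in Python. It
assumes the figures above mean "the share of distinct spellings
(or pronunciations) that head two or more entries". The function
name, the toy entries, and the ad hoc pronunciation strings are
all made up for illustration and do not come from any real
dictionary.

```python
from collections import Counter


def duplicated_share(keys):
    """Fraction of distinct keys that occur in two or more entries."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    duplicated = sum(1 for n in counts.values() if n >= 2)
    return duplicated / len(counts)


# Hypothetical toy data: one (spelling, pronunciation) pair per
# dictionary entry. "bank" appears twice to stand for two entries
# with distinct etymologies.
toy_entries = [
    ("bank", "bangk"),
    ("bank", "bangk"),
    ("pair", "pair"),
    ("pear", "pair"),
    ("cat", "kat"),
]

# Duplicated orthography: 1 of 4 distinct spellings is shared.
spelling_share = duplicated_share(s for s, _ in toy_entries)

# Duplicated sounds: 2 of 3 distinct pronunciations are shared.
sound_share = duplicated_share(p for _, p in toy_entries)

print(f"duplicated orthography: {spelling_share:.0%}")
print(f"duplicated sounds: {sound_share:.0%}")
```

Filtering the entry list to one-syllable words before calling
the function would give the restricted figures.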

Does anybody have a sense of what the equivalents are for
English lemmas? For my purposes, all polysemous derivations,
regardless of POS, are a single entry, while divergent
etymologies, even if suspect, are probably acceptable as
multiple entries.

Yes, I know there is lots of slop involved in making
such estimates. I'm willing to assume that the lexicographic
methods of the 60's - 80's on both the Thai and English sides
are more or less equivalent.

Thanks,
Doug Cooper
__________________________________________________
1425 VP Tower, 21/45 Soi Chawakun
Rangnam Road, Rajthevi, Bangkok, 10400
doug@th.net (662) 246-8946 fax (662) 246-8789

Southeast Asian Software Research Center, Bangkok
http://seasrc.th.net --> SEASRC Web site
http://seasrc.th.net/sealang --> SEALANG Web site