Re: [Corpora-List] What proportion of letter ngrams occur in English?

From: Bruce L. Lambert, Ph.D. (lambertb@uic.edu)
Date: Tue Jan 27 2004 - 18:42:27 MET

    Not at all disappointed by the responses. I know this is a difficult and
    unanswered question. Perhaps I should have supplied more context when I
    initially asked my question. So here goes.

    It is unfortunately rather common for drug names with similar spellings or
    pronunciations (e.g., Zantac/Xanax, Celebrex/Celexa/Cerebyx) to be confused
    by doctors, nurses, pharmacists and patients. Often these confusions are
    harmless, but sometimes they are fatal. By our best estimates, these "wrong
    drug" errors occur several million times per year in the U.S.

    One response is to ask drug companies to come up with less confusing names.
    They claim that is nearly impossible because "there are only 26 letters"
    and the space for distinct (non-confusing) new names is "running out." So
    this is the crux of the issue. Is the space for new names running out? The
    only way to tell is to calculate something like the "capacity" of the name
    space (given some assumptions, e.g., 8 letters or three syllables). There
    are many ways to approach this, several of which have been alluded to in
    responses to my initial query. I'd still like to hear more. I will
    summarize them in a week or so.
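
    As a rough illustration of what such a "capacity" calculation could look
    like, here is a minimal sketch. The 8-letter limit and the three-syllable
    limit come from the assumptions above; the (C)V(C) syllable template, the
    5-vowel/21-consonant split, and all function names are my own
    illustrative assumptions, not an established model of English.

```python
# Sketch of two crude upper bounds on the size of the name space.
# Assumptions (hypothetical, for illustration only):
#   - a 26-letter alphabet with 5 vowels (a, e, i, o, u) and 21 consonants
#   - a toy syllable template (C)V(C): optional onset, one vowel, optional coda

ALPHABET_SIZE = 26
VOWELS = 5
CONSONANTS = ALPHABET_SIZE - VOWELS


def raw_capacity(length: int) -> int:
    """Number of distinct letter strings of exactly `length` letters."""
    return ALPHABET_SIZE ** length


def raw_capacity_up_to(max_length: int, min_length: int = 1) -> int:
    """Number of distinct letter strings with min_length..max_length letters."""
    return sum(raw_capacity(n) for n in range(min_length, max_length + 1))


def syllable_count() -> int:
    """Number of distinct (C)V(C) syllables under the toy template."""
    onset_choices = CONSONANTS + 1  # one of 21 consonants, or no onset
    coda_choices = CONSONANTS + 1   # one of 21 consonants, or no coda
    return onset_choices * VOWELS * coda_choices


def syllable_sequences(syllables: int) -> int:
    """Number of sequences of `syllables` toy syllables.

    This overcounts distinct spellings, because one string can be split
    into syllables in more than one way (e.g. "ab" + "a" vs. "a" + "ba").
    """
    return syllable_count() ** syllables


if __name__ == "__main__":
    print("8-letter strings:        ", raw_capacity(8))
    print("strings of 1..8 letters: ", raw_capacity_up_to(8))
    print("toy syllables:           ", syllable_count())
    print("3-syllable sequences:    ", syllable_sequences(3))
```

    Both figures are ceilings only: neither accounts for pronounceability
    beyond the toy template, nor for how far apart two names must be before
    they stop being confusable.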

    Also, by "legal strings" I really only mean pronounceable strings. Since
    many drug names are neologisms, we don't have to worry about violating any
    other rules. As long as the name can be readily pronounced, it is a
    candidate to be a drug name. (There are other constraints, of course, that
    I am not going into here.)
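
    To show how a pronounceability constraint shrinks the raw count, here is
    a sketch that counts strings exactly under one deliberately crude
    stand-in rule: no run of more than two consecutive vowels or two
    consecutive consonants. That rule, the vowel/consonant split, and the
    function name are assumptions made for illustration; real English
    phonotactics are far less tidy.

```python
def count_pronounceable(length: int, max_run: int = 2,
                        vowels: int = 5, consonants: int = 21) -> int:
    """Count letter strings of `length` letters in which no run of
    consecutive vowels or consecutive consonants exceeds `max_run`.

    This is a toy proxy for pronounceability (an assumption), counted
    exactly by dynamic programming over the length of the trailing run.
    """
    if max_run < 1:
        raise ValueError("max_run must be at least 1")
    if length == 0:
        return 1  # the empty string

    # v[r] / c[r]: number of strings ending in a run of exactly r + 1
    # vowels / consonants.
    v = [0] * max_run
    c = [0] * max_run
    v[0], c[0] = vowels, consonants

    for _ in range(length - 1):
        new_v = [0] * max_run
        new_c = [0] * max_run
        # Switching letter class starts a new run of length 1.
        new_v[0] = sum(c) * vowels
        new_c[0] = sum(v) * consonants
        # Staying in the same class extends the current run by one.
        for r in range(1, max_run):
            new_v[r] = v[r - 1] * vowels
            new_c[r] = c[r - 1] * consonants
        v, c = new_v, new_c

    return sum(v) + sum(c)


if __name__ == "__main__":
    total = 26 ** 8
    allowed = count_pronounceable(8)
    print("8-letter strings passing the toy rule:", allowed)
    print("fraction of all 8-letter strings:     ", allowed / total)
```

    Changing max_run or swapping in a different rule changes the answer
    considerably, which is one reason any estimate of this kind can only be
    an order of magnitude.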

    -bruce

    At 10:12 AM 1/27/2004 +0000, Geoff Sampson wrote:
    >If you feel disappointed by what you have managed to find out to date, I think
    >this is probably because you are seeing it as a question with a sharply defined
    >(though unknown) answer: a given sequence is either legal or illegal; whereas
    >in fact it is a question of more or less natural, not black and white. "Q
    >must be followed by U" looks like a 100% English rule, but people interested
    >in aromatherapy and allied trades these days are frequently using the word
    >"qi" borrowed from Chinese -- they don't always or even usually italicize it
    >as a foreign word, and if we said any words borrowed from other languages
    >don't count we wouldn't have much English left. Furthermore, the constraints
    >are not just "local" but longer-range. The sequence "io" is common enough
    >in English, for instance in the suffix "-ation", but I think I'm right in
    >saying that "io" will only occur in words based on Latin or other non-native
    >roots; whereas the letter "w" will never occur in roots from Latin or Greek.
    >So is "walition" a legal English word? Each syllable looks normal enough, but
    >as a linguist I would wonder "what could the etymology of that possibly be?"
    >
    >This doesn't make your question a meaningless one -- far from it. But it is
    >one to which the answer can only be a broad order of magnitude rather than
    >an exact number, and it is much more complicated to estimate that figure than
    >it might seem to be. I don't know any place where someone has tried to do it;
    >it is not obvious why an academic linguist would want to.
    >
    >
    >Geoffrey Sampson MA PhD MBCS ILTM
    >Professor of Natural Language Computing
    >
    >Department of Informatics
    >University of Sussex
    >Falmer, Brighton BN1 9QH, England
    >
    >t +44 1273 678525
    >f +44 1273 671320
    >w www.grsampson.net
    >
    >e-mail address no longer shown to avoid spam flood


