Re: [Corpora-List] n-grams (follow-up question)

From: Ken Church (kwc@research.att.com)
Date: Wed Aug 28 2002 - 15:43:48 MET DST

    This is a great question! It is closely related to the literature on
    adaptation (e.g., http://citeseer.nj.nec.com/rosenfeld96maximum.html).
    Given that a term has appeared once before, what is the chance that it will
    appear again? In general, terms that have appeared in the recent history
    are very likely to be repeated in the near future, much more likely than
    you would expect by chance under a Poisson model.

    You might be interested in two papers on my home page:
    http://www.research.att.com/~kwc/

    (1) Poisson Mixtures, and
    (2) Empirical Estimates of Adaptation: The chance of Two Noriegas is closer
    to p/2 than p^2

    The first paper assumes a parametric model (e.g., Poisson or a Mixture of
    Poissons such as a negative binomial) for how words are distributed in text.
    Given such a model, it is relatively easy to fit the statistical parameters
    to the data and use the fit to compute answers to questions like the one you
    pose below (what is the probability that a term (= ngram) will appear
    exactly k times or what is the probability that it will appear again).
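
    To make the parametric route concrete, here is a minimal sketch in Python.
    It is not the code or the fitting procedure from the paper: the
    method-of-moments fit, the function names and the per-document counts are
    all assumptions chosen for illustration. It compares a plain Poisson with
    a negative binomial (a gamma mixture of Poissons) on the two questions
    above: the probability of exactly k occurrences, and the probability of a
    second occurrence given a first.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson with mean lam > 0 (computed in log space)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def neg_binomial_pmf(k, r, p):
    """P(X = k) for a negative binomial with shape r > 0 and 0 < p < 1.

    Mean is r*(1-p)/p and variance is r*(1-p)/p**2.
    """
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1 - p))

def fit_neg_binomial_moments(counts):
    """Method-of-moments fit (an assumption; other fits are possible)."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    if var <= mean:
        raise ValueError("counts are not overdispersed; use a Poisson")
    p = mean / var
    r = mean * mean / (var - mean)
    return r, p

def prob_again(pmf):
    """P(X >= 2 | X >= 1): chance of a repeat given one occurrence."""
    p0, p1 = pmf(0), pmf(1)
    return (1.0 - p0 - p1) / (1.0 - p0)

# Hypothetical per-document counts of one term in 100 documents:
# absent from most documents, but bursty where it does appear.
counts = [0] * 90 + [1] * 4 + [2] * 2 + [5] * 2 + [10] * 2

lam = sum(counts) / len(counts)
r, p = fit_neg_binomial_moments(counts)

pois_again = prob_again(lambda k: poisson_pmf(k, lam))
nb_again = prob_again(lambda k: neg_binomial_pmf(k, r, p))

print("Poisson           P(again | seen once) =", round(pois_again, 3))
print("negative binomial P(again | seen once) =", round(nb_again, 3))
```

    On bursty counts like these, the mixture model assigns a much higher
    chance of a repeat than the single Poisson does, which is the effect
    described above.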

    The second paper was published 5 years later. It gets to many of the same
    questions but avoids the troublesome parametric assumptions. The
    assumptions are troublesome because people feel uneasy about them and
    because they make the math a little complicated. If you are going to read
    just one of these two papers, I would start with the second.
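
    A non-parametric estimate in this spirit can be sketched as follows. The
    setup is an assumption made for illustration, not a transcription of the
    paper: each document is split into a first half (the "history") and a
    second half (the "test"), and the chance of seeing the term in the test
    half overall is compared with the chance of seeing it there given that it
    already appeared in the history. The function name and the toy documents
    are hypothetical.

```python
def adaptation_estimates(docs, term):
    """Return (prior, adapted) for `term` over a list of token lists.

    prior   = fraction of documents with the term in the test half
    adapted = fraction of documents with the term in the test half,
              among those that have it in the history half
    """
    if not docs:
        raise ValueError("need at least one document")
    in_history = in_test = in_both = 0
    for tokens in docs:
        mid = len(tokens) // 2
        hist = term in tokens[:mid]   # seen in the first half?
        test = term in tokens[mid:]   # seen in the second half?
        in_history += hist
        in_test += test
        in_both += hist and test
    prior = in_test / len(docs)
    adapted = in_both / in_history if in_history else float("nan")
    return prior, adapted

# Tiny made-up corpus, only to show the bookkeeping.
docs = [
    "noriega fled panama after noriega lost".split(),
    "the market rose and the market fell".split(),
    "noriega was mentioned once in this report".split(),
    "rain fell on the plain all day".split(),
]
prior, adapted = adaptation_estimates(docs, "noriega")
print("prior =", prior, "adapted =", adapted)
```

    No distributional assumption is needed here; the estimates are just
    ratios of document counts, at the cost of needing enough documents in
    which the term appears in the history half.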

    I have also written a bit about the earlier question in this thread -- that is,
    how to compute frequencies of long ngrams in a large corpus. Folks might be
    interested in "Using suffix arrays to compute term frequency and document
    frequency for all substrings in a corpus" which can also be found on my
    home page. But I gather that what is really being requested is a simple
    software package that can be downloaded and run as is. As mentioned earlier
    on this thread, there are some tutorials (also on my home page) that might
    be helpful.
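
    For readers who want a feel for the suffix-array idea before downloading
    anything, here is a deliberately naive sketch in Python. It is not the
    algorithm from the paper: it builds a token-level suffix array by plain
    sorting (far too slow for a large corpus) and handles term frequency
    only, not document frequency. The function names are assumptions. The
    point is that once suffixes are sorted, all occurrences of any n-gram,
    however long, sit in one contiguous block that two binary searches can
    find.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def term_frequency(tokens, suffix_array, ngram):
    """Count occurrences of `ngram` by locating its block of suffixes."""
    ngram = list(ngram)
    n = len(ngram)

    def boundary(include_equal):
        # First index whose n-token prefix is >= ngram (or > ngram
        # when include_equal is True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = tokens[start:start + n]
            if prefix < ngram or (include_equal and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return boundary(True) - boundary(False)

tokens = "to be or not to be that is the question".split()
sa = build_suffix_array(tokens)

for query in (["to", "be"], ["be"], ["not", "to", "be"], ["corpus"]):
    print(" ".join(query), "->", term_frequency(tokens, sa, query))
```

    The same array answers queries for n-grams of any length without
    rebuilding anything, which is what makes the approach attractive for
    long n-grams.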

    ----- Original Message -----
    From: "Dirk Ludtke" <dludtke@pine.kuee.kyoto-u.ac.jp>
    To: <corpora@hd.uib.no>
    Sent: Wednesday, August 28, 2002 2:13 AM
    Subject: [Corpora-List] n-grams (follow-up question)

    > A slightly related question:
    >
    > I am wondering if anyone could point me to work on n-gram recurrence.
    >
    > A word (or n-gram) occurs k times in a corpus of n words. What is the
    > probability that this word occurs again?
    >
    > Especially for small k, this probability seems to depend not only on k
    > and n, but also on the ratio of words with low and high frequency.
    >
    > Is there a nice way to approximate these probabilities? Maybe with
    > probability distributions? Is there a mathematical theory?
    >
    > Thank you.