Re: [Corpora-List] n-grams (follow-up question)

From: Ken Church (kwc@research.att.com)
Date: Wed Aug 28 2002 - 15:43:48 MET DST

Next message: andrius@ccl.bham.ac.uk: "[Corpora-List] N-gram extraction: Found it!"

Previous message: Christer Johansson: "Re: [Corpora-List] N-gram string extraction"
In reply to: Dirk Ludtke: "[Corpora-List] n-grams (follow-up question)"
Next in thread: Dirk Ludtke: "[Corpora-List] summary n-grams (follow-up question)"
Next in thread: andrius@ccl.bham.ac.uk: "Re: [Corpora-List] N-gram string extraction"

This is a great question! It is closely related to the literature on
adaptation (e.g., http://citeseer.nj.nec.com/rosenfeld96maximum.html).
Given that a term has appeared once before, what is the chance that it will
appear again? In general, terms that have appeared in the recent history
are very likely to be repeated in the near future, far more often than you
would expect by chance under a Poisson (independence) model.
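
To make the contrast with chance concrete, here is a minimal sketch that
compares the repeat probability a Poisson model predicts with the repeat
rate actually observed. The per-document counts are made up for
illustration, and the estimator (documents with two or more occurrences
divided by documents with at least one) is one simple choice, not
necessarily the procedure used in the papers mentioned in this message.

```python
import math

def poisson_repeat_prob(theta):
    """P(count >= 2 | count >= 1) if counts are Poisson with mean theta."""
    p_ge1 = 1.0 - math.exp(-theta)
    p_ge2 = 1.0 - math.exp(-theta) * (1.0 + theta)
    return p_ge2 / p_ge1

def observed_repeat_prob(counts):
    """Share of documents containing the term that contain it at least twice."""
    df1 = sum(1 for c in counts if c >= 1)
    df2 = sum(1 for c in counts if c >= 2)
    return df2 / df1 if df1 else 0.0

# Hypothetical counts of one term in 100 documents: absent from most,
# bursty in the few documents where it does appear.
counts = [0] * 95 + [3, 5, 1, 4, 2]
theta = sum(counts) / len(counts)  # mean occurrences per document

predicted = poisson_repeat_prob(theta)
observed = observed_repeat_prob(counts)
print(f"Poisson predicts {predicted:.3f}, observed {observed:.3f}")
```

On this toy data the Poisson model predicts a repeat probability of roughly
0.07, while 4 of the 5 documents that contain the term contain it more than
once (0.8).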

You might be interested in two papers on my home page:
http://www.research.att.com/~kwc/

(1) Poisson Mixtures, and
(2) Empirical Estimates of Adaptation: The chance of Two Noriegas is closer
to p/2 than p^2

The first paper assumes a parametric model (e.g., Poisson or a Mixture of
Poissons such as a negative binomial) for how words are distributed in text.
Given such a model, it is relatively easy to fit the statistical parameters
to the data and use the fit to answer questions like the ones you pose
below: what is the probability that a term (= ngram) will appear exactly k
times, and what is the probability that it will appear again?
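
As a rough illustration of the parametric route, the sketch below fits a
Poisson and a negative binomial to per-document counts and evaluates the
probability of exactly k occurrences under each. The counts are invented,
and the method-of-moments fit is an assumption made to keep the example
short; it is not claimed to be the fitting procedure used in the Poisson
Mixtures paper.

```python
import math

def poisson_pmf(k, theta):
    """P(X = k) for a Poisson with mean theta."""
    return math.exp(-theta + k * math.log(theta) - math.lgamma(k + 1))

def fit_negative_binomial(counts):
    """Method-of-moments fit; returns (r, p) with mean = r * (1 - p) / p."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    if var <= mean:
        raise ValueError("no overdispersion: a plain Poisson matches the moments")
    r = mean * mean / (var - mean)
    p = mean / var
    return r, p

def negative_binomial_pmf(k, r, p):
    """P(X = k) for a negative binomial with real-valued shape r."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

# Hypothetical counts of one term in 100 documents.
counts = [0] * 95 + [3, 5, 1, 4, 2]
theta = sum(counts) / len(counts)
r, p = fit_negative_binomial(counts)

# Probability of two or more occurrences in a document under each model.
poisson_ge2 = 1.0 - poisson_pmf(0, theta) - poisson_pmf(1, theta)
negbin_ge2 = 1.0 - negative_binomial_pmf(0, r, p) - negative_binomial_pmf(1, r, p)
print(f"P(k >= 2): Poisson {poisson_ge2:.4f}, negative binomial {negbin_ge2:.4f}")
```

Because the negative binomial has a second parameter to absorb the
burstiness, it puts more mass on repeated occurrences than a Poisson with
the same mean.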

The second paper was published 5 years later. It gets to many of the same
questions but avoids the troublesome parametric assumptions. The
assumptions are troublesome because people feel uneasy about them and
because they make the math a little complicated. If you are going to read
just one of these two papers, I would start with the second.

I have also written a bit about the earlier question in this thread -- that
is, how to compute frequencies of long ngrams in a large corpus. Folks might be
interested in "Using suffix arrays to compute term frequency and document
frequency for all substrings in a corpus" which can also be found on my
home page. But I gather that what is really being requested is a simple
software package that can be downloaded and run as is. As mentioned earlier
on this thread, there are some tutorials (also on my home page) that might
be helpful.
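
For readers who just want the basic idea behind suffix arrays, here is a
minimal sketch: sort all suffix start positions of the token sequence, then
find the block of suffixes that begin with a given ngram by binary search.
The function names are hypothetical, the construction is deliberately
naive (far too slow for a large corpus), and it only looks up term
frequency for one ngram at a time; the paper's method for covering all
substrings and document frequency is not reproduced here.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_frequency(tokens, suffix_array, ngram):
    """Count occurrences of ngram (a list of tokens) via two binary searches."""
    n = len(ngram)

    def prefix(pos):
        start = suffix_array[pos]
        return tokens[start:start + n]

    # First suffix whose n-token prefix is >= ngram.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < ngram:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose n-token prefix is > ngram.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - first

tokens = "to be or not to be that is the question".split()
suffix_array = build_suffix_array(tokens)
print(ngram_frequency(tokens, suffix_array, ["to", "be"]))  # prints 2
```

The same array answers queries for ngrams of any length, which is the
appeal over building a separate table for each n.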

----- Original Message -----
From: "Dirk Ludtke" <dludtke@pine.kuee.kyoto-u.ac.jp>
To: <corpora@hd.uib.no>
Sent: Wednesday, August 28, 2002 2:13 AM
Subject: [Corpora-List] n-grams (follow-up question)

> A slightly related question:
>
> I am wondering if anyone could point me to work on n-gram reoccurrence.
>
> A word (or n-gram) occurs k times in a corpus of n words. What is the
> probability that this word occurs again?
>
> Especially for small k, this probability seems to depend not only on k
> and n, but also on the ratio of words with low and high frequency.
>
> Is there a nice way to approximate these probabilities? Maybe with
> probability distributions? Is there a mathematical theory?
>
> Thank you.
>
>


This archive was generated by hypermail 2b29 : Wed Aug 28 2002 - 15:59:44 MET DST