A slightly related question:
I am wondering if anyone could point me to work on n-gram reoccurrence.
A word (or n-gram) occurs k times in a corpus of n words. What is the
probability that this word occurs again?
Especially for small k, this probability seems to depend not only on k
and n, but also on the ratio of words with low and high frequency.
Is there a nice way to approximate these probabilities? Maybe with
probability distributions? Is there a mathematical theory?
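
To make the question concrete, here is a minimal sketch of how the
quantity could be measured empirically on a held-out split. It is only
an illustration, not an answer: the function name reoccurrence_rate and
the split parameter are my own assumptions, and "occurs again" is taken
to mean "occurs at least once in the remainder of the corpus", so the
result depends on how long that remainder is.

```python
from collections import Counter

def reoccurrence_rate(tokens, k, split=0.5):
    """Fraction of word types seen exactly k times in the first part
    of the corpus that occur at least once in the remainder.

    Hypothetical helper for illustration; returns None when no word
    occurs exactly k times in the first part.
    """
    cut = int(len(tokens) * split)
    head = Counter(tokens[:cut])   # counts in the "observed" part
    tail = set(tokens[cut:])       # word types in the held-out part
    candidates = [w for w, c in head.items() if c == k]
    if not candidates:
        return None
    return sum(w in tail for w in candidates) / len(candidates)

# Toy corpus: in the first five tokens, "a" and "b" occur twice, "c" once.
tokens = "a b a c b d a e c f".split()
rate_k1 = reoccurrence_rate(tokens, k=1)
rate_k2 = reoccurrence_rate(tokens, k=2)
```

The same function works for n-grams if tokens is a list of n-gram
tuples instead of single words.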
Thank you.
This archive was generated by hypermail 2b29 : Wed Aug 28 2002 - 08:24:08 MET DST