[Corpora-List] entropy of text

From: Dinoj Surendran (dinoj@cs.uchicago.edu)
Date: Wed Feb 19 2003 - 01:06:11 MET


    Hello everyone,

    Suppose you have a text with C character types and N character tokens
    (so for a large book C would be under 50 and N in the thousands or
    millions), and you want to compute the entropy of the text. Suppose
    further that you're doing this by finding the limit of H_k/k for large
    k, where H_k is the entropy of the k-grams of the text. Naturally you
    can't take k very large if N is small.
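
    For concreteness, this is the kind of plug-in (maximum-likelihood)
    estimate I mean; a minimal Python sketch, where the file name
    corpus.txt and the range of k are just placeholders:

        from collections import Counter
        from math import log2

        def kgram_entropy(text, k):
            """Plug-in entropy of the (overlapping) k-grams of text, in bits."""
            kgrams = [text[i:i + k] for i in range(len(text) - k + 1)]
            n = len(kgrams)
            return -sum((c / n) * log2(c / n) for c in Counter(kgrams).values())

        text = open("corpus.txt").read()   # placeholder corpus
        for k in range(1, 8):              # small k only; see below
            h = kgram_entropy(text, k)
            print(f"k={k}  H_k={h:.3f} bits  H_k/k={h / k:.3f} bits/char")

    The catch is that this maximum-likelihood estimate is biased downwards
    once the number of distinct k-grams observed gets close to N, which is
    exactly why k can't be taken very large.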

    Can anyone point me to some good references on how large one can take k to
    be for a given C and N (and possibly other factors)? I'm looking at C=40
    and N=80 000.
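
    For a sense of scale, the naive coverage requirement N >> C^k (only a
    rule of thumb, not a substitute for a proper reference) gives roughly:

        from math import log

        C, N = 40, 80_000
        for margin in (1, 10, 100):   # demand N >= margin * C**k
            k_max = int(log(N / margin) / log(C))
            print(f"margin={margin:>3}: k_max={k_max}")

    With C=40 there are already 40^3 = 64 000 possible trigrams, about as
    many as my N=80 000 tokens, so this crude bound already caps k at 2 or 3.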

    Thanks,

    Dinoj Surendran
    Graduate Student
    Computer Science Dept
    University of Chicago

    PS - while I'm here, does anyone know of any online, freely available,
    large (>50 000 phonemes) corpora of phoneme-transcribed spontaneous
    conversation?

    I've got the Switchboard corpus for American English
    (http://www.isip.msstate.edu/projects/switchboard/),
    which has 80 000 phonemes syllabified into about 30 000 syllables.

    Similar corpora for any language would be useful.


