[Corpora-List] entropy of text

From: Dinoj Surendran (dinoj@cs.uchicago.edu)
Date: Wed Feb 19 2003 - 01:06:11 MET


    Hello everyone,

    Suppose you have a text with C character types and N character tokens
    (so for a large book C would be under 50 and N in the thousands or
    millions), and you want to compute the entropy of the text. Suppose
    further that you're doing this by finding the limit of H_k/k for large
    k, where H_k is the entropy of the k-grams of the text. Naturally you
    can't take k very large if N is small.
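
    For concreteness, this is the kind of plug-in (maximum-likelihood)
    estimate I mean; a minimal Python sketch, where the file name
    corpus.txt and the range of k are just placeholders:

        from collections import Counter
        from math import log2

        def kgram_entropy(text, k):
            """Plug-in entropy of the (overlapping) k-grams of text, in bits."""
            kgrams = [text[i:i + k] for i in range(len(text) - k + 1)]
            n = len(kgrams)
            return -sum((c / n) * log2(c / n) for c in Counter(kgrams).values())

        text = open("corpus.txt").read()   # placeholder corpus
        for k in range(1, 8):              # small k only; see below
            h = kgram_entropy(text, k)
            print(f"k={k}  H_k={h:.3f} bits  H_k/k={h / k:.3f} bits/char")

    The catch is that this maximum-likelihood estimate is biased downwards
    once the number of distinct k-grams observed gets close to N, which is
    exactly why k can't be taken very large.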

    Can anyone point me to some good references on how large one can take k to
    be for a given C and N (and possibly other factors)? I'm looking at C=40
    and N=80 000.
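
    For a sense of scale, the naive coverage requirement N >> C^k (only a
    rule of thumb, not a substitute for a proper reference) gives roughly:

        from math import log

        C, N = 40, 80_000
        for margin in (1, 10, 100):   # demand N >= margin * C**k
            k_max = int(log(N / margin) / log(C))
            print(f"margin={margin:>3}: k_max={k_max}")

    With C=40 there are already 40^3 = 64 000 possible trigrams, about as
    many as my N=80 000 tokens, so this crude bound already caps k at 2 or 3.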

    Thanks,

    Dinoj Surendran
    Graduate Student
    Computer Science Dept
    University of Chicago

    PS - while I'm here, does anyone know of any online, freely available,
    large (>50 000 phonemes) corpora of phoneme-transcribed spontaneous
    conversation?

    I've got the Switchboard corpus for American English
    (http://www.isip.msstate.edu/projects/switchboard/),
    which has 80 000 phonemes syllabified into about 30 000 syllables.

    Similar corpora for any language would be useful.


