Re: [Corpora-List] word frequencies on the web

From: radev@umich.edu
Date: Fri Dec 08 2006 - 17:51:25 MET

    Have you seen this release from Google?

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

    Introduction

    This data set, contributed by Google Inc., contains English word
    n-grams and their observed frequency counts. The length of the n-grams
    ranges from unigrams (single words) to five-grams. We expect this data
    will be useful for statistical language modeling, e.g., for machine
    translation or speech recognition, as well as for other uses.

    Source Data

    The n-gram counts were generated from approximately 1 trillion word
    tokens of text from publicly accessible Web pages.
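
    For the original question (estimating how frequent a word is on the
    Web), unigram counts like these can be turned into relative
    frequencies. The following is only a rough sketch: it assumes each
    line of a unigram file holds a word, a tab, and a count, and the
    function names and the tiny sample data are made up for
    illustration, not taken from the release.

```python
import io


def load_unigram_counts(lines):
    """Read 'word<TAB>count' lines into a dict (assumed file layout)."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def relative_frequency(counts, word, total=None):
    """Return count(word) / total tokens; 0.0 for unseen words."""
    if total is None:
        total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.get(word, 0) / total


# Hypothetical sample standing in for a real unigram file.
sample = io.StringIO("the\t50\ncorpus\t3\nweb\t7\n")
counts = load_unigram_counts(sample)
print(relative_frequency(counts, "web"))
```

    With the real data one would pass an opened (and decompressed) file
    instead of the sample, and supply the corpus-wide token total if it
    is known, since summing only the listed words ignores anything left
    out of the file.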

    >
    > Dear all, does anyone know of ways to estimate the frequency of words
    > on the web, or whether there are search engines that supply this
    > information (as AltaVista used to do)?
    >
    > thank you!
    > tony
    > www2.lael.pucsp.br/~tony

    -- 
    Dragomir R. Radev                    Associate Professor
    SI, CSE, Ling                     U. Michigan, Ann Arbor 
    http://www.eecs.umich.edu/~radev         radev@umich.edu              
    


