[Corpora-List] Google releases their database of N-grams

From: John F. Sowa (sowa@bestweb.net)
Date: Fri Aug 04 2006 - 23:50:41 MET DST

    Google, one of the world's biggest data collectors, is releasing
    its collection of 5-grams as freely available data. Anyone
    interested in doing research on techniques that use N-grams can
    now wallow in an ocean of data.

    Following is an excerpt from the Google announcement.

    John Sowa
    __________________________________________________________________

    http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

    Google Research

    All Our N-gram are Belong to You

    8/03/2006 11:26:00 AM
    Posted by Alex Franz and Thorsten Brants,
    Google Machine Translation Team

    Here at Google Research we have been using word n-gram models for a
    variety of R&D projects, such as statistical machine translation, speech
    recognition, spelling correction, entity detection, information
    extraction, and others. While such models have usually been estimated
    from training corpora containing at most a few billion words, we have
    been harnessing the vast power of Google's datacenters and distributed
    processing infrastructure to process larger and larger training corpora.
    We found that there's no data like more data, and scaled up the size of
    our data by one order of magnitude, and then another, and then one more,
    resulting in a training corpus of one trillion words from public Web
    pages.
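
    For readers new to the technique: a word n-gram model estimates the
    probability of a word from the n-1 words preceding it, as the relative
    frequency count(w1 ... wn) / count(w1 ... wn-1) observed in the training
    corpus. Below is a minimal single-machine sketch in Python; the function
    names are illustrative, and nothing here reflects Google's actual
    implementation.

        from collections import Counter

        def train_counts(tokens, n=5):
            """Collect n-gram and (n-1)-gram history counts from a token list."""
            ngrams = Counter(tuple(tokens[i:i + n])
                             for i in range(len(tokens) - n + 1))
            histories = Counter(tuple(tokens[i:i + n - 1])
                                for i in range(len(tokens) - n + 2))
            return ngrams, histories

        def mle_probability(ngrams, histories, *words):
            """Maximum-likelihood estimate of P(words[-1] | words[:-1])."""
            history = tuple(words[:-1])
            if histories[history] == 0:
                return 0.0
            return ngrams[tuple(words)] / histories[history]

    Real systems add smoothing on top of such counts, since even a
    trillion-word corpus leaves most possible five-word sequences unseen.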

    We believe that the entire research community can benefit from access to
    such massive amounts of data. It will advance the state of the art, it
    will focus research in the promising direction of large-scale,
    data-driven approaches, and it will allow all research groups, no matter
    how large or small their computing resources, to play together. That's
    why we decided to share this enormous dataset with everyone. We
    processed 1,024,908,267,229 words of running text and are publishing the
    counts for all 1,176,470,663 five-word sequences that appear at least 40
    times. There are 13,588,391 unique words, after discarding words that
    appear fewer than 200 times.
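
    The counting itself is conceptually simple, even though running it over
    a trillion words required Google's distributed infrastructure. Here is a
    toy single-machine illustration of the two cutoffs described above,
    assuming rare words are mapped to an <UNK> placeholder (a common
    convention; the announcement says only that such words were discarded):

        from collections import Counter

        def count_five_grams(tokens, word_cutoff=200, ngram_cutoff=40):
            """Apply a vocabulary cutoff, then keep only frequent 5-grams."""
            word_counts = Counter(tokens)
            # Words seen fewer than word_cutoff times drop out of the vocabulary.
            vocab = [t if word_counts[t] >= word_cutoff else "<UNK>"
                     for t in tokens]
            five_grams = Counter(tuple(vocab[i:i + 5])
                                 for i in range(len(vocab) - 4))
            # Only sequences seen at least ngram_cutoff times are published.
            return {g: c for g, c in five_grams.items() if c >= ngram_cutoff}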

    Watch for an announcement at the Linguistic Data Consortium (LDC), which
    will be distributing it soon,
    and then order your set of 6 DVDs. And let us hear from you - we're
    excited to hear what you will do with the data, and we're always
    interested in feedback about this dataset, or other potential datasets
    that might be useful for the research community.


