[Corpora-List] Tokenizer for English Web Corpus

From: Adriano Ferraresi (a.ferraresi@gmail.com)
Date: Tue Mar 13 2007 - 12:39:46 MET


    Hi everybody,

    I am starting a research project that aims to build a large corpus of
    English through automatic crawls of the Web. For this purpose I would
    welcome suggestions for an efficient tokenizer for English. It should
    take into account specific aspects of Web writing (such as the
    treatment of emoticons, typos, commonly used abbreviations, etc.).
    Does anyone know of such a tool?

    I will post a summary of the answers I (hopefully!) receive.

    Thank you.

    Adriano Ferraresi


