[Corpora-List] Tokenizer for English Web Corpus

From: Adriano Ferraresi (a.ferraresi@gmail.com)
Date: Tue Mar 13 2007 - 12:39:46 MET

Next message: Eric Atwell: "Re: [Corpora-List] corpus of Welsh English???"

Previous message: j.degroote@lancaster.ac.uk: "[Corpora-List] corpus of Welsh English???"
Next in thread: Andrew.Lampert@csiro.au: "RE: [Corpora-List] Tokenizer for English Web Corpus (and Email Data)"
Reply: Andrew.Lampert@csiro.au: "RE: [Corpora-List] Tokenizer for English Web Corpus (and Email Data)"

Hi everybody,

I am embarking on a research project that aims to build a large corpus of
English through automated crawls of the web. For this purpose I would
appreciate suggestions for an efficient tokenizer for English, ideally one
that takes into account specific aspects of Web writing (such as the
treatment of emoticons, typos, commonly used abbreviations, etc.). Does
anyone know of such a tool?
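
To make the requirement concrete, here is a rough sketch of the behaviour I
have in mind: emoticons and common abbreviations should survive as single
tokens instead of being split into punctuation. It assumes a simple
regex-based approach; the emoticon pattern, the ABBREVIATIONS list and the
tokenize function are hypothetical illustrations, not an existing tool.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this
# sketch); a usable list would be much larger.
ABBREVIATIONS = ["e.g.", "i.e.", "etc.", "Mr.", "Dr."]

# Alternatives are tried left to right at each position, so the more
# specific patterns (URLs, emoticons, abbreviations) come before the
# generic word and punctuation patterns.
TOKEN_RE = re.compile(
    "|".join([
        r"https?://\S+",                                 # URLs kept whole
        r"[:;=][\-o']?[)(\]\[DPp/\\|]",                  # simple western emoticons
        "|".join(re.escape(a) for a in ABBREVIATIONS),   # known abbreviations
        r"\w+(?:'\w+)*",                                 # words, incl. "don't"
        r"[^\w\s]",                                      # any other single symbol
    ])
)


def tokenize(text):
    """Split text into tokens, keeping emoticons and abbreviations intact."""
    return TOKEN_RE.findall(text)


print(tokenize("Dr. Smith said lol :-) e.g. don't, etc."))
```

On that sample sentence the sketch yields Dr. / Smith / said / lol / :-) /
e.g. / don't / , / etc. as separate tokens. It does nothing about typos, and
the emoticon pattern will misfire on strings such as "note:Please", which is
why I am looking for something more robust.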

I will post a summary of the answers I (hopefully!) get.

Thank you.

Adriano Ferraresi

This archive was generated by hypermail 2b29 : Tue Mar 13 2007 - 12:36:55 MET