RE: [Corpora-List] Tokenizer for English Web Corpus (and Email Data)

From: Andrew.Lampert@csiro.au
Date: Wed Mar 14 2007 - 05:22:29 MET


    Further to Adriano's request below, is anyone aware of sentence
    tokenizers/splitters that have been trained on or applied to email data?

     
    Some of the noise in email text will be similar to that of web text
    (emoticons, typos, etc.), but there are also email-specific phenomena
    (greetings, signatures, quoted material, etc.) that seem to require
    techniques tailored to email.
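
    As an illustration only (this is not a tool mentioned in the thread),
    the email-specific phenomena listed above could be handled by a
    pre-processing pass that runs before any sentence splitter. The Python
    sketch below is a minimal example under stated assumptions: quoted
    material is marked with a leading ">", the signature starts at the
    conventional "-- " delimiter line, and forwarded headers look like
    "From:" or "Sent:". All names and patterns are hypothetical, and the
    sentence splitter is deliberately naive.

```python
import re

# Hypothetical patterns for illustration; real email needs a richer set.
QUOTE_LINE = re.compile(r"^\s*>")            # lines quoted with '>'
SIG_DELIM = re.compile(r"^--\s*$")           # conventional "-- " signature delimiter
FORWARD_HDR = re.compile(r"^\s*(From|Sent|To|Subject):\s", re.IGNORECASE)
GREETING = re.compile(r"^\s*(hi|hello|dear)\b[^.!?]*[,!]?\s*$", re.IGNORECASE)


def strip_email_noise(body: str) -> str:
    """Drop greetings and quoted lines; cut at a signature or forwarded header."""
    kept = []
    for line in body.splitlines():
        if SIG_DELIM.match(line) or FORWARD_HDR.match(line):
            break  # assume everything below is signature or forwarded material
        if QUOTE_LINE.match(line) or GREETING.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


if __name__ == "__main__":
    message = (
        "Hi everybody,\n\n"
        "Is this useful? I hope so.\n"
        "> quoted text here.\n"
        "Thanks for reading.\n"
        "-- \n"
        "Andrew\n"
    )
    for sentence in split_sentences(strip_email_noise(message)):
        print(sentence)
```

    Signatures that do not use the "-- " delimiter (a dashed rule, or no
    marker at all) would slip through this sketch, which is part of why
    techniques trained on email data are of interest.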
     
    I await your summary of responses with interest, Adriano.
     
    Are there any additional pointers that people can offer, specifically
    with regard to processing email text?
     
    Thanks,
    Andrew Lampert
    --------------
    Andrew Lampert
    Research Engineer
    Information Engineering Laboratory
    CSIRO ICT Centre
    <http://www.ict.csiro.au/staff/Andrew.Lampert/>

    Post: Locked Bag 17, North Ryde, NSW 1670, Australia
    Office: Building E6B, Macquarie University, North Ryde, 2113
    Tel: +61 2 9325 3129, Fax: +61 2 9325 3200
      

      _____

    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Adriano Ferraresi
    Sent: Tuesday, 13 March 2007 10:40 PM
    To: CORPORA@UIB.NO
    Subject: [Corpora-List] Tokenizer for English Web Corpus

    Hi everybody,
     
    I am embarking on a research project that aims to build a large
    corpus of English through automatic crawls of the web. For this purpose
    I would welcome suggestions for an efficient tokenizer for English,
    ideally one that takes into account specific aspects of Web writing
    (such as the treatment of emoticons, typos, commonly used
    abbreviations, etc.). Does anyone know of such a tool?
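
    Purely to illustrate what such a tokenizer has to cope with (it is not
    a tool recommended in this thread), the Python sketch below uses one
    ordered regular expression so that URLs, a few Western-style emoticons
    and dotted abbreviations survive as single tokens. The pattern set and
    the function name are assumptions made for the example. It does nothing
    about typos, and a URL followed directly by punctuation will absorb
    that punctuation.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN = re.compile(
    r"""
    https?://\S+                     # URLs, kept whole
  | [:;=][-o*']?[)\](\[dDpP/\\|]     # emoticons such as :-) ;) :P =(
  | (?:[A-Za-z]\.){2,}               # dotted abbreviations: e.g. i.e. U.S.
  | \w+(?:'\w+)*                     # words, allowing internal apostrophes
  | [^\w\s]                          # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text`; whitespace is discarded."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize("I'm at http://example.com :-) e.g. it works!"))
```

    Putting the URL and emoticon patterns ahead of the generic word and
    punctuation patterns is what stops ":-)" from being split into three
    punctuation tokens.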
     
    I will post a summary of the answers I (hopefully!) receive.
     
    Thank you.
     
    Adriano Ferraresi


