Re: [Corpora-List] Query on the use of Google for corpus research

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Jun 01 2005 - 15:35:37 MET DST


    > On May 31, 2005, at 6:56 PM, Marco Baroni wrote:
    > > it is a good idea to develop/gather/share
    > > tools and procedures to get them in "corpus format"...
    >
    > I have not followed this discussion very closely, so forgive me if I
    > am asking the obvious--but I wonder what you mean by "corpus format"?

    Sorry if I was vague. I meant something like: to transform raw data
    gathered from the web into something that can be used as a corpus.
    Minimally, that would mean making sure that all documents are in the same
    character encoding, I guess, but of course a good deal of post-processing
    (HTML/boilerplate stripping, (near-)duplicate detection, language
    identification...), annotation (POS, lemmatization, meta-information...),
    indexing with CWB or XAIRA or similar tools, etc., would be highly
    desirable.
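
    As a rough illustration of the minimal steps listed above, here is a
    small Python sketch using only the standard library. It decodes every
    page to Unicode, strips HTML markup, and drops near-duplicates by
    comparing word 3-gram "shingles" with Jaccard similarity. All function
    names, the Latin-1 fallback, and the 0.8 threshold are assumptions
    made for the example, not an established tool or procedure; language
    identification, annotation, and indexing are left out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def to_unicode(raw, declared_encoding="utf-8"):
    """Decode bytes to str. Assumption: fall back to Latin-1, which never fails."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("latin-1")


def strip_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def shingles(text, n=3):
    """Set of lowercased word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_corpus(raw_pages, threshold=0.8):
    """raw_pages: iterable of (bytes, declared_encoding) pairs.

    Returns cleaned texts, keeping the first of any near-duplicate group.
    The pairwise comparison is quadratic, so this only suits small samples.
    """
    kept, kept_shingles = [], []
    for raw, encoding in raw_pages:
        text = strip_html(to_unicode(raw, encoding))
        if not text:
            continue
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a page already kept
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```

    For anything beyond a small sample, the pairwise comparison would
    have to be replaced by something that scales, and real boilerplate
    removal needs more than tag stripping; the sketch only shows how the
    steps fit together.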

    Regards,

    Marco


