Re: [Corpora-List] Query on the use of Google for corpus research

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Jun 01 2005 - 15:35:37 MET DST

Next message: Nancy Ide: "Re: [Corpora-List] Query on the use of Google for corpus research"

Reply: Nancy Ide: "Re: [Corpora-List] Query on the use of Google for corpus research"

> On May 31, 2005, at 6:56 PM, Marco Baroni wrote:
> > it is a good idea to develop/gather/share
> > tools and procedures to get them in "corpus format"...
>
> I have not followed this discussion very closely, so forgive me if I
> am asking the obvious--but I wonder what you mean by "corpus format"?

Sorry if I was vague. I meant something like: transforming raw data
gathered from the web into something that can be used as a corpus.
Minimally, I guess that would mean making sure that all documents are in
the same character encoding, but of course a good deal of post-processing
(HTML/boilerplate stripping, (near-)duplicate detection, language
identification...), annotation (POS tagging, lemmatization,
meta-information...), indexing with CWB, XAIRA or similar tools, etc.,
would be highly desirable.
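
To make the first few steps concrete, here is a minimal sketch in Python
(standard library only) of encoding normalization, HTML stripping and
near-duplicate filtering. It is only an illustration under stated
assumptions, not an existing tool: the function names, the Latin-1
fallback, the word 3-gram shingles and the 0.8 Jaccard threshold are all
choices made for this example, and language identification, annotation and
indexing are left out entirely.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def to_unicode(raw, declared_encoding="utf-8"):
    """Decode bytes to str so every document ends up in one encoding.

    Assumption: if the declared encoding fails, fall back to Latin-1,
    which never raises but may mis-decode some characters.
    """
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("latin-1")


def strip_html(markup):
    """Drop tags and scripts, then collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def shingles(text, n=3):
    """Return the set of word n-grams of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_corpus(raw_pages, threshold=0.8):
    """Turn (bytes, declared_encoding) pairs into a list of clean texts.

    A page is dropped when its shingle set is at least `threshold`
    similar to a page that was already kept (quadratic, so only
    suitable for small collections).
    """
    kept, kept_shingles = [], []
    for raw, encoding in raw_pages:
        text = strip_html(to_unicode(raw, encoding))
        if not text:
            continue
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```

For a real web crawl the pairwise comparison would have to be replaced by
something that scales, and plain tag stripping does not remove navigation
bars and similar boilerplate, so this is really just the skeleton of the
idea.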

Regards,

Marco


This archive was generated by hypermail 2b29 : Wed Jun 01 2005 - 16:04:05 MET DST