Re: [Corpora-List] Query on the use of Google for corpus research

From: Nancy Ide (ide@cs.vassar.edu)
Date: Wed Jun 01 2005 - 17:06:21 MET DST

  • Next message: Marco Baroni: "Re: [Corpora-List] Query on the use of Google for corpus research"

    On Jun 1, 2005, at 9:35 AM, Marco Baroni wrote:

    > Sorry if I was vague. I meant something like: to transform raw data
    > gathered from the web into something that can be used as a corpus.
    > Minimally, that would mean making sure that all documents are in the
    > same character encoding, I guess, but of course a good deal of
    > post-processing (html/boilerplate stripping, (near-)duplicate
    > detection, language identification...), annotation (POS,
    > lemmatization, meta-information...), indexing with CWB or XAIRA or
    > similar tools, etc., would be highly desirable.
    >
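
    For concreteness, the html/boilerplate-stripping step listed above can be
    sketched with nothing but the Python standard library's HTML parser. This
    is only an illustration of the idea (visible text kept, script/style
    content dropped), not anyone's actual pipeline code:

    ```python
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible text, skipping <script> and <style> content."""
        SKIP = {"script", "style"}

        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip_depth = 0  # >0 while inside a script/style element

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth and data.strip():
                self.chunks.append(data.strip())

    def strip_html(html):
        """Return the visible text of an HTML fragment, whitespace-joined."""
        p = TextExtractor()
        p.feed(html)
        return " ".join(p.chunks)
    ```

    Real boilerplate removal (navigation bars, footers, ads) of course needs
    far more than this, but the tag-stripping core is no more than the above.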

    We've actually done a lot of that in the process of developing the
    American National Corpus. We have gotten data off the web in several
    formats, but for our purposes the data has to be American English,
    produced post-1990, and free of copyright constraints, so we are a
    bit pickier about what we download than the "web as corpus" approach
    dictates. We have a pipeline that takes data in most formats (PDF,
    Word, etc.), strips out the text, does its best to identify titles,
    tables, etc. and mark them as such, and runs the result through GATE
    (http://gate.ac.uk, using some additional GATE plugins we've
    developed) to do tokenization, sentence splitting, POS tagging, noun
    and verb phrase chunking, etc. We dump the output in our XML
    stand-off format, in UTF-16 for raw data and UTF-8 for annotations,
    but since the pipeline is modular, any step can be replaced with
    another tool that does things differently. We also handle HTML, but
    because authors can use HTML tags any way they like (e.g. <p> and
    <font> tags for headers instead of <h1> etc.), and no two documents
    are ever the same (it seems), this is more labor-intensive. We also
    have a tool for near-duplicate detection, which was used on NYTimes
    data but might be generalizable.
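
    To give a flavor of what near-duplicate detection involves, here is a
    minimal shingle-based sketch: documents are reduced to sets of k-word
    sequences and compared by Jaccard overlap. The parameters (k=5,
    threshold) and the all-pairs loop are illustrative assumptions, not a
    description of our actual tool, which would need hashing tricks to
    scale to NYTimes-sized data:

    ```python
    def shingles(text, k=5):
        """Set of k-word shingles (contiguous word sequences)."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        """Jaccard similarity of two shingle sets."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def near_duplicates(docs, threshold=0.6, k=5):
        """Index pairs of documents whose shingle overlap meets the threshold."""
        sets = [shingles(d, k) for d in docs]
        pairs = []
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if jaccard(sets[i], sets[j]) >= threshold:
                    pairs.append((i, j))
        return pairs
    ```

    At scale one would replace the exact Jaccard computation with MinHash
    sketches or similar, since the pairwise loop is quadratic in the number
    of documents.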

    BTW the ANC can be used with XAIRA--see
    http://AmericanNationalCorpus.org/xaira.html, which provides a few
    pre-processing tools that enable indexing the ANC data in XAIRA.
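
    For anyone unfamiliar with the stand-off idea mentioned above: the
    annotations live in a separate document and point at the base text by
    character offsets, rather than wrapping it inline. The toy format below
    is made up for illustration and is not the ANC's actual schema:

    ```python
    import xml.etree.ElementTree as ET

    def make_standoff(text, annotations):
        """Serialize (start, end, label) spans as a stand-off XML document
        that references the base text by character offsets."""
        root = ET.Element("annotations")
        for start, end, label in annotations:
            ET.SubElement(root, "ann", start=str(start), end=str(end), label=label)
        return ET.tostring(root, encoding="unicode")

    def resolve(text, standoff_xml):
        """Recover (label, span-text) pairs from the base text."""
        root = ET.fromstring(standoff_xml)
        return [(a.get("label"), text[int(a.get("start")):int(a.get("end"))])
                for a in root.findall("ann")]
    ```

    The advantage is exactly the modularity described earlier: tokenization,
    POS tags, and chunks can each live in their own layer, and any layer can
    be regenerated by a different tool without touching the base text.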

    I am not sure if any of what we've done is useful to others, but we
    are happy to share anything we have.



    This archive was generated by hypermail 2b29 : Wed Jun 01 2005 - 17:22:54 MET DST