Re: [Corpora-List] Query on the use of Google for corpus research

From: Dominic Widdows (widdows@maya.com)
Date: Mon May 30 2005 - 19:27:59 MET DST

    Hi Mark,

    Thanks for your response, it certainly sounds like a hopeful direction.

    > Actually, I don't think it's really true anymore that large-scale corpus
    > extraction from the Web necessarily puts you at the mercy of commercial
    > search engines. It's no longer very difficult to throw together a software
    > agent that will crawl the Web directly.

    But is it not quite difficult to "throw something together" that
    doesn't cause all sorts of traffic problems? I have always shied away
    from actually trying this, under the impression that it's a bit of a
    dangerous art, but then this is certainly partly due to ignorance.
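
    For concreteness, the two habits usually meant by "not causing traffic
    problems" are honouring robots.txt and spacing out requests to any one
    host. The following is only a minimal sketch of that bookkeeping, not a
    crawler: the class name, the "corpus-bot" agent string and the 5-second
    default delay are illustrative assumptions, and fetching is left out
    entirely.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse


class PoliteFetchPolicy:
    """Decides whether and when a URL may be fetched (no network access here)."""

    def __init__(self, user_agent, default_delay=5.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed seconds between hits on one host
        self.clock = clock                  # injectable so the logic can be tested
        self.robots = {}                    # host -> RobotFileParser
        self.last_hit = {}                  # host -> time of the most recent fetch

    def load_robots(self, host, robots_txt):
        """Register the text of a host's robots.txt (fetched elsewhere)."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True if the host's robots.txt permits this user agent to fetch the URL."""
        parser = self.robots.get(urlparse(url).netloc)
        # Unknown host: be conservative and refuse until robots.txt is loaded.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """How long to sleep before fetching this URL so the host is not hammered."""
        host = urlparse(url).netloc
        parser = self.robots.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        if delay is None:
            delay = self.default_delay
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, last + delay - self.clock())

    def record_fetch(self, url):
        """Note that the URL's host was just contacted."""
        self.last_hit[urlparse(url).netloc] = self.clock()
```

    A harvesting loop would call allowed() before each request, sleep for
    seconds_to_wait(), fetch, and then call record_fetch().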

    > (IOW: The indexing part of commercial search engines may be rocket
    > science, but the harvesting part of them is not.)

    That's intriguing: as someone who's worked more in indexing, I'd have
    said precisely the opposite :-)
    I'd be delighted to be wrong.

    Is there good reliable software out there, for those who would still be
    fearful of hacking up a harvester for themselves?
    There is the Internet Archive's Heritrix crawler
    (http://crawler.archive.org/). Has anyone used this and found it
    suitable for linguistic purposes?

    > I think that if you describe your harvesting procedure accurately (what
    > you seeded it with, and what filters you used if any), and monitor and
    > report on a variety of statistical parameters as your corpus is growing,
    > there's no reason why the resulting data wouldn't serve as an adequate
    > sample for many purposes -- assuming that's what you meant by "vouch for
    > them properly".

    Yes, that is part of what I meant. Do we have a good sense of what
    these statistical parameters should be? To what extent is there a code
    of practice for saying exactly what you did? Again, we run into
    standard empiricist questions - using your proposal, one could
    guarantee to reproduce the "initial conditions" of someone's
    experiment, but could at best expect similar outcomes.
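
    As one illustration of what "monitoring statistical parameters as the
    corpus grows" could mean, the sketch below reports running token count,
    type count and type/token ratio at fixed intervals. Which parameters
    matter is exactly the open question here; the choice of these three,
    the crude \w+ tokeniser and the function name are assumptions made for
    the example only.

```python
import re

# Deliberately crude tokeniser, assumed for illustration only.
TOKEN_RE = re.compile(r"\w+")


def growth_report(documents, every=1000):
    """Yield (tokens, types, type_token_ratio) each time `every` more tokens arrive."""
    vocabulary = set()
    tokens_seen = 0
    next_checkpoint = every
    for text in documents:
        for token in TOKEN_RE.findall(text.lower()):
            tokens_seen += 1
            vocabulary.add(token)
            if tokens_seen == next_checkpoint:
                yield tokens_seen, len(vocabulary), len(vocabulary) / tokens_seen
                next_checkpoint += every
```

    Publishing such a table alongside the seed list and filters would let
    someone re-running the harvest check whether their corpus is growing in
    a comparable way, even though the pages themselves will differ.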

    This still leaves some of the traditional benefits of corpora
    unaccounted for - what about normalising the text content (presuming
    the traditional notion that text content is the linguistic phenomenon
    you're interested in), tagging, perhaps getting all the data into the
    same character set, etc.? These are some of the creature comforts that
    organizations such as the LDC have traditionally provided. We can
    provide adequate descriptions of what was done with the data, and I
    feel that we are even pretty good as a community at making the software
    we developed available to others (partly for selfish gene and "please
    cite my project!" reasons, but those motivations still benefit the
    community at large).
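
    To make "getting all the data into the same character set" concrete,
    here is a minimal sketch of one possible normalisation step. It assumes
    each page's encoding is already known (in practice it has to be taken
    from HTTP headers or guessed), and the choice of UTF-8, NFC and
    whitespace collapsing is an assumption of the example, not an
    established convention.

```python
import unicodedata


def to_normalised_utf8(raw, declared_encoding):
    """Decode bytes using the declared encoding, apply NFC, re-encode as UTF-8."""
    # Undecodable bytes become U+FFFD rather than aborting the whole page.
    text = raw.decode(declared_encoding, errors="replace")
    # NFC makes composed and decomposed spellings of a character identical.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace so line-wrapping differences do not matter.
    text = " ".join(text.split())
    return text.encode("utf-8")
```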

    However, there is still the problem that the more sophisticated stuff
    you throw at your data, the harder it is for anyone to replicate or
    extend your results, and ideally, I would like to see a system where
    the data itself is made available as a standard part of practice.
    Ideally, we would still work on the same datasets if possible, rather
    than duplicating similar datasets for each isolated project. From an
    engineering point of view, storage isn't really a problem here, but
    bandwidth is - you have to keep the files you've trawled and processed
    on disk somewhere, but you might not be able to foot the bill for other
    researchers hitting your web server every time they fancy
    half-a-billion words of nice corpus data. To my mind, the only real
    solution to this part of the problem is going to be breaking your
    corpus up into smaller components and enabling other researchers to
    search and copy whichever parts they need in a peer-to-peer fashion. I
    gave a talk on this idea recently at the AAACL conference
    (http://infomap.stanford.edu/papers/distributed-corpora.pdf), but I
    guess this is another story really.
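
    One way to picture "breaking the corpus up into smaller components" is a
    manifest that lists each chunk with a checksum, so that a chunk copied
    from any peer can be verified against the original. The sketch below is
    a generic illustration under that assumption; it is not the scheme
    described in the talk, and the function names and fixed-size chunking
    are hypothetical.

```python
import hashlib


def build_manifest(data, chunk_size):
    """Split bytes into fixed-size chunks and describe each by offset and SHA-256."""
    manifest = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        manifest.append({
            "offset": offset,
            "length": len(chunk),
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
    return manifest


def verify_chunk(chunk, entry):
    """Check a chunk received from any peer against its manifest entry."""
    return (len(chunk) == entry["length"]
            and hashlib.sha256(chunk).hexdigest() == entry["sha256"])
```

    Only the small manifest then needs to come from the original server;
    the bulk data can be fetched from whoever already holds a copy.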

    Best wishes,
    Dominic


