Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Mon May 30 2005 - 23:45:04 MET DST


    Tom Emerson said:
    > Mark P. Line writes:
    >> There's a protocol for robotic web crawlers that you should honor,
    >> whereby websites can specify how they wish such crawlers to behave when
    >> their site is encountered during a crawl. Other than that, I wouldn't
    >> worry too much about traffic caused by your harvesting. Kids build web
    >> mining applications in Java 101 these days. Heck, they're probably doing
    >> it in high school. *shrug*
    >
    > This is, with all due respect, a very naive thing to say. If every
    > research group decided to unleash impolite crawlers on the world's
    > websites I can guarantee that you will get a lot of hostile email very
    > quickly from the web masters.

    I hope that nobody unleashes impolite crawlers anywhere. That's why I
    noted that there is a protocol (the robots.txt exclusion standard) that
    should be honored, so that crawlers behave the way the webmasters wish.
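
    For concreteness, here is a minimal sketch of what honoring that protocol
    looks like before each fetch. It's Python using only the standard
    library's urllib.robotparser; the agent name is made up for illustration,
    and the per-host cache is just there so each site's robots.txt is read
    only once.

        import urllib.robotparser
        from urllib.parse import urlparse

        USER_AGENT = "corpus-harvester-example"  # hypothetical agent name
        _robots_cache = {}                       # one parser per host

        def allowed_by_robots(url):
            """Return True if the host's robots.txt permits fetching url."""
            host = "{0.scheme}://{0.netloc}".format(urlparse(url))
            rp = _robots_cache.get(host)
            if rp is None:
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url(host + "/robots.txt")
                rp.read()                        # download and parse robots.txt
                _robots_cache[host] = rp
            return rp.can_fetch(USER_AGENT, url)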

    > Writing a useful crawler is a lot more difficult than you let on,
    > especially if you plan on crawling a non-trivial number of sites.

    Obviously, that depends on what makes a crawler "useful" for one's
    purposes. I was talking solely about harvesting sample material for
    corpus-based linguistic research, not about other applications for which
    much more sophisticated traversal of the Web may indeed be necessary or
    desirable.

    >> My take is that indexing can usefully be as (linguistically or
    >> otherwise) sophisticated as anybody cares and has the money to make it
    >> (once you've actually captured the text), whereas harvesting tends to
    >> gain little from anything but the most rudimentary filtering.
    >
    > This is also rather naive. Let's say you start a crawl with 2300 seed
    > URLs.

    Have you ever harvested a linguistic research corpus from the Web starting
    with that many seed URLs? Why?

    > How deep into a site do you go?

    What linguistic questions am I looking to answer with my corpus? Is it
    better if I get less text from more sites or more text from fewer sites?
    How many seeds did I really start with? Am I following off-site links?

    > How do you deal with spider traps?

    Why would spider traps be a concern (apart from knowing to give up on the
    site if my IP address has been blocked by their spider trap) when all I'm
    doing is constructing a sample of text data from the Web?

    > Do you follow links outside of the seed's site?

    Probably. What linguistic questions am I looking to answer with my corpus?
    Where did I get my seed URLs?

    > How do you prevent yourself from crawling the same content more than
    > once?

    Maybe I would keep a list of the pages I'd already seen, and check the
    list before I requested a page. :)

    (That might not be a scalable solution for all purposes, but it works fine
    at the scale of corpus harvesting.)
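
    At harvesting scale, that really is about as simple as it sounds. A
    sketch, again in standard-library Python; stripping the URL fragment is
    just one illustrative normalization, not a prescription:

        from urllib.parse import urldefrag

        seen = set()   # URLs already requested; fits in memory at corpus scale

        def should_fetch(url):
            """Return True the first time a (fragment-stripped) URL is seen."""
            url, _fragment = urldefrag(url)   # http://x/page#sec -> http://x/page
            if url in seen:
                return False
            seen.add(url)
            return True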

    > Or what if you want to recrawl certain sites with some regularity?

    Why would I want to do that when my task is to construct a research
    corpus? Even if I did, it's not exactly rocket surgery. :)

    > What about sites that require login or cookies?

    Why would I worry about those sites when I'm just looking to put together
    some sample texts for linguistic research?

    > How do you schedule the URLs to be crawled?

    Why would I schedule them if all I'm doing is harvesting corpus texts?

    > How do you store the millions of documents that you download?

    _Storing_ the volumes of data that are typical and adequate for corpus
    linguistics is no more difficult when the data comes from the Web than
    when it comes from anywhere else. It's _getting_ the data that is
    different.

    In any event, I'm not sure I've ever heard of a linguistic corpus with
    millions of documents. If you have one, can I get it on DVDs?

    > In any event, I expect that the people behind Heritrix or UbiCrawler
    > or any of the other scalable, high-performance crawlers will disagree
    > with your glib dismissal of their area of expertise.

    I don't believe I dismissed anybody's expertise, glibly or otherwise -- if
    I stepped on somebody's toes unintentionally, then I apologize.

    I do know for a fact, however, that corpus linguists do not need scalable,
    high-performance crawlers in order to construct very useful research
    corpora from the Web.
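
    To put that concretely: reusing the two helpers sketched above, the whole
    harvesting loop fits in a few dozen lines of standard-library Python. The
    extract_links() call below is a hypothetical stand-in for whatever HTML
    link extraction one prefers, and the one-second delay and page cap are
    arbitrary illustrative numbers.

        import time
        import urllib.request
        from collections import deque

        def harvest(seed_urls, max_pages=10000, delay=1.0):
            """Politely collect raw HTML breadth-first from a few seed URLs."""
            queue = deque(seed_urls)
            pages = []
            while queue and len(pages) < max_pages:
                url = queue.popleft()
                if not should_fetch(url) or not allowed_by_robots(url):
                    continue
                try:
                    with urllib.request.urlopen(url, timeout=30) as resp:
                        if "text/html" not in resp.headers.get("Content-Type", ""):
                            continue
                        html = resp.read().decode("utf-8", errors="replace")
                except OSError:
                    continue              # dead link, timeout, refusal: skip it
                pages.append((url, html))
                queue.extend(extract_links(url, html))  # hypothetical helper
                time.sleep(delay)         # one request per second is polite enough
            return pages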

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


