Re: [Corpora-List] Query on the use of Google for corpus research

From: Tom Emerson (tree@basistech.com)
Date: Tue May 31 2005 - 14:39:45 MET DST


    Mark P. Line writes:
    > I hope that nobody unleashes impolite crawlers anywhere. That's why I
    > noted that there is a protocol that should be honored, so that crawlers
    > behave the way the webmasters wish.

    Following the robots exclusion protocol is only part of the issue,
    though. You also have to make sure you don't hit a site with tens (or
    hundreds) of requests per second, to give just one example.
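    For illustration, here is a minimal per-host politeness sketch in
    Python (standard library only; the one-second delay and the user-agent
    string are made-up example values, not recommendations):

        import time
        import urllib.robotparser
        from urllib.parse import urlparse

        last_hit = {}      # host -> time of the last request we sent it
        robots = {}        # host -> parsed robots.txt for that host
        MIN_DELAY = 1.0    # seconds between requests to one host (example value)

        def polite_to_fetch(url, user_agent="ExampleCorpusBot"):
            host = urlparse(url).netloc
            # Honor the robots exclusion protocol first.
            if host not in robots:
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url("http://%s/robots.txt" % host)
                try:
                    rp.read()
                except OSError:
                    pass   # unreadable robots.txt treated as permissive here
                robots[host] = rp
            if not robots[host].can_fetch(user_agent, url):
                return False
            # Then make sure we are not hammering the host.
            wait = MIN_DELAY - (time.time() - last_hit.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.time()
            return True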

    > Obviously, that depends on what makes a crawler "useful" for one's
    > purposes. I was talking solely about the purpose of harvesting sample
    > material for corpus-based linguistic research, not for other purposes for
    > which much more sophisticated traversal of the Web may indeed be necessary
    > or desirable.

    You're right: it depends on one's needs. But the issues I raised are
    still problematic even on "simple" crawls.

    > Have you ever harvested a linguistic research corpus from the web starting
    > with that many seed URL's? Why?

    Yes. The work that we're doing in named entity extraction and
    automatic lexicon construction requires gigabytes of data, classified
    by language and in as many genres as possible. I have an ongoing crawl
    started from 2300+ seeds where I've so far collected 193 GB of raw
    HTML data, representing just under 9 million documents. The crawl has
    discovered some 21.7 million documents and continues to run.

    > What linguistic questions am I looking to answer with my corpus? Is it
    > better if I get less text from more sites or more text from fewer sites?
    > How many seeds did I really start with? Am I following off-site links?

    Exactly: and these questions mean that you need a highly configurable
    crawler that is scalable to thousands or tens of thousands of URLs.

    > Why would spider traps be a concern (apart from knowing to give up on the
    > site if my IP address has been blocked by their spider trap) when all I'm
    > doing is constructing a sample of text data from the Web?

    Marco Baroni answered this in his reply.

    > Maybe I would keep a list of the pages I'd already seen, and check the
    > list before I requested a page. :)
    >
    > (That might not be a scalable solution for all purposes, but it works fine
    > at the scale of corpus harvesting.)

    And what scale is that? The space required to track tens or hundreds
    of millions of URLs is significant.
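    Some back-of-the-envelope arithmetic: 100 million URLs at an average
    of roughly 70 bytes each is about 7 GB of raw strings before any
    container overhead. Storing a truncated hash of each URL instead cuts
    the payload to 8 bytes per URL, at the cost of a vanishingly small
    false-positive rate. A sketch, Python standard library only:

        import hashlib

        seen = set()

        def already_seen(url):
            # 64-bit truncated digest instead of the full URL string.
            digest = hashlib.sha1(url.encode("utf-8")).digest()[:8]
            if digest in seen:
                return True
            seen.add(digest)
            return False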

    > > Or what if you want to recrawl certain sites with some regularity?
    >
    > Why would I want to do that when my task is to construct a research
    > corpus? Even if I did, it's not exactly rocket surgery. :)

    Because you may be building a diachronic corpus, where you
    deliberately revisit the same sites over time to capture how their
    content changes.

    > > What about sites that require login or cookies?
    >
    > Why would I worry about those sites when I'm just looking to put together
    > some sample texts for linguistic research?

    Because you may be building your corpus from sites that require
    registration (think the New York Times, assuming you ignore their
    robots.txt).
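    A minimal sketch of carrying a session cookie across requests with the
    Python standard library (the login URL and form fields here are
    placeholders, not any real site's interface):

        import http.cookiejar
        import urllib.parse
        import urllib.request

        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(jar))

        form = urllib.parse.urlencode(
            {"user": "me@example.org", "password": "secret"}).encode("ascii")

        # Posting the login form stores the session cookie in the jar ...
        opener.open("https://www.example.com/login", data=form)
        # ... and later fetches through the same opener send it back.
        page = opener.open("https://www.example.com/archive/1.html").read()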

    > > How do you schedule the URLs to be crawled?
    >
    > Why would I schedule them if all I'm doing is harvesting corpus texts?

    Because starting with your seeds you will discover many more URLs than
    you can crawl at any one time. Let's say you start with 100 seed
    URLs. After crawling these you get five hundred new URLs that you may
    want to crawl. How do you determine which of these to crawl and in
    which order?
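    One possible (and deliberately simple) scheduling policy is to prefer
    hosts you have taken the fewest pages from so far, so that no single
    site dominates the sample. A sketch, Python standard library only;
    real crawlers also weigh in politeness, depth, content scores, and
    more:

        import heapq
        from urllib.parse import urlparse

        frontier = []        # heap of (priority, sequence, url)
        pages_per_host = {}  # pages already taken from each host
        counter = 0

        def enqueue(url):
            global counter
            host = urlparse(url).netloc
            counter += 1
            # Lower value = crawled sooner; favor under-represented hosts.
            heapq.heappush(frontier,
                           (pages_per_host.get(host, 0), counter, url))

        def next_url():
            _, _, url = heapq.heappop(frontier)
            host = urlparse(url).netloc
            pages_per_host[host] = pages_per_host.get(host, 0) + 1
            return url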

    Oh, and don't forget that you need to filter the content so that you
    don't download the latest batch of Linux ISOs because some idiot
    webmaster gave them a MIME type of text/plain. Or, perhaps more
    realistically, so that you don't download PDF or Word files (unless
    you want to deal with those). And filtering on file-name regexps
    (e.g., "\.html?") does not always work, since many sites that may be
    of interest (think message boards) generate their content from CGI
    scripts, so the URLs have no suffix at all.
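    So in practice you check the Content-Type the server reports rather
    than the URL suffix. A sketch (Python standard library; the HEAD
    request and the list of accepted types are illustrative choices):

        import urllib.request

        KEEP = ("text/html", "application/xhtml+xml", "text/plain")

        def looks_like_text(url):
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req) as resp:
                ctype = resp.headers.get_content_type()
            # A misconfigured server can still lie (text/plain for an ISO),
            # so a real pipeline would also sniff the first few KB of body.
            return ctype in KEEP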

    > _Storing_ the volumes of data that is typical and adequate for corpus
    > linguistics would not be any more difficult when the data is coming from
    > the Web than when it is coming from anywhere else. It's _getting_ the data
    > that is different.

    Except we're talking about millions of small files. Few file systems
    handle this well, on any OS.
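    The usual workaround is to append pages to a handful of large
    container files and keep a separate index, rather than writing one
    file per document. A toy version of that idea (the field layout is
    made up for illustration; real crawls tend to use a proper archive
    container format such as the Internet Archive's ARC):

        import json

        def append_page(archive, index, url, body):
            offset = archive.tell()
            archive.write(body)
            index.write(json.dumps({"url": url, "offset": offset,
                                    "length": len(body)}) + "\n")

        with open("crawl-000.dat", "ab") as archive, \
             open("crawl-000.idx", "a") as index:
            append_page(archive, index, "http://www.example.com/",
                        b"<html>...</html>")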

    > In any event, I'm not sure I've ever heard of a linguistic corpus with
    > millions of documents. If you have one, can I get it on DVD's?

    Well, no, because of the IP issues.

    > I do know for a fact, however, that corpus linguists do not need scalable,
    > high-performance crawlers in order to construct very useful research
    > corpora from the Web.

    Amen. But there is more than just the crawler. Post-processing the
    data is very resource intensive. I mentioned earlier that I have 193
    GB of raw HTML in a crawl I'm doing now. From what I've seen in the
    past, I expect that 35-40% of that space will simply disappear when I
    remove the markup from the documents. Indeed, perhaps much more. Then
    you toss out duplicate and near-duplicate documents, and the amount
    goes down even more. Then you toss out content in languages you don't
    care about, error pages, and other chaff, and the remaining wheat will
    be a lot less. If I end up with 10 GB of usable data, I'll be happy.
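    For the curious, the first two of those steps look roughly like this
    (Python standard library only; near-duplicate detection via shingling
    or similar is deliberately left out to keep the sketch short):

        import hashlib
        from html.parser import HTMLParser

        class TextOnly(HTMLParser):
            def __init__(self):
                super().__init__()
                self.chunks = []
            def handle_data(self, data):
                self.chunks.append(data)

        def strip_markup(html):
            p = TextOnly()
            p.feed(html)
            return " ".join(" ".join(p.chunks).split())

        seen_texts = set()

        def keep_document(html):
            text = strip_markup(html)
            fingerprint = hashlib.sha1(text.encode("utf-8")).digest()
            if not text or fingerprint in seen_texts:
                return None   # empty page or exact duplicate
            seen_texts.add(fingerprint)
            return text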

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    