Re: [Corpora-List] web-corpora, big and small

From: Mark P. Line (mark@polymathix.com)
Date: Thu Jun 02 2005 - 02:39:55 MET DST

    Marco Baroni said:
    >
    > I'll tell you more: there is a link to it from the faqs of heritrix,
    > probably the most popular publicly available crawler. We do this so that
    > the simple-minded crawlers written by naive Java developers are doomed.
    > ;-)

    Isn't it amazing what a difference one little smiley can make. Just
    imagine if you'd forgotten to add it: people might think you were
    suggesting that somebody here is a naive Java developer who writes
    simple-minded crawlers. Fortunately, given the imperviousness of the
    smiley hedge, nobody could possibly think you were suggesting that. Thanks
    to the adamantine shield of the smiley, everybody will think you are
    making a good-humored and well-intentioned joke.

    > Well, for example I would think that corpus-based ontology building,
    > lexicon extraction and named entity recognition qualify as legit
    > activities for corpus linguists, whereas I gather from your replies to
    > Tom Emerson that you are very confident that a corpus linguist could not
    > possibly be interested in that.

    Why would you gather that (other than the fact that you started reading at
    mid-thread)? I've said some things about the typical needs of corpus
    linguists and about the particular problem domain being discussed in this
    thread. How do you get from that to an assertion that no corpus linguist
    could possibly be interested in anything else? I've written quite the
    opposite more than once in this thread, saying that there's a break-even
    point where home-grown tools will have to make way for high-performance
    off-the-shelf tools. What's the need for hyperbole here?

    >> By what procedure did you arrive at 1 billion words as your required
    >> sample size? Why not 500 million or 5 billion?
    >
    > 1 billion words is an arbitrary starting point -- chosen to be as big as
    > the largest existing corpora we are aware of.

    So, you give higher priority to being able to show that yours is as big as
    anybody else's than to efficient allocation of your time and money?

    It's not about how big it is, it's all about how you use it.

    > I certainly do not hesitate to ask specific questions to this or other,
    > sometimes more appropriate lists (such as the heritrix crawler list), and
    > I'm glad that corpus linguists and crawlers are such friendly and helpful
    > communities. My point was simply that retrieving large-ish corpora from
    > the web (at least if you want them to be composed of non-duplicate,
    > natural, connected text) is not a trivial task, as I (mis?)understood you
    > were implying.

    Aren't you mincing words here? Okay, I can play.

    I have said that it's not difficult to build software to do this, and it's
    not. If you disagree and want to debate the point, then you should try to
    show why it's difficult. (Showing that it's difficult for _you_ is not
    enough: you should show that it's difficult in principle.) Stating that
    the task is not trivial does not rebut the claim that it's not difficult,
    because lots of tasks (including most everyday software development tasks)
    are neither trivial nor difficult.

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


