Re: [Corpora-List] license question

From: Dominic Widdows (widdows@maya.com)
Date: Fri Aug 18 2006 - 23:10:00 MET DST


    As you say, these are workarounds, and I don't think they answer the
    substance of John's objections. The Internet Archive might not have
    the data, and Google's cache makes no long-term commitment to keep
    dated corpora available for public use. (They might do so one day -
    their publication of the 5-gram corpus is, I hope, a big step in
    the right direction.) Once you've copied the material to your own
    website, you are effectively back to copying the whole corpus
    rather than using references - and then there are no guarantees
    that your copy will stay synchronized with the original.

    At the risk of beating a drum, I believe that we have prototyped the
    long-term solution to these problems at MAYA Design with an
    extensible peer-to-peer database. The idea of using this technology
    for language corpora is described in our LREC paper at
    http://www.maya.com/local/widdows/papers/lrec-distributed-corpora.pdf

    Represent individual texts as objects in a peer-to-peer network, and
    larger corpora using collections of universal references to these
    texts. But don't use location-dependent URLs, because they're
    brittle, and they place the hosting costs on the worthy individuals
    who put the effort into gathering the corpus in the first place.
    Instead, use location-independent universal identifiers (as the
    Free Software Foundation has done for years; such identifiers are
    now part of the official URN namespace), and encourage replication
    at the point of
    use. Use digital signatures to make sure that the data hasn't
    changed. If the publishing organization wants to go the whole way and
    make sure that the contents can never change, incorporate part of the
    digital signature into the identifier of each object. You can also
    use this as the core data for sharing standoff annotation,
    collaborative filtering, etc.
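
    To make the mechanics concrete, here is a minimal sketch in Python
    of how a text might be wrapped as an object with a
    location-independent, content-derived identifier, and how a
    replica fetched from any peer could be verified against that
    identifier. It uses a plain SHA-256 content digest rather than a
    full public-key signature, and the "urn:x-corpus" namespace and
    field names are illustrative only - not the scheme from the LREC
    paper or the MAYA implementation.

        import hashlib
        import json
        import uuid

        def make_text_object(text, metadata):
            # Wrap a corpus text in a self-describing object whose
            # identifier is location-independent and embeds a prefix of
            # the content digest, so any replica can be checked against
            # its own name. (The "urn:x-corpus" namespace and the field
            # names are illustrative assumptions, not the LREC scheme.)
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            identifier = "urn:x-corpus:%s-%s" % (uuid.uuid4(), digest[:16])
            return {
                "id": identifier,
                "digest": "sha256:" + digest,
                "metadata": metadata,
                "text": text,
            }

        def verify_text_object(obj):
            # Recompute the digest of a replicated object and check it
            # against both the stored digest and the tail of the
            # identifier, i.e. confirm that no peer has silently
            # changed the text.
            digest = hashlib.sha256(obj["text"].encode("utf-8")).hexdigest()
            return (obj["digest"] == "sha256:" + digest
                    and obj["id"].endswith(digest[:16]))

        if __name__ == "__main__":
            obj = make_text_object("Colorless green ideas sleep furiously.",
                                   {"source": "example", "date": "2006-08-18"})
            print(json.dumps(obj, indent=2))
            print("verified:", verify_text_object(obj))

    A corpus is then just a collection of such identifiers rather than
    a list of URLs, and standoff annotation can refer to them safely,
    because the content behind an identifier cannot drift.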

    And then you're more or less done - provided you can solve the peer-
    to-peer routing problem, and make sure that the economics of the
    system works well enough to encourage individuals and organizations
    to take part. These aren't trivial problems, of course - but in
    the long run the reliability will surely be better than that of
    bare URLs, and the economics will surely be more encouraging than
    "make the provider pay the bandwidth cost."

    Best wishes,
    Dominic

    On Aug 18, 2006, at 4:29 PM, Steven Bird wrote:

    > There's a couple of workarounds:
    >
    > Use an archive:
    > a) try to find all the URLs in the Internet Archive or Google's cache
    > b) submit missing URLs to such repositories (I think this can even be
    > done for Google's cache, by setting a very large expiry time.)
    >
    > Create an archive:
    > a) "mirror" a superset of the material on your own public website
    > b) publish URLs local to this site
    >
    > On 8/19/06, John F. Sowa <sowa@bestweb.net> wrote:
    >
    >> There is a serious problem with that approach:
    >>
    >> SS> This is why I advocate the procedure of distributing an
    >> > Internet-derived corpus as a list of URLs.
    >>
    >> Unfortunately, URLs are subject to two limitations:
    >>
    >> 1. They become "broken" whenever the web site or the
    >> directory structure is changed.
    >>
    >> 2. Even when the URL is live, the content can be updated
    >> and changed at any time.
    >>
    >> These two points make a collection of URLs a highly unstable
    >> way to assemble or distribute a corpus. They make it impossible
    >> for any analysis performed at one instant of time to be compared
    >> with any analysis performed at another time.
    >>


