Re: [Corpora-List] license question

From: P Resnik (psresnik@gmail.com)
Date: Fri Aug 18 2006 - 20:59:37 MET DST

    > Unfortunately, URLs are subject to two limitations:
    >
    > 1. They become "broken" whenever the web site or the
    > directory structure is changed.
    >
    > 2. Even when the URL is live, the content can be updated
    > and changed at any time.
    >
    > These two points make a collection of URLs a highly unstable
    > way to assemble or distribute a corpus. They make it impossible
    > for any analysis performed at one instant of time to be compared
    > with any analysis performed at another time.

    One potential solution to these problems is to distribute URLs that
    point to snapshots in the Internet Archive's Wayback Machine
    (www.archive.org). If the pages of interest are present in the
    archive, locating a snapshot and confirming that its content is the
    same as your stored page should be relatively straightforward. The
    Internet Archive is not always the most reliable option, since pages
    are sometimes unavailable or may never have been included in its
    snapshots in the first place, but in my experience it's not too
    terrible.
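
    As a rough illustration of the "locate a snapshot and confirm the
    content" step, here is a minimal Python sketch. It is not the
    procedure used for the corpus mentioned below. The function names are
    hypothetical, the URL layout (web.archive.org/web/<timestamp>/<url>)
    is an assumption based on the archive's current public scheme, and
    fetching the snapshot and extracting its text are left to the caller.

```python
import hashlib
import re

# Assumed prefix for Wayback Machine snapshot URLs (current public scheme).
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a snapshot URL from an original URL and a YYYYMMDDhhmmss timestamp.

    Shorter numeric prefixes (e.g. "200307") are accepted as well.
    """
    if not re.fullmatch(r"\d{4,14}", timestamp):
        raise ValueError(f"timestamp must be 4 to 14 digits, got {timestamp!r}")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def content_fingerprint(text: str) -> str:
    """Hash page text after collapsing runs of whitespace.

    Whitespace is normalized so that trivial formatting differences
    between the stored copy and the snapshot do not cause a mismatch.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def same_content(stored_text: str, snapshot_text: str) -> bool:
    """Return True if the stored page and the snapshot have matching text."""
    return content_fingerprint(stored_text) == content_fingerprint(snapshot_text)
```

    Comparing extracted text rather than raw HTML is a deliberate choice
    here, since an archived copy may differ from the original in its
    markup (rewritten links, for instance) even when the text itself is
    unchanged.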

    I adopted this solution because it's a lot safer than just ignoring
    copyright issues and distributing the pages, a lot easier than hunting
    down copyright permissions for a zillion Web pages, and generally
    better than using original URLs for the reasons noted above. For an
    example, take a look at the July 2003 Chinese-English corpus I made
    available in this way, at http://umiacs.umd.edu/~resnik/strand/.

      Philip


