There are a couple of workarounds:

Use an existing archive:
a) try to find all the URLs in the Internet Archive or Google's cache
   (see the sketch after this list)
b) submit missing URLs to such repositories (I believe this can even be
   done for Google's cache, by setting a very long expiry time)

Create your own archive:
a) "mirror" a superset of the material on your own public website
b) publish URLs local to that site
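
As a minimal sketch of option 1a, the snippet below looks each URL up
against the Internet Archive's public Wayback "availability" API and
prints the closest snapshot, if any. The endpoint, JSON shape, and the
input file name corpus_urls.txt are assumptions for illustration, not
anything specified in this thread.

    # Sketch only: check each URL in a corpus list against the
    # Wayback Machine availability API and print the closest snapshot.
    import json
    import urllib.parse
    import urllib.request

    WAYBACK_API = "https://archive.org/wayback/available"

    def closest_snapshot(url, timestamp="20060819"):
        """Return the closest archived snapshot URL for `url`, or None."""
        query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
        with urllib.request.urlopen(f"{WAYBACK_API}?{query}") as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    if __name__ == "__main__":
        # corpus_urls.txt is a hypothetical file: one URL per line.
        with open("corpus_urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    print(url, "->", closest_snapshot(url) or "not archived")

URLs that come back "not archived" would be the ones to submit to the
archive (option 1b) or to mirror yourself (option 2).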
On 8/19/06, John F. Sowa <sowa@bestweb.net> wrote:
> There is a serious problem with that approach:
>
> SS> This is why I advocate the procedure of distributing an
> > Internet-derived corpus as a list of URLs.
>
> Unfortunately, URLs are subject to two limitations:
>
> 1. They become "broken" whenever the web site or the
> directory structure is changed.
>
> 2. Even when the URL is live, the content can be updated
> and changed at any time.
>
> These two points make a collection of URLs a highly unstable
> way to assemble or distribute a corpus. They make it impossible
> for any analysis performed at one instant of time to be compared
> with any analysis performed at another time.
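
One mitigation the thread does not propose, sketched here only as an
illustration of how the two limitations could at least be detected: when
the URL list is assembled, record a SHA-256 digest of each page's bytes,
so a later consumer can re-fetch and compare. The manifest file name and
format below are made up for the example.

    # Sketch only: record a content digest per URL at assembly time so
    # later analyses can tell whether a URL broke or its content changed.
    import hashlib
    import urllib.request

    def fetch_digest(url):
        """Return the SHA-256 hex digest of the bytes behind `url`."""
        with urllib.request.urlopen(url) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    def write_manifest(urls, path="corpus_manifest.tsv"):
        # One tab-separated line per URL: URL <TAB> digest (or ERROR).
        with open(path, "w") as out:
            for url in urls:
                try:
                    digest = fetch_digest(url)
                except OSError:
                    digest = "ERROR"  # broken URL (limitation 1)
                out.write(f"{url}\t{digest}\n")

A mismatched digest on a later fetch flags limitation 2 (content changed);
it does not fix the instability, which is why archiving or mirroring the
material, as above, is still the more robust route.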