Re: Corpora: WWW-based corpus and ethics

Philip Resnik (resnik@umiacs.umd.edu)
Sat, 17 Apr 1999 22:09:35 -0400 (EDT)

> I'm building a specialised corpus of instructional materials on the WWW
> written by academics, to use in a monitor corpus for a project I'm
> doing.
>
> 1. Is it legal to do this without their consent?

Technically speaking, it is a violation of copyright to download and
then redistribute material found on the Web. (In fact, you might be
surprised to learn that technically I've violated at least American
copyright law by including the above five lines, introduced by '>',
in this reply. At least, so I have been informed by a legal person at
this university, not that anyone would enforce such nonsense.)

> What if they assert copyright -- am I still allowed to download the
> whole lot, even if only for non-profit research purposes?

I believe "fair use" will allow you to download a single copy for
research purposes, though I could be wrong.

> 2. Apart from 'Is it legal?', I want to ask, 'Is it ethical?' to do
> this?

Given that the number of authors you're talking about is probably
relatively small (at most dozens?), I believe it would be appropriate
for you to at least e-mail them a brief note saying what you're doing
and giving them an opportunity to tell you how they feel about it. Of
course, I also think you need to document your sources carefully
(don't forget to record the download date!). As someone who posts
course materials on the WWW, I don't imagine I'd be bothered by your
downloading them for your own research use, but I do think I might be
unhappy to have the materials widely disseminated without my
permission.
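
To make the record-keeping concrete, here is a minimal sketch of what
one entry per downloaded page might look like. The field names, the
helper make_source_record, and the idea of storing a checksum are all
illustrative assumptions, not an established scheme; adapt them to
whatever your project actually needs.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SourceRecord:
    """Provenance for one downloaded page (hypothetical layout)."""
    url: str               # where the page was fetched from
    downloaded_at: str     # ISO 8601 timestamp of the download
    sha256: str            # checksum of the bytes as downloaded
    author_contact: str    # address used to notify the author
    author_notified: bool  # whether the note has been sent yet


def make_source_record(
    url: str,
    content: bytes,
    author_contact: str,
    author_notified: bool = False,
    when: Optional[datetime] = None,
) -> SourceRecord:
    """Build a record for content that has already been downloaded."""
    # Default to the current time in UTC if no download time is given.
    when = when or datetime.now(timezone.utc)
    return SourceRecord(
        url=url,
        downloaded_at=when.isoformat(),
        sha256=hashlib.sha256(content).hexdigest(),
        author_contact=author_contact,
        author_notified=author_notified,
    )
```

Keeping the checksum alongside the date lets you tell later whether a
page has changed since you fetched it.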

On a related note, I'm doing work that involves building a collection
(based on previous discussions on this list I avoid the word "corpus")
from the WWW, specifically pairs of Web pages in parallel translation,
and when the time comes to disseminate it I plan not to redistribute
the pages themselves, which is questionable in terms of copyright as
noted above, but rather to make available the database of URLs so
people can download pages themselves. If *that's* a violation of
copyright, then Yahoo, Altavista, and the rest are in hot water...
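
As a rough illustration of distributing pointers rather than pages, the
sketch below writes and reads a tab-separated list of URL pairs. The
column names and the functions write_url_pairs, read_url_pairs, and
pairs_to_text are hypothetical, chosen only for this example; nothing
here describes the actual format of my collection.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

UrlPair = Tuple[str, str]


def write_url_pairs(pairs: Iterable[UrlPair], out: TextIO) -> None:
    """Write one pair of URLs per line; no page content is stored."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["url_lang_a", "url_lang_b"])  # header row (assumed names)
    for url_a, url_b in pairs:
        writer.writerow([url_a, url_b])


def read_url_pairs(src: TextIO) -> List[UrlPair]:
    """Read the pairs back, skipping the header row."""
    reader = csv.reader(src, delimiter="\t")
    next(reader, None)  # discard the header if present
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def pairs_to_text(pairs: Iterable[UrlPair]) -> str:
    """Convenience wrapper returning the manifest as a string."""
    buf = io.StringIO()
    write_url_pairs(pairs, buf)
    return buf.getvalue()
```

Anyone who receives such a file fetches the pages themselves, so only
the URLs change hands.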

I'll be interested in what others have to say.

Best,

Philip

----------------------------------------------------------------
Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies

1401 Marie Mount Hall            UMIACS phone:      (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax:               (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu