Re: Corpora: Looking for Afrikaans, Swedish, Romanian and Icelandic

Philip Resnik (resnik@umiacs.umd.edu)
Tue, 11 Nov 1997 19:36:38 -0500 (EST)

> Back in August I put out a plea for help in getting the corpora
> mentioned in the subject line. Though I thank all of those who
> responded, I have so far been unsuccessful at getting what I need.
>
> I have no Afrikaans or Icelandic.

Someone may already have mentioned this in reply to your earlier query,
but the AltaVista web search engine (www.altavista.digital.com) allows
you to select the desired language of the documents to be retrieved, and
Icelandic, Romanian, and Swedish are among the languages you can select.
For example, selecting Icelandic and searching for "Web" returns 376
matching documents. This is not a huge number, but you can easily write
a robot or spider program to follow links further (e.g. see
<URL:http://www.w3.org/Robot/>) or do multiple manual searches of this
kind using keywords of interest.
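
As a rough illustration of the robot/spider idea, here is a minimal
sketch in Python using only the standard library. It assumes you already
have a list of seed URLs (say, copied by hand from a search results
page); the names crawl, fetch and LinkExtractor are made up for this
example and do not come from any existing tool. It follows links
breadth-first up to a page limit and does no language filtering at all.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (assumes UTF-8; undecodable bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, max_pages=100, fetch_page=fetch):
    """Breadth-first crawl from seed_urls; returns a dict {url: html}."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling crawl(["http://www.example.is/"], max_pages=50) would return up
to 50 pages reachable from that (hypothetical) seed. A robot meant for
real use should also honor robots.txt (Python's urllib.robotparser can
help), pause between requests, and check that each fetched page really
is in the target language before keeping it.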

This will produce a *collection* of documents, which is not, strictly
speaking, a *corpus*, at least according to some definitions. However,
since you specify your information need in number of characters, I
suspect the method I've just described might still be useful; certainly
if you have NO Icelandic this will provide an improvement over what
you've currently got! :-)

Philip

----------------------------------------------------------------
Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies

1401 Marie Mount Hall            UMIACS phone:      (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax:               (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail:            resnik@umiacs.umd.edu