Re: Corpora: Looking for Afrikaans, Swedish, Romanian and Icelandic

Dan Melamed (melamed@unagi.cis.upenn.edu)
Wed, 27 Aug 1997 19:16:44 -0400 (EDT)

>
> I am in need of corpora in Afrikaans, Swedish, Romanian and Icelandic. I do not need much data, approximately 100,000 characters of running text would be more than adequate. I could get by with as little as 20,000 characters, if that is all that is readily available.

> Can anyone point me to these corpora?

The main search widget on AltaVista now allows you to select the
language of the retrieved documents. Afrikaans is not on the menu,
but Romanian and Icelandic are. I selected Icelandic and entered a
term that's likely to appear in a lot of WWW documents, namely `www*`.
I got just under 14,000 hits. It shouldn't be that hard to write a
Java script to collect them. Or else, you might want to just browse
the pointers by hand to find a few sufficiently long documents.
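
For what it's worth, here is a rough sketch of what such a collection
script might look like, written in Python rather than Java. It assumes
you have already saved the hit URLs to a list by some means; it does not
query AltaVista itself. The function names, the UTF-8 decoding, and the
2,000-character minimum per document are all my own assumptions for
illustration, so adjust them to the pages you actually find.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects character data, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse whitespace into running text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def fetch(url):
    """Download one page. Assumption: UTF-8; undecodable bytes are replaced."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect(urls, target_chars=100_000, min_doc_chars=2_000, fetcher=fetch):
    """Gather sufficiently long documents until target_chars is reached."""
    texts, total = [], 0
    for url in urls:
        try:
            text = html_to_text(fetcher(url))
        except OSError:
            continue  # unreachable page: skip it and move on
        if len(text) < min_doc_chars:
            continue  # too short to be useful running text
        texts.append(text)
        total += len(text)
        if total >= target_chars:
            break
    return texts
```

Calling `collect(urls)` with the default settings stops once roughly
100,000 characters have been gathered; passing `target_chars=20_000`
gives the smaller sample mentioned in the request. Older pages may well
use an 8-bit encoding such as ISO-8859-1, in which case the decoding
step in `fetch` would need to change.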

Happy hunting.

I. Dan Melamed melamed@linc.cis.upenn.edu
University of Pennsylvania http://www.cis.upenn.edu/~melamed/