[Corpora-List] language-specific harvesting of texts from the Web

From: Mark P. Line (mark@polymathix.com)
Date: Mon Aug 30 2004 - 22:51:02 MET DST

I've been playing with Google searches for extracting texts in a
particular language from the Web without a lot of noise (i.e. with few
texts that aren't in the desired language). Any comments on the utility of
this approach for more serious corpus research? Any improvements to the
search criteria below, which are the best I've been able to come up with?
Any good search criteria for languages not listed?

(If there's any interest at all, I'd be happy to collect searches like
these on a webpage somewhere.)

Examples:

Basque:
http://www.google.com/search?q=gandik+gana&ie=utf-8&oe=utf-8

Bislama/Pijin:
http://www.google.com/search?q=blong+stap&ie=utf-8&oe=utf-8

Catalan:
http://www.google.com/search?q=els+uns+unes&ie=utf-8&oe=utf-8

Indonesian:
http://www.google.com/search?q=tidak+yang+karena&ie=utf-8&oe=utf-8

Letzebuergesch:
http://www.google.com/search?q=fir+eng+dat&ie=utf-8&oe=utf-8

Malay:
http://www.google.com/search?q=tidak+yang+kerana&ie=utf-8&oe=utf-8

Malay/Indonesian:
http://www.google.com/search?q=tidak+yang&ie=utf-8&oe=utf-8

Mongolian:
http://www.google.com/search?q=%D0%B1%D0%B0%D0%B9%D0%BD%D0%B0+&ie=utf-8&oe=utf-8

Nahuatl:
http://www.google.com/search?q=auh+inic&ie=utf-8&oe=utf-8

North Frisian:
http://www.google.com/search?q=%C3%BC%C3%BCb+m%C3%A4+uun&ie=utf-8&oe=utf-8
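For anyone who wants to generate searches like the ones above rather than type them by hand, here is a minimal sketch in Python. It only rebuilds the URL pattern visible in the examples (a `q` parameter holding the marker words, plus `ie=utf-8` and `oe=utf-8`). The names `search_url` and `MARKERS` are made up for illustration, and nothing here checks that the search engine still honours these parameters.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "http://www.google.com/search"

# Marker words copied from some of the example searches in this message.
# The dictionary name and layout are illustrative assumptions.
MARKERS = {
    "Basque": ["gandik", "gana"],
    "Catalan": ["els", "uns", "unes"],
    "Indonesian": ["tidak", "yang", "karena"],
    "Malay": ["tidak", "yang", "kerana"],
    "North Frisian": ["üüb", "mä", "uun"],
}


def search_url(words):
    """Build a search URL whose query is the given marker words.

    urlencode() percent-encodes non-ASCII characters as UTF-8 and
    turns the spaces between words into '+', matching the examples.
    """
    params = {"q": " ".join(words), "ie": "utf-8", "oe": "utf-8"}
    return GOOGLE_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    for language, words in MARKERS.items():
        print(f"{language}:")
        print(search_url(words))
        print()
```

Adding a language is then just a matter of adding one entry to the dictionary, which might make it easier to collect and compare candidate word lists.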