I've been experimenting with Google searches for extracting texts in a
particular language from the Web without a lot of noise (i.e. with few hits
that aren't in the desired language). Any comments on the utility of this
approach for more serious corpus research? Any improvements on the best
search criteria I've been able to come up with, listed below? Any good search
criteria for languages not listed?
(If there's any interest at all, I'd be happy to collect searches like
these on a webpage somewhere.)
Examples:
Basque:
http://www.google.com/search?q=gandik+gana&ie=utf-8&oe=utf-8
Bislama/Pijin:
http://www.google.com/search?q=blong+stap&ie=utf-8&oe=utf-8
Catalan:
http://www.google.com/search?q=els+uns+unes&ie=utf-8&oe=utf-8
Indonesian:
http://www.google.com/search?q=tidak+yang+karena&ie=utf-8&oe=utf-8
Letzebuergesch:
http://www.google.com/search?q=fir+eng+dat&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8
Malay:
http://www.google.com/search?q=tidak+yang+kerana&ie=utf-8&oe=utf-8
Malay/Indonesian:
http://www.google.com/search?q=tidak+yang&ie=utf-8&oe=utf-8
Mongolian:
http://www.google.com/search?q=%D0%B1%D0%B0%D0%B9%D0%BD%D0%B0+&ie=utf-8&oe=utf-8
Nahuatl:
http://www.google.com/search?q=auh+inic&ie=utf-8&oe=utf-8
North Frisian:
http://www.google.com/search?q=%C3%BC%C3%BCb+m%C3%A4+uun&ie=utf-8&oe=utf-8
Saami:
http://www.google.com/search?q=atte+son+ja+dat&ie=utf-8&oe=utf-8
Shona:
http://www.google.com/search?q=kusvika&ie=utf-8&oe=utf-8
Sorbian:
http://www.google.com/search?q=%C5%A1to%C5%BE&ie=utf-8&oe=utf-8
Tagalog:
http://www.google.com/search?q=%22ang+mga%22&ie=utf-8&oe=utf-8
Tok Pisin:
http://www.google.com/search?q=long+bilong&ie=utf-8&oe=utf-8
Welsh:
http://www.google.com/search?q=cymraeg+mae&ie=utf-8&oe=utf-8
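The URLs above all follow one pattern: a few common words of the target language joined with "+", plus ie=utf-8 and oe=utf-8. Here is a minimal Python sketch for generating such URLs from a word list. The function name search_url and its phrase flag are my own illustrative choices, not part of any Google interface, and the sketch only builds the URL; it does not fetch or filter results.

```python
from urllib.parse import urlencode


def search_url(words, phrase=False):
    """Build a search URL in the same shape as the examples in this post.

    words  -- list of frequent words of the target language
    phrase -- if True, wrap the words in double quotes (exact phrase),
              as in the Tagalog example
    (Both the name and the parameters are hypothetical, for illustration.)
    """
    query = " ".join(words)
    if phrase:
        query = '"' + query + '"'
    # urlencode turns spaces into "+" and percent-encodes non-ASCII
    # characters as UTF-8, which matches ie=utf-8 / oe=utf-8.
    params = {"q": query, "ie": "utf-8", "oe": "utf-8"}
    return "http://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(search_url(["tidak", "yang", "karena"]))   # Indonesian
    print(search_url(["ang", "mga"], phrase=True))   # Tagalog
    print(search_url(["štož"]))                      # Sorbian
```

Keeping the word lists as plain data would make it easy to try alternative criteria per language and compare how noisy the results are.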
-- Mark
Mark P. Line
Polymathix
San Antonio, TX