> But then again, why not go simply to UPenn and purchase some
> license for English Gigaword plus some additional tens of millions
> words corpora from LDC?
For example because I'm also interested in 1 billion words of Italian,
German and Japanese? Or because I think that the web can give us a more
varied picture of a language than a newswire corpus? But, more generally,
because I think that, with all the linguistic data available out there on
the web (probably orders of magnitude more than the whole LDC and
ELDA catalogues put together), it is a good idea to develop, gather and
share tools and procedures to get that data into "corpus format"...
Which, of course, does not mean that prefab corpora do not have their
function as well.
Regards,
Marco