The 2nd edition of the IPI PAN Corpus of Polish, developed
at the Institute of Computer Science of the Polish Academy
of Sciences (PAS), is available on the websites of:
- the Institute of Computer Science PAS:
http://korpus.pl/en/
- the Institute of Polish Language PAS:
http://corpus.ijp-pan.krakow.pl/en/
To the best of our knowledge, this is currently the largest
searchable morphosyntactically annotated corpus of Polish
available to the public.
The whole corpus consists of over 250 million segments
(about 200 million orthographic words). It is not balanced,
but a balanced sample of over 30 million segments is also
available. These corpora can be directly searched
at the above addresses (do read the query syntax cheatsheet
at http://korpus.pl/en/cheatsheet/index.html) or downloaded
in a binary form to be used with a standalone version of the
corpus search engine Poliqarp (announced separately on the
'corpora' list). Note that the standalone Poliqarp offers
much greater functionality than the web interface (e.g., it
shows metadata and presents more results).
Best regards,
Adam P.
--
Adam Przepiorkowski
http://nlp.ipipan.waw.pl/ ----- Linguistic Engineering Group
http://korpus.pl/ ------------- the IPI PAN Corpus of Polish