Corpora: Re: Case/number distribution

Tomaz Erjavec (Tomaz.Erjavec@ijs.si)
Wed, 2 Dec 1998 16:46:10 +0100

Sean Boisen writes:
> If you have references, unpublished findings, or even informed speculations
> about the distributional facts for Greek/Russian/whatever case language
> you've got, i'd appreciate hearing them.

In MULTEXT-East <http://nl.ijs.si/ME/> we've produced annotated
corpora for some case-rich languages (Slavic, Finno-Ugric). I did a
case count on the "whatever" (i.e. the Slovene '1984'), and here are
the numbers of PoS / Case combinations:

5750 Noun nominative
4309 Noun genitive
4113 Noun accusative
3303 Adjective nominative
2982 Pronoun accusative
2618 Preposition locative
2384 Noun locative
2275 Pronoun no-case
2209 Pronoun nominative
1828 Preposition accusative
1563 Preposition instrumental
1410 Noun instrumental
1321 Preposition genitive
1307 Adjective accusative
1200 Adjective genitive
1041 Pronoun genitive
991 Pronoun dative
689 Pronoun locative
684 Adjective instrumental
649 Adjective locative
469 Noun dative
392 Pronoun instrumental
364 Numeral accusative
314 Numeral nominative
221 Preposition dative
120 Numeral genitive
118 Preposition no-case
113 Numeral locative
113 Adjective dative
67 Numeral instrumental
64 Numeral no-case
3 Noun no-case
3 Numeral dative

I guess the 'Noun no-case' entries could be errors; I will have to have a look.
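
If per-case totals are more useful than the PoS / Case pairs, the list
above can be summed by case. What follows is only a minimal sketch in
Python: the names COUNTS and case_totals are made up for illustration
and are not part of any MULTEXT-East tooling, and the data is simply
the list above pasted in. The skip_pos option is there on the
assumption that the preposition counts reflect the case a preposition
governs, so one may want to leave them out.

```python
from collections import Counter

# PoS / Case counts copied verbatim from the list above (Slovene '1984').
COUNTS = """\
5750 Noun nominative
4309 Noun genitive
4113 Noun accusative
3303 Adjective nominative
2982 Pronoun accusative
2618 Preposition locative
2384 Noun locative
2275 Pronoun no-case
2209 Pronoun nominative
1828 Preposition accusative
1563 Preposition instrumental
1410 Noun instrumental
1321 Preposition genitive
1307 Adjective accusative
1200 Adjective genitive
1041 Pronoun genitive
991 Pronoun dative
689 Pronoun locative
684 Adjective instrumental
649 Adjective locative
469 Noun dative
392 Pronoun instrumental
364 Numeral accusative
314 Numeral nominative
221 Preposition dative
120 Numeral genitive
118 Preposition no-case
113 Numeral locative
113 Adjective dative
67 Numeral instrumental
64 Numeral no-case
3 Noun no-case
3 Numeral dative
"""


def case_totals(table, skip_pos=()):
    """Sum the counts per case, optionally skipping some parts of speech.

    Each non-empty line of `table` is expected to read: COUNT PoS CASE.
    """
    totals = Counter()
    for line in table.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        count, pos, case = line.split()
        if pos not in skip_pos:
            totals[case] += int(count)
    return totals


if __name__ == "__main__":
    # Hypothetical usage: leave prepositions out and print each case
    # with its share of the remaining tokens, most frequent first.
    totals = case_totals(COUNTS, skip_pos={"Preposition"})
    grand_total = sum(totals.values())
    for case, n in totals.most_common():
        print(f"{n:6d} {100 * n / grand_total:5.1f}% {case}")
```

Calling case_totals(COUNTS) with no second argument keeps all parts of
speech, prepositions included.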

Hope this helps,
Tomaz

PS: you can also search this corpus at
http://nl2.ijs.si/corpus/ with a query such as '[case:"genitive"]'.

-------------
Tomaz Erjavec | Dept. for Intelligent Systems E-8
email: tomaz.erjavec@ijs.si | Jozef Stefan Institute
www: http://nl.ijs.si/tomaz/ | Jamova 39
tel: (+386 61) 177-3-507 | SI-1000 Ljubljana
fax: (+386 61) 219-385 | Slovenia