RE: Corpora: Case/number distribution

Christopher A. Brewster (brewster@upatras.gr)
Wed, 2 Dec 1998 15:04:07 +0200

Next message: Yaari Yaakov: "Corpora: Brill's tagger for DOS"
Previous message: Magnar Brekke: "Re: Corpora: history of corpora"


It is interesting that you should mention Modern Greek. In working on the
Collins English-Greek/Greek-English dictionary, we used a corpus to
construct the Greek framework. Initially we thought that we would need a
morphological generation system (such as that used in spell checkers) to
have on-line access to all word forms for concordancing purposes. However,
the actual figures from the corpus showed that constructing this would be a
waste of time.

                 Theoretical              Word-forms             Actual number
                 form-function pairs                             present (approx.)

VERB             140 tense forms          44 tense types         23 tense types
e.g. 'dheno'      96 participle forms     44 participle types    13 participle types
                 total 236                total 88               total 36

ADJECTIVE
e.g. 'kalos'     24 forms                 24 forms               11 types
BUT 'isichos'    11 types                 11 types                7 types

NOUN
e.g. 'lagos'      8 forms                  7 types                7 types
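
To make the gap between possible and attested forms easier to see, here is a minimal Python sketch that turns the table above into attestation shares. The numbers are copied from the "Word-forms" and "Actual number present" columns; the names `paradigms` and `attested_share` are only illustrative and come from no existing tool.

```python
# Figures copied from the table: (distinct word-forms, forms attested in the corpus).
# The lemma labels are just dictionary keys for this illustration.
paradigms = {
    "dheno (verb)": (88, 36),
    "kalos (adjective)": (24, 11),
    "isichos (adjective)": (11, 7),
    "lagos (noun)": (7, 7),
}


def attested_share(possible, attested):
    """Fraction of a paradigm's distinct word-forms that actually occur."""
    return attested / possible


for lemma, (possible, attested) in paradigms.items():
    share = attested_share(possible, attested)
    print(f"{lemma:22s} {attested:2d}/{possible:2d} = {share:.0%}")
```

On these figures, fewer than half of the verb's distinct forms are attested, while the noun's small paradigm is fully covered.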

Overall, when the corpus contained 9.5 million words (it was subsequently
expanded), it included 220 000 types, of which about 40 000 were rubbish:
symbols, misprints, foreign words and other incomprehensible items. The
remaining 180 000 wordforms are very few compared to the roughly 900 000
wordforms which can be derived (in theory) from the 50 000 headwords of a
standard dictionary:

Total headwords: 50 000

              Headwords    Possible forms
NOUNS            36 000           252 000
ADJECTIVES        9 000           216 000
VERBS             5 000           440 000

TOTAL:           50 000           908 000
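
The arithmetic behind these totals can be checked with a short Python sketch. The headword and possible-form counts are taken from the table, and the 180 000 attested wordforms from the paragraph before it. The forms-per-headword values are not stated in the message; they are simply what the division yields. Variable names are illustrative only.

```python
# Counts as given in the message.
headwords = {"nouns": 36_000, "adjectives": 9_000, "verbs": 5_000}
possible_forms = {"nouns": 252_000, "adjectives": 216_000, "verbs": 440_000}

# Implied forms per headword (derived here by division, not stated in the message).
forms_per_headword = {pos: possible_forms[pos] // headwords[pos] for pos in headwords}

total_headwords = sum(headwords.values())
total_possible = sum(possible_forms.values())

# 220 000 types in the corpus minus about 40 000 items of rubbish.
attested = 220_000 - 40_000
coverage = attested / total_possible

print(forms_per_headword)
print(f"{total_headwords} headwords, {total_possible} possible forms")
print(f"attested share of possible forms: {coverage:.1%}")
```

The implied per-headword figures (7, 24 and 88) match the word-form counts for 'lagos', 'kalos' and 'dheno' in the first table, and the corpus attests roughly a fifth of the theoretically derivable wordforms.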

I wrote a paper on these phenomena, but I never published it. If anyone
wants a copy I can send it to them.

I hope this is useful. I think this is quite important when dealing with
moderately or highly inflected languages. I would be curious to see what the
figures are for Czech or Hebrew.

Christopher Brewster

Foreign Language Teaching Centre,
University of Patras, Patras,
Greece, GR 26 500
tel: +30 61 623038
email: brewster@upatras.gr

> -----Original Message-----
> From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
> Behalf Of Sean Boisen
> Sent: Tuesday, December 01, 1998 10:30 PM
> To: 'Corpora List'
> Subject: Corpora: Case/number distribution
>
>
> I'm looking for references to work on the distribution of forms across
> inflectional categories in languages with case systems. For example,
> Modern Greek (according to Joseph, in _The World's Major Languages_, ed.
> by Comrie 1990) has 4 cases and two numbers, meaning a given noun could
> occur in as many as eight different forms. The actual number of forms
> possible varies according to the declension: masculine o-stem nouns have
> 7 distinct forms (the nominative and vocative plurals are the same); the
> other declensions apparently only have four distinct forms. If there are
> Greek corpora marked with a part-of-speech inventory that distinguishes
> case and number, of course, all 8 possibilities could be distinguished.
>
> I presume (without any real evidence) that words in normal usage are not
> evenly distributed across these cases: for example, I'd assume the
> vocative singular is much less frequent, at least in news text, and the
> vocative plural very rare indeed. I presume the nominative case would be
> the most frequent, but if so, how much more frequent than the accusative
> or genitive?
>
> If you have references, unpublished findings, or even informed
> speculations about the distributional facts for Greek/Russian/whatever
> case language you've got, I'd appreciate hearing them.
>
> Sean Boisen
> Senior Scientist, BBN Technologies
> sboisen@bbn.com
>
>
