Corpora: ELRA News

Valerie Mapelli (mapelli@elda.fr)
Wed, 28 Jul 1999 16:39:12 +0200

[ We apologise for the duplicate posting of this announcement ]

___________________________________________________________
ELRA
European Language Resources Association
ELRA News
___________________________________________________________

*** ELRA NEW RESOURCES ***
*** Dutch PAROLE Corpus and Lexicon ***

We are happy to announce the availability of the Dutch PAROLE resources via
ELRA:

1) INTRODUCTION ON THE PAROLE PROJECT

The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised
set of "core" corpora and lexica for all European Union languages.

Language corpora and lexica were built according to the same design and
composition principles, in the period 1996-1998.

More details on the PAROLE project at:
http://www.icp.grenet.fr/ELRA/home.html
http://www.linglink.lu/le/projects/le-parole/index.html
(on the Dutch PAROLE corpus and lexicon, see: http://www.inl.nl)
___________________________________________________________

2) ELRA-W0019 Dutch PAROLE Distributable Corpus

The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the
20-million-word Dutch PAROLE Reference Corpus.

The Dutch corpus was annotated and checked according to the common core
PAROLE tagset. The Dutch data were also checked for type.

The Dutch PAROLE Distributable Corpus contains the following texts:
MEDIUM         SOURCE                        TIMESPAN    TOTAL NUMBER
                                                         OF WORDS
BOOKS          Van Sterkenburg:
               Wdlijst tot wdboek            1984             65,344
               Taal vt Journaal              1989             56,215
               WNT-portret                   1992             60,133

NEWSPAPERS     Short Newspaper texts:
               MN_Collection                 1986-1988        19,537
               CVNP(S)-Collection            1983-1990       179,220

PERIODICAL     Short texts from
               - Local Papers                1985-1988        47,019
               - Magazines                   1985-1989       164,589

MISCELLANEOUS  Texts to be read out in
               TV-news broadcasts for:
               - General audience            1992-1995     1,285,824
               - Youth                       1991-1995     1,008,658
               Short texts from
               Ephemera                      1985-1986       131,692

TOTAL                                                      3,018,231
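
As a quick consistency check, the per-source word counts in the table can be
summed and compared with the stated total. The sketch below only uses the
figures listed above; the dictionary keys are shortened labels chosen for this
illustration, not identifiers from the corpus distribution.

```python
# Word counts copied from the table above; keys are informal labels.
word_counts = {
    "Wdlijst tot wdboek": 65_344,
    "Taal vt Journaal": 56_215,
    "WNT-portret": 60_133,
    "MN_Collection": 19_537,
    "CVNP(S)-Collection": 179_220,
    "Local Papers": 47_019,
    "Magazines": 164_589,
    "TV-news, general audience": 1_285_824,
    "TV-news, youth": 1_008_658,
    "Ephemera": 131_692,
}

# Sum of all sources; this should agree with the TOTAL row.
total = sum(word_counts.values())

# Share of the corpus made up of the two TV-news text collections.
tv_words = (word_counts["TV-news, general audience"]
            + word_counts["TV-news, youth"])
tv_share = tv_words / total

print(total)  # 3018231, matching the TOTAL row
print(f"TV-news share: {tv_share:.1%}")
```

The TV-news texts account for roughly three quarters of the distributable
corpus, which is worth keeping in mind when using it as a general sample.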

Over 250,000 words of corpus texts have been PoS-tagged automatically. A total
of 59,798 running words has been manually corrected and checked at least two
times with respect to maximal granularity, according to a lexicographer’s
manual. The extra 9,000 words over the required 50,000 words compensate for the
occurrence of ca. 5,300 ‘keywords’ in the original texts. The fully corrected
material has been subjected to an automated post-control operation, checking
the pertinence relations between the various feature values, and instantiating
default values in case a mismatch (indicating a correction error) was found.
Ca. 200,000 words have been checked once for PoS and type. In addition to the
required PoS, type was checked for reasons of quality. This material has been
subjected to an automated correction procedure addressing the feature slots
(positions) beyond the first two for PoS and type so as to solve discrepancies
between the manually corrected PoS and type, and the possibly erroneous,
automatically assigned values of the remaining slots.

Special price for academic users from the Netherlands and Belgium: 150 EURO
(the data will be supplied directly by the Instituut voor Nederlandse
Lexicologie, http://www.inl.nl)

Price for ELRA members
For academic use: 270 EURO
For research use by a commercial organisation: 800 EURO
For commercial use: 1600 EURO

Price for non-members
For academic use: 300 EURO
For research use by a commercial organisation: 1300 EURO
For commercial use: 2500 EURO

___________________________________________________________

3) ELRA-L0031 Dutch PAROLE lexicon

The entry list of the lexicon consists of about 20,200 entries distributed over
13 parts of speech (POS). The entries have been described along the dimensions
of morphosyntax and syntax. Morphosyntactic information consists of various
lexical properties, like gender, number, case, person, inflection, etc.
Syntactic descriptions consist of typical complementation patterns associated
with the various lemmata.

The composition of the entry list of the lexicon is based on 3 corpora from the
Instituut voor Nederlandse Lexicologie (INL) and 2 lexica. The corpora contain
a total of about 54 million words and have been automatically annotated for
part-of-speech and lemma. The lexica contain morphosyntactic information of
various kinds. For verbs, nouns, adjectives and adverbs, lemmata that were
covered by at least 2 corpora and the 2 lexica were selected on the basis of
cumulative frequency, coverage (distribution over sources) and inflected forms.
For the smaller parts of speech, these selection requirements appeared to be
too strict. Entry selection for these parts of speech was based on ranked
frequency.

The entries, uniquely defined by the combination of part of speech (e.g. noun)
and subtype (e.g. common vs. proper noun), are provided with morphosyntactic
information according to the Dutch set of PAROLE categories and features, and,
where available, with syntactic information. Morphosyntactic information is
automatically extracted from the INL lexica. Syntactic data have been collected
manually, by inspection of corpus data and - where necessary - consultation of
reference works. The corpus consulted consists of the newspaper component and
the varied component of the 38 Million Words Corpus 1996.

Word forms in the Dutch PAROLE lexicon are not inflected according to general
paradigms, but are related to their lemma by a set of string procedures. These
procedures are not unique. They can be shared by many other word forms. An
example is suffixation with e for adjectives, which produces ‘goede’/good from
‘goed’. Inflected forms can be derived directly by applying the string
procedures to the lemma they are connected with.
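
The relation between lemmata and word forms described above can be sketched as
follows. This is only an illustration under stated assumptions: the names
`suffix`, `procedures`, `adj_suffix_e` and `lemma_links` are hypothetical and
do not reflect the actual SGML encoding of the lexicon, and the adjective
"klein" is added here purely as a second example of a lemma sharing the same
procedure. Only the "goed" / "goede" pair comes from the description above.

```python
from typing import Callable, Dict, List

# A string procedure maps a lemma to one of its word forms.
StringProcedure = Callable[[str], str]


def suffix(ending: str) -> StringProcedure:
    """Build a procedure that appends `ending` to a lemma."""
    def apply(lemma: str) -> str:
        return lemma + ending
    return apply


# Hypothetical inventory of shared procedures, stored once and reused.
procedures: Dict[str, StringProcedure] = {
    "adj_suffix_e": suffix("e"),
}

# Hypothetical links from lemmata to the procedures that yield their forms.
# Both adjectives point at the same procedure rather than at a paradigm.
lemma_links: Dict[str, List[str]] = {
    "goed": ["adj_suffix_e"],
    "klein": ["adj_suffix_e"],
}


def inflected_forms(lemma: str) -> List[str]:
    """Derive word forms by applying each linked procedure to the lemma."""
    return [procedures[name](lemma) for name in lemma_links[lemma]]


print(inflected_forms("goed"))   # ['goede']
print(inflected_forms("klein"))  # ['kleine']
```

Storing each procedure once and linking lemmata to it is what allows one
procedure to be shared by many word forms.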

The lexicon is set up as an SGML file (over 30 MB of plain ASCII). Its contents
have been encoded in a distributed manner: all formative entities (like
lemmata, syntactic phrases, feature bundles) are SGML entities, related by a
pointer mechanism to other entities.

The lexicon contains the following categories: adjectives (3,298 entries),
adpositions (80 entries), adverbs (554 entries), articles (3 entries),
conjunctions (70 entries), determiners (59 entries), interjections (235
entries), nouns (12,279 entries), numerals (77 entries), pronouns (85 entries),
residuals (186 entries), unique (1 entry), verbs (3,274 entries).
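
The category counts can be checked against the stated size of the entry list
("about 20,200 entries distributed over 13 parts of speech"). The sketch below
uses only the figures listed above; the variable names are chosen for this
illustration.

```python
# Entry counts per category, copied from the list above.
entries_per_category = {
    "adjectives": 3_298,
    "adpositions": 80,
    "adverbs": 554,
    "articles": 3,
    "conjunctions": 70,
    "determiners": 59,
    "interjections": 235,
    "nouns": 12_279,
    "numerals": 77,
    "pronouns": 85,
    "residuals": 186,
    "unique": 1,
    "verbs": 3_274,
}

category_count = len(entries_per_category)
entry_total = sum(entries_per_category.values())

print(category_count)  # 13
print(entry_total)     # 20201
```

The sum of 20,201 is consistent with the figure of about 20,200 entries over
13 parts of speech.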

Special price for academic users from the Netherlands and Belgium: 200 EURO
(the data will be supplied directly by the Instituut voor Nederlandse
Lexicologie, http://www.inl.nl)

Price for ELRA members
For academic use: 300 EURO
For research use by a commercial organisation: 1600 EURO
For commercial use: 8000 EURO

Price for non-members
For academic use: 400 EURO
For research use by a commercial organisation: 3000 EURO
For commercial use: 10000 EURO
___________________________________________________________

In case of potential cooperation between a user and the Instituut voor
Nederlandse Lexicologie with mutual revenues, specific conditions will apply.

Note: The prices of the Dutch PAROLE corpus and lexicon have been amended since
their publication in the last ELRA Newsletter (Vol. 4, No. 2).

=====================================
For further information, please contact :

ELRA/ELDA                      Tel : +33 01 43 13 33 33
55-57 rue Brillat-Savarin      Fax : +33 01 43 13 33 30
F-75013 Paris, France          E-mail : mapelli@elda.fr

or visit our Web site:

http://www.icp.grenet.fr/ELRA/home.html
=====================================