W0019 : Dutch PAROLE Distributable Corpus

The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference Corpus.

The Dutch corpus was annotated and checked according to the common core PAROLE tagset. The Dutch data were also checked for type.

The Dutch PAROLE Distributable Corpus contains the following texts:

MEDIUM          SOURCE                                            TIMESPAN    TOTAL NUMBER OF WORDS

BOOKS           Van Sterkenburg:
                  Wdlijst tot wdboek                              1984                       65,344
                  Taal vt Journaal                                1989                       56,215
                  WNT-portret                                     1992                       60,133

NEWSPAPERS      Short Newspaper texts:
                  MN_Collection                                   1986-1988                  19,537
                  CVNP(S)-Collection                              1983-1990                 179,220

PERIODICAL      Short texts from
                  - Local Papers                                  1985-1988                  47,019
                  - Magazines                                     1985-1989                 164,589

MISCELLANEOUS   Texts to be read out in TV-news broadcasts for:
                  - General audience                              1992-1995               1,285,824
                  - Youth                                         1991-1995               1,008,658
                Short texts from Ephemera                         1985-1986                 131,692

TOTAL                                                                                     3,018,231
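
The word counts listed above can be cross-checked against the stated total. The following is a minimal Python sketch of that arithmetic; the dictionary keys are simply the source labels from the listing, and the variable names are chosen here for illustration.

```python
# Word counts per source, copied from the corpus composition listing.
word_counts = {
    "Wdlijst tot wdboek": 65_344,
    "Taal vt Journaal": 56_215,
    "WNT-portret": 60_133,
    "MN_Collection": 19_537,
    "CVNP(S)-Collection": 179_220,
    "Local Papers": 47_019,
    "Magazines": 164_589,
    "TV-news, general audience": 1_285_824,
    "TV-news, youth": 1_008_658,
    "Ephemera": 131_692,
}

stated_total = 3_018_231
computed_total = sum(word_counts.values())

# The per-source counts add up exactly to the stated total.
print(computed_total == stated_total)  # True
```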

Over 250,000 words of corpus text have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked at least twice at maximal granularity, following a lexicographer's manual. The extra 9,000 words beyond the required 50,000 compensate for the occurrence of ca. 5,300 "keywords" in the original texts. The fully corrected material has been subjected to an automated post-control operation that checks the pertinence relations between the various feature values and instantiates default values wherever a mismatch (indicating a correction error) is found.

Ca. 200,000 words have been checked once for PoS and type; type was checked in addition to the required PoS for reasons of quality. This material has been subjected to an automated correction procedure addressing the feature slots (positions) beyond the first two, which hold PoS and type, so as to resolve discrepancies between the manually corrected PoS and type and the possibly erroneous, automatically assigned values of the remaining slots.
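
The kind of post-control described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the slot names, feature values, constraint table, and the function `post_control` are all hypothetical and are not taken from the actual PAROLE tagset or from the tools used for this corpus. The idea shown is that PoS and type (the first two slots) are trusted, and any remaining slot whose value is incompatible with them is reset to a default.

```python
from __future__ import annotations

# Hypothetical constraint table (NOT the real PAROLE tagset): for each
# (pos, type) pair, the set of values each remaining slot may take.
ALLOWED = {
    ("NOUN", "common"): {"number": {"sg", "pl"}, "degree": {"-"}},
    ("ADJ", "qualifying"): {"number": {"-"}, "degree": {"pos", "comp", "sup"}},
}

# Hypothetical default value for each remaining slot, per (pos, type) pair.
DEFAULTS = {
    ("NOUN", "common"): {"number": "sg", "degree": "-"},
    ("ADJ", "qualifying"): {"number": "-", "degree": "pos"},
}


def post_control(tag: dict) -> tuple[dict, list[str]]:
    """Return a corrected copy of `tag` and the names of mismatching slots.

    PoS and type are treated as already checked by hand; only the slots
    beyond those two are validated and, on mismatch, set to a default.
    """
    key = (tag["pos"], tag["type"])
    fixed = dict(tag)
    mismatches = []
    for slot, allowed_values in ALLOWED[key].items():
        if fixed.get(slot) not in allowed_values:
            mismatches.append(slot)
            fixed[slot] = DEFAULTS[key][slot]
    return fixed, mismatches


# A noun that was wrongly left with an adjective-style degree value.
suspect = {"pos": "NOUN", "type": "common", "number": "pl", "degree": "comp"}
corrected, flagged = post_control(suspect)
print(corrected)  # degree is reset to "-", number is kept as "pl"
print(flagged)    # ["degree"]
```

Returning the list of mismatching slots alongside the corrected tag keeps a record of where a correction error was likely made, which a real procedure would presumably want for later review.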

More info on the Parole project.



Copyright © 1996-2001 ELRA/ELDA