C-Oral-Rom : corpora sampling

Corpora
Sampling

The sampling distribution should be as follows:

INFORMAL 150,000 words

(Long samples of 4,500 words (L); short samples of 1,500 words (S); collections of very short dialogues (M)*)

Private / Family context (- public, - partially scripted): 113,000

Public context (+ public, - partially scripted): 37,000

-public

+ partially scripted

Private / Family context:
Monologues 33,000
Dialogues/Conversations 80,000 **

Public context:
Monologues 6,000
Dialogues/Conversations 31,000

* Up to 7,500 words of collections of very short dialogues in public context (where possible)

** At least 23,000 words of conversations with more than two participants

10 long texts and 64 short samples (or more, depending on whether the corpus includes collections of very short dialogues), distributed as proportionally as possible across the four fields.
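The figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming that every sample has exactly its nominal length (4,500 or 1,500 words), which real recordings will only approximate. The names `INFORMAL_FIELDS`, `context_sums` and `short_samples_needed` are hypothetical helpers introduced for illustration and are not part of the project decisions.

```python
# Informal word targets for the four fields, as listed in the scheme.
INFORMAL_FIELDS = {
    ("private/family", "monologues"): 33_000,
    ("private/family", "dialogues/conversations"): 80_000,
    ("public", "monologues"): 6_000,
    ("public", "dialogues/conversations"): 31_000,
}
CONTEXT_TOTALS = {"private/family": 113_000, "public": 37_000}
INFORMAL_TARGET = 150_000
LONG_WORDS, SHORT_WORDS = 4_500, 1_500  # nominal sample sizes


def context_sums(fields):
    """Add up the field targets per context."""
    sums = {}
    for (context, _kind), words in fields.items():
        sums[context] = sums.get(context, 0) + words
    return sums


def short_samples_needed(n_long, very_short_words=0):
    """Short samples required to reach the informal target, rounded up.

    Assumption: all samples have exactly their nominal length.
    """
    remaining = INFORMAL_TARGET - n_long * LONG_WORDS - very_short_words
    return -(-remaining // SHORT_WORDS)  # ceiling division on integers


# Field targets agree with the per-context totals.
print(context_sums(INFORMAL_FIELDS) == CONTEXT_TOTALS)
# Words covered by 10 long and 64 short samples at nominal length.
print(10 * LONG_WORDS + 64 * SHORT_WORDS)
# Short samples needed with the full 7,500 words of very short dialogues.
print(short_samples_needed(10, very_short_words=7_500))
# Short samples needed with no very short dialogue collections at all.
print(short_samples_needed(10))
```

Under these assumptions, 10 long and 64 short samples cover 141,000 words, and the number of short samples needed to reach 150,000 rises from 65 to 70 as the share of very short dialogue collections falls from 7,500 words to zero.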

FORMAL 150,000 words

Formal in natural context (+ public, + scripted or partially scripted): 65,000 words (2 or 3 samples per genre, averaging 3,000 words each)	Media (+ public, + scripted or partially scripted): 60,000 words (2 or 3 samples per genre, averaging 3,000 words each)	Telephone: 25,000 words; text length not defined in the decisions (suggestion: upper limit of 1,500 words, no lower limit)
political speech	news (small sample)	private conversation
political debate	weather forecast (small sample)	phone to call services
preaching	interviews	man-machine interaction
teaching	reportage
professional explanation	scientific press
conference	sport
business	talk shows political debate
law (through media)	talk shows thematic discussions
	talk shows culture
	talk shows science
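
As a rough plausibility check on the formal section, the sketch below compares each word target with the range implied by 2 or 3 samples of about 3,000 words per listed text type. It rests on assumptions not stated in the decisions: each listed item counts as one text type (including the four talk show categories), and the "small sample" items are treated like any other. The names `FORMAL` and `word_range` are hypothetical and used only for illustration.

```python
AVERAGE_SAMPLE_WORDS = 3_000
SAMPLES_PER_TYPE = (2, 3)  # minimum and maximum samples per text type

# Word target and listed text types for the two sampled formal columns.
FORMAL = {
    "natural context": (65_000, [
        "political speech", "political debate", "preaching", "teaching",
        "professional explanation", "conference", "business",
        "law (through media)",
    ]),
    "media": (60_000, [
        "news", "weather forecast", "interviews", "reportage",
        "scientific press", "sport", "talk shows political debate",
        "talk shows thematic discussions", "talk shows culture",
        "talk shows science",
    ]),
}
TELEPHONE_WORDS = 25_000


def word_range(text_types):
    """Word range implied by 2 or 3 average-length samples per type."""
    low, high = SAMPLES_PER_TYPE
    n = len(text_types)
    return low * n * AVERAGE_SAMPLE_WORDS, high * n * AVERAGE_SAMPLE_WORDS


for name, (target, text_types) in FORMAL.items():
    low, high = word_range(text_types)
    # Report whether the stated target falls inside the implied range.
    print(name, len(text_types), low, high, low <= target <= high)

formal_total = sum(target for target, _ in FORMAL.values()) + TELEPHONE_WORDS
print(formal_total)
```

With these assumptions both targets fall inside the implied ranges (48,000 to 72,000 words for natural context, 60,000 to 90,000 for media), and the three columns add up to the 150,000 words planned for the formal section.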