Corpora

Sampling

The sampling distribution should be as follows:

INFORMAL 150,000 words

(Long samples: 4,500 words (L); short samples: 1,500 words (S); collections of very short dialogues (M)*)

Private / Family context:

- public

- partially scripted

113,000

Public context:

+ public

- partially scripted

37,000

- public

+ partially scripted

Private / Family context, by text type:

Monologues 33,000

Dialogues/Conversations 80,000 **

Public context, by text type:

Monologues 6,000

Dialogues/Conversations 31,000

* Up to 7,500 words of collections of very short dialogues in public context (where possible)

** At least 23,000 words of conversations with more than two participants

10 long texts and 64 short samples (or more, depending on whether the corpus contains collections of very short dialogues), distributed as proportionally as possible across the four fields.
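The arithmetic behind the informal quotas can be checked with a short script. This is only a sketch: the figures are copied from the distribution above, the variable and function names are illustrative, and the nominal sample lengths (4,500 and 1,500 words) are treated as exact, which is an assumption.

```python
# Word quotas for the INFORMAL part, copied from the distribution above.
INFORMAL_TOTAL = 150_000
quotas = {
    ("private/family", "monologues"): 33_000,
    ("private/family", "dialogues/conversations"): 80_000,
    ("public", "monologues"): 6_000,
    ("public", "dialogues/conversations"): 31_000,
}

# Assumption: nominal sample lengths are taken as exact word counts.
LONG_WORDS = 4_500
SHORT_WORDS = 1_500


def context_totals(quotas):
    """Sum the quotas of the four fields by context."""
    totals = {}
    for (context, _text_type), words in quotas.items():
        totals[context] = totals.get(context, 0) + words
    return totals


def sampled_words(n_long, n_short):
    """Words covered by n_long long samples and n_short short samples."""
    return n_long * LONG_WORDS + n_short * SHORT_WORDS


totals = context_totals(quotas)
covered = sampled_words(10, 64)
remaining = INFORMAL_TOTAL - covered

print(totals)     # per-context totals
print(covered)    # words covered by 10 long + 64 short samples
print(remaining)  # words left for further short samples or (M) collections
```

Under these assumptions the four fields add up to 113,000 and 37,000 words, and 10 long plus 64 short samples cover 141,000 words, leaving 9,000 words for additional short samples or collections of very short dialogues.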

FORMAL 150,000 words

Formal in natural context

+ public

+ scripted or partially scripted

65,000

(2 or 3 samples per genre, averaging 3,000 words each)

Media

+ public

+ scripted or partially scripted

60,000

(2 or 3 samples per genre, averaging 3,000 words each)

Telephone

 

 

25,000

(Text length not defined in the decisions; suggestion: upper limit of 1,500 words, no lower limit)

Genres by field:

Formal in natural context: political speech; political debate; preaching; teaching; professional explanation; conference; business; law (through media)

Media: news (small sample); weather forecast (small sample); interviews; reportage; scientific press; sport; talk shows (political debate); talk shows (thematic discussions); talk shows (culture); talk shows (science)

Telephone: private conversation; phone calls to call services; man-machine interaction
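As a rough consistency check on the formal quotas, the sketch below compares the word targets with the range implied by "2 or 3 samples per genre" at an average of 3,000 words, and derives the minimum number of telephone texts under the suggested 1,500-word upper limit. The genre counts (8 and 10) are an assumption based on reading the genre listing above as three columns; all names are illustrative.

```python
import math

# Word targets for the FORMAL part, copied from the distribution above.
FORMAL_TOTAL = 150_000
targets = {
    "formal in natural context": 65_000,
    "media": 60_000,
    "telephone": 25_000,
}

# Assumption: number of genres per field, counted from the genre listing.
genre_counts = {"formal in natural context": 8, "media": 10}

AVG_SAMPLE_WORDS = 3_000        # "3,000 words average"
TELEPHONE_MAX_WORDS = 1_500     # suggested upper limit, not a decision


def implied_range(n_genres, avg_words=AVG_SAMPLE_WORDS, per_genre=(2, 3)):
    """Word range implied by 2 or 3 samples per genre at the average length."""
    low, high = per_genre
    return n_genres * low * avg_words, n_genres * high * avg_words


def min_texts(target_words, max_words_per_text):
    """Fewest texts needed to reach the target if every text hits the limit."""
    return math.ceil(target_words / max_words_per_text)


for field, n_genres in genre_counts.items():
    low, high = implied_range(n_genres)
    print(field, targets[field], (low, high), low <= targets[field] <= high)

print(min_texts(targets["telephone"], TELEPHONE_MAX_WORDS))
```

With these assumed genre counts both targets fall inside the implied ranges (the media target sits at the lower bound, which is in line with news and weather forecast being small samples), and the telephone quota needs at least 17 texts.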