Corpus design: Final Target of each CORPUS in the C-ORAL-ROM resource

Corpus design:

C-ORAL-ROM delivers a multilingual reference corpus of spontaneous speech for the main Romance languages (French, Italian, Portuguese and Spanish), recorded in free situations, with roughly 300,000 words for each language (50% informal speech and 50% formal speech, the latter including media and telephone conversations). The corpus design ensures both the representation of spontaneous speech within each language and comparability across the four Romance corpora.

Final Target of each CORPUS in the C-ORAL-ROM resource

Informal: 150,000 words – at least 74 texts

Context	Features	Monologues	Dialogues/Conversations	Total
Family/Private context	- public; - partially scripted	42,000	82,500*	124,500
Public context	+ public; - public; - partially scripted + partially scripted	6,000	19,500	25,500

*at least 23,000 words from conversations with more than two participants, distributed along the lines of the following matrix

Text length

· Long texts: 10 texts of around 4500 words each (around 30 minutes each)

· Short texts: at least 64 texts of around 1500 words each (around 10 minutes each)

· Very short texts: in dialogues or conversations in public contexts, up to 7,500 words may be taken from collections of very short texts (2 to 5 minutes each)

· A 5% variation is allowed (that is, texts of about 1425 and 4275 words are acceptable). There is no upper limit.
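The length targets above can be checked with simple arithmetic. The following Python sketch recomputes the 5% lower bounds and the word total implied by the minimum text counts. The names `TARGETS`, `lower_bound` and `is_acceptable` are illustrative and not part of the C-ORAL-ROM specification.

```python
# Illustrative check of the informal text-length targets (all names are hypothetical).
TARGETS = {
    "long": {"words": 4500, "min_texts": 10},
    "short": {"words": 1500, "min_texts": 64},
}
TOLERANCE = 0.05        # 5% variation allowed below the target length
VERY_SHORT_MAX = 7500   # upper limit for words taken from very short texts


def lower_bound(target_words: int, tolerance: float = TOLERANCE) -> int:
    """Smallest acceptable length for a text with the given target size."""
    return round(target_words * (1 - tolerance))


def is_acceptable(word_count: int, target_words: int) -> bool:
    """A text is acceptable if it reaches the lower bound; there is no upper limit."""
    return word_count >= lower_bound(target_words)


bounds = {name: lower_bound(t["words"]) for name, t in TARGETS.items()}
min_texts = sum(t["min_texts"] for t in TARGETS.values())
planned_words = sum(t["words"] * t["min_texts"] for t in TARGETS.values())

print(bounds)                          # {'long': 4275, 'short': 1425}
print(min_texts)                       # 74
print(planned_words)                   # 141000
print(planned_words + VERY_SHORT_MAX)  # 148500
```

The minimum counts give 74 texts and 141,000 words, or 148,500 words once the very short texts are included. The remainder of the 150,000-word target is presumably covered by additional or longer texts, which the absence of an upper limit permits.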

Formal: 150,000 words

*Formal in natural context*	*Media*	*Telephone*
+ public; + scripted or partially scripted	+ public; + scripted or partially scripted	
65,000 words (2 or 3 samples per genre, 3,000 words each on average)	60,000 words (2 or 3 samples per genre, 3,000 words each on average)	25,000 words (text length not defined in the decisions; suggestion: upper limit of 1,500 words, no lower limit)
political speech	news (small sample)	private conversation
political debate	weather forecast (small sample)	calls to phone services (man-machine interaction)
preaching	interviews
teaching	reportage
professional explanation	scientific press
conference	sport
business	talk shows political debate
law (through media)	talk shows thematic discussions
	talk shows culture
	talk shows science

Speaker parameters

Differences among speakers are not used as variation parameters for corpus comparison, but they are always recorded as meta-textual information regarding:

1. Age (A: 18-25; B: 25-40; C: 40-60; D: >60),

2. Sex (M-F),

3. Education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. – university),

4. Geographical origin.
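The four speaker parameters above can be represented as a small record that validates the listed codes. This is a minimal sketch in Python: the class name `Speaker`, its field names and the sample values are hypothetical, and it does not reproduce the actual C-ORAL-ROM header syntax.

```python
from dataclasses import dataclass

# Codes as listed above; the constant names are hypothetical.
AGE_CLASSES = {"A", "B", "C", "D"}
SEXES = {"M", "F"}
EDUCATION_LEVELS = {1, 2, 3}


@dataclass(frozen=True)
class Speaker:
    label: str       # speaker label used in the transcript (made-up example: "ABC")
    age: str         # age class code
    sex: str         # "M" or "F"
    education: int   # education level code
    origin: str      # geographical origin, kept as free text

    def __post_init__(self) -> None:
        # Reject codes that are not in the lists given in the text.
        if self.age not in AGE_CLASSES:
            raise ValueError(f"unknown age class: {self.age!r}")
        if self.sex not in SEXES:
            raise ValueError(f"unknown sex code: {self.sex!r}")
        if self.education not in EDUCATION_LEVELS:
            raise ValueError(f"unknown education level: {self.education!r}")


speaker = Speaker(label="ABC", age="B", sex="F", education=3, origin="Firenze")
print(speaker.age, speaker.sex, speaker.education)  # B F 3
```

Because these parameters are metadata rather than design variables, the record only checks that each code is valid and makes no attempt to balance speakers across classes.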

Textual and sound format

The resource has been recorded with various types of analogue or digital equipment and meets the objective of representing different levels of acoustic quality of spontaneous speech. All audio files have been selected and evaluated on the basis of the possibility of a meaningful F₀ analysis. The audio is stored in standard uncompressed .wav files (22050 Hz, 16 bit).
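The stated audio format can be verified programmatically. The following is a minimal sketch using Python's standard `wave` module, assuming the files are plain PCM WAV. The function names `check_wav` and `write_silence` are illustrative, and the mono silent files exist only for the demonstration (the number of channels is not specified in the text).

```python
import os
import tempfile
import wave

EXPECTED_RATE = 22050       # Hz
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16 bit


def check_wav(path: str) -> list[str]:
    """Return a list of deviations from the stated format (empty if none)."""
    problems = []
    # wave.open raises wave.Error for files that are not uncompressed PCM.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bit")
    return problems


def write_silence(path: str, rate: int) -> None:
    """Write one second of 16-bit mono silence (demonstration data only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 22050)
    write_silence(bad, 44100)
    good_report = check_wav(good)
    bad_report = check_wav(bad)

print(good_report)  # []
print(bad_report)   # ['sample rate is 44100 Hz']
```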

The C-ORAL-ROM transcription format is a variant of the CHAT format (MacWhinney, 1994) enriched with prosodic segmentation of the text into utterances and tone units (Cresti, 2000). Prosodic segmentation is systematically determined in each Romance corpus through perceptual judgements, as a function of perceptually relevant F₀ movements (’t Hart et al., 1992). The text format allows utterance-based text-sound alignment, to be performed in year 2 with the speech software WinPitch Corpus (Deliverable 2.1).

Texts are divided into:

a) A heading containing a defined set of meta-textual information

b) Text lines in orthographic transcription, divided:

· vertically, into dialogic turns (each introduced by a speaker label)

· horizontally, by prosodic parsing and utterance limits, representing terminal and non-terminal prosodic breaks of the speech continuum.

c) Dependent tiers for context information and possible tagging levels.
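The three-part layout described above can be illustrated with a small parser. This Python sketch assumes the general CHAT conventions (`@` for header lines, `*` plus a speaker label for turns, `%` for dependent tiers). It also assumes, purely for illustration, that `//` marks a terminal break and `/` a non-terminal break, since the actual symbols are not listed in this section. The sample text and the function name `parse_transcript` are made up.

```python
# Made-up sample; the break symbols "//" and "/" are assumptions for illustration.
SAMPLE = """\
@Title: sample
*ABC: good morning / how are you //
%com: ABC enters the room
*DEF: fine / thanks // and you //
"""


def parse_transcript(text: str):
    """Split a transcript into header lines and turns with utterances and tone units."""
    header, turns = [], []
    for line in text.splitlines():
        if line.startswith("@"):
            # a) heading: meta-textual information
            header.append(line[1:].strip())
        elif line.startswith("*"):
            # b) text line: one dialogic turn introduced by a speaker label
            speaker, _, content = line[1:].partition(":")
            utterances = [
                [unit.strip() for unit in utterance.split("/") if unit.strip()]
                for utterance in content.split("//")
                if utterance.strip()
            ]
            turns.append({"speaker": speaker.strip(),
                          "utterances": utterances,
                          "tiers": []})
        elif line.startswith("%") and turns:
            # c) dependent tier: attached to the preceding turn
            turns[-1]["tiers"].append(line[1:].strip())
    return header, turns


header, turns = parse_transcript(SAMPLE)
print(header)                  # ['Title: sample']
print(turns[0]["utterances"])  # [['good morning', 'how are you']]
print(turns[0]["tiers"])       # ['com: ABC enters the room']
print(turns[1]["utterances"])  # [['fine', 'thanks'], ['and you']]
```

Each turn holds a list of utterances (delimited by terminal breaks), and each utterance holds its tone units (delimited by non-terminal breaks), mirroring the vertical and horizontal divisions of the text lines.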