Corpus design:

 

C-ORAL-ROM delivers a multilingual reference corpus of spontaneous speech for the main Romance languages (French, Italian, Portuguese and Spanish), recorded in free situations and comprising roughly 300,000 words per language (50% informal speech, 50% formal speech, the latter including media and telephone conversations). The corpus design ensures both that spontaneous speech is represented in each language and that the four Romance corpora are comparable.

 

 

Final target of each corpus in the C-ORAL-ROM resource

Informal: 150.000 words, at least 74 texts

Family/Private context (- public; - partially scripted): 124.500 words

·         Monologues: 42.000

·         Dialogues/Conversations: 82.500*

Public context (+ public, - partially scripted; or - public, + partially scripted): 25.500 words

·         Monologues: 6.000

·         Dialogues/Conversations: 19.500

 

*at least 23.000 words from conversations with more than two participants, distributed along the lines of the following matrix

 

Text length

 

·         Long texts: 10 texts of around 4500 words each (around 30 minutes each)

·         Short texts: at least 64 texts of around 1500 words each (around 10 minutes each)

·         Very short texts: in dialogues or conversations in public contexts, up to 7.500 words may be taken from collections of very short texts (from 2 to 5 minutes each)

·         A 5% variation is allowed (that is, texts of about 1425 and 4275 words are acceptable). There is no upper limit.
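The tolerance rule above can be worked through in a few lines of Python. This is only a sketch: it assumes that the 5% variation applies to the lower bound (since no upper limit is imposed), and the names `NOMINAL_LENGTH`, `minimum_length` and `is_acceptable` are hypothetical, introduced purely for illustration.

```python
# Nominal text lengths in words, taken from the list above.
# All names in this sketch are hypothetical.
NOMINAL_LENGTH = {"long": 4500, "short": 1500}
TOLERANCE_PERCENT = 5  # assumed to apply to the lower bound only


def minimum_length(nominal, tolerance_percent=TOLERANCE_PERCENT):
    """Smallest acceptable word count for a given nominal length."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return nominal * (100 - tolerance_percent) // 100


def is_acceptable(word_count, kind):
    """True if a text of the given kind reaches its lower bound.

    No upper limit is checked, in line with the rule above.
    """
    return word_count >= minimum_length(NOMINAL_LENGTH[kind])


print(minimum_length(NOMINAL_LENGTH["short"]))  # lower bound for short texts
print(minimum_length(NOMINAL_LENGTH["long"]))   # lower bound for long texts
```

Under this reading, the two lower bounds are the 1425 and 4275 words quoted in the list.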

 

 

Formal: 150.000 words

Formal in natural context (+ public; + scripted or partially scripted): 65.000 words (2 or 3 samples for each genre, averaging 3000 words)

·         political speech

·         political debate

·         preaching

·         teaching

·         professional explanation

·         conference

·         business

·         law (through media)

Media (+ public; + scripted or partially scripted): 60.000 words (2 or 3 samples for each genre, averaging 3000 words)

·         news (small sample)

·         weather forecast (small sample)

·         interviews

·         reportage

·         scientific press

·         sport

·         talk shows: political debate

·         talk shows: thematic discussions

·         talk shows: culture

·         talk shows: science

Telephone: 25.000 words (text length not defined in the decisions; suggestion: 1500 words upper limit, no lower limit)

·         private conversation

·         calls to phone services (man-machine interaction)
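The word targets given for the informal and formal halves can be checked for internal consistency. The following Python sketch encodes the figures quoted in this section (with thousands separators removed) and sums them. The nested layout and the names `CORPUS_TARGETS`, `total_words` and `approximate_samples` are assumptions made for illustration, not part of the C-ORAL-ROM specification.

```python
# Word targets for one language corpus, copied from the figures in this section.
# The nested-dict layout and every name here are hypothetical.
CORPUS_TARGETS = {
    "informal": {
        "family_private": {"monologues": 42_000, "dialogues_conversations": 82_500},
        "public": {"monologues": 6_000, "dialogues_conversations": 19_500},
    },
    "formal": {
        "formal_natural_context": 65_000,
        "media": 60_000,
        "telephone": 25_000,
    },
}


def total_words(node):
    """Recursively sum the word targets stored in a nested dict."""
    if isinstance(node, dict):
        return sum(total_words(child) for child in node.values())
    return node


def approximate_samples(word_target, average_words=3_000):
    """Rough number of samples implied by a target, at 3000 words on average."""
    return word_target / average_words


print(total_words(CORPUS_TARGETS["informal"]))  # informal half
print(total_words(CORPUS_TARGETS["formal"]))    # formal half
print(total_words(CORPUS_TARGETS))              # whole corpus for one language
print(approximate_samples(CORPUS_TARGETS["formal"]["media"]))
```

Each half sums to 150.000 words and the whole to 300.000, which agrees with the overall size stated in the introduction.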

 

 

 

 

Speaker parameters

 

Differences among speakers are not used as variation parameters for corpus comparison, but they are always recorded as metatextual information covering:

 

1.        Age (A: 18-25; B: 25-40; C: 40-50; D: >60);

2.        Sex (M-F);

3.        Education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. - university);

4.        Geographical origin.
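The four speaker parameters above map naturally onto a small validated record. The sketch below is one possible representation in Python; the class `Speaker`, its field names and the example values are hypothetical, and only the code sets (A to D, M/F, 1 to 3) come from the list above.

```python
from dataclasses import dataclass

# Codes copied from the list above; everything else is hypothetical.
AGE_CODES = {"A", "B", "C", "D"}
SEX_CODES = {"M", "F"}
EDUCATION_CODES = {1, 2, 3}


@dataclass(frozen=True)
class Speaker:
    """Metatextual information recorded for one speaker."""

    name: str        # speaker label (illustrative field, not in the list above)
    age: str         # one of AGE_CODES
    sex: str         # one of SEX_CODES
    education: int   # one of EDUCATION_CODES
    origin: str      # geographical origin, free text

    def __post_init__(self):
        # Reject codes that are not defined in the speaker parameters.
        if self.age not in AGE_CODES:
            raise ValueError(f"unknown age code: {self.age!r}")
        if self.sex not in SEX_CODES:
            raise ValueError(f"unknown sex code: {self.sex!r}")
        if self.education not in EDUCATION_CODES:
            raise ValueError(f"unknown education code: {self.education!r}")


# Invented example values, for illustration only.
speaker = Speaker(name="ANA", age="B", sex="F", education=3, origin="Firenze")
print(speaker)
```

Keeping the age bands as opaque codes, rather than numeric ranges, avoids committing the sketch to any particular reading of the band boundaries.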

 

 

Textual and sound format

 

The resource has been recorded with various types of analogue and digital equipment, and meets the objective of representing different levels of acoustic quality in spontaneous speech. All audio files have been selected and evaluated on the basis of whether a meaningful F0 analysis is possible. The acoustic source is stored in standard uncompressed .wav files (22050 Hz, 16 bit).
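A file can be checked against the stated audio format with Python's standard `wave` module. This is a minimal sketch, not a tool distributed with the resource: the function name `matches_target_format` is hypothetical, and the number of channels is not checked because the text does not specify it.

```python
import wave

EXPECTED_RATE_HZ = 22050          # sampling rate stated above
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16 bit = 2 bytes per sample


def matches_target_format(path):
    """True if the file is uncompressed PCM at 22050 Hz, 16 bit.

    The channel count is deliberately not checked, since the text
    does not specify it.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getcomptype() == "NONE"
        )
```

A call such as `matches_target_format("recording.wav")`, where the file name is a placeholder, returns `True` or `False`; the `wave` module raises `wave.Error` if the file is not a readable PCM WAV file.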

 

The C-ORAL-ROM transcription format is a variant of the CHAT format (MacWhinney, 1994), enriched with the prosodic segmentation of the text into utterances and tone units (Cresti, 2000). In each Romance corpus, prosodic segmentation is determined systematically through perceptual judgements, as a function of perceptually relevant F0 movements (’t Hart et al., 1992). The text format allows utterance-based text-sound alignment through the speech software WinPitch Corpus (Deliverable 2.1), to be performed in year 2.

 

Texts are divided into:

a)       A heading containing a fixed set of metatextual information

b)       Text lines in orthographic transcription, divided:

·         horizontally, by prosodic parsing and utterance limits, representing terminal and non-terminal prosodic breaks of the speech continuum.

c)       Dependent tiers for context information and possible tagging levels.
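The three-part structure above can be illustrated with a small parser. The sketch below relies on the general CHAT convention that header lines begin with `@`, speaker lines with `*` and dependent tiers with `%`. The break symbols are an assumption, since this section does not specify them: `//` is taken to mark a terminal break and `/` a non-terminal one. The sample transcript and all function names are invented for illustration.

```python
def classify_line(line):
    """Classify a transcript line by its leading CHAT marker."""
    if line.startswith("@"):
        return "header"
    if line.startswith("*"):
        return "text"
    if line.startswith("%"):
        return "dependent"
    return "other"


def split_utterances(text, terminal="//", non_terminal="/"):
    """Split a text line into utterances, each a list of tone units.

    The break symbols are assumptions; adjust them to the actual
    transcription conventions.
    """
    utterances = []
    for chunk in text.split(terminal):
        units = [unit.strip() for unit in chunk.split(non_terminal) if unit.strip()]
        if units:
            utterances.append(units)
    return utterances


# Invented sample, not taken from the corpus.
SAMPLE = """@Title: invented example
*ANA: well / I think so // let's go //
%sit: ANA stands up
"""

for line in SAMPLE.splitlines():
    kind = classify_line(line)
    if kind == "text":
        # Separate the speaker code from the transcribed text.
        speaker, _, body = line[1:].partition(":")
        print(speaker, split_utterances(body))
    else:
        print(kind, line)
```

In the sample, the speaker line yields two utterances, the first made of two tone units and the second of one.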