Corpus design:
C-ORAL-ROM delivers a multilingual reference corpus of spontaneous speech for the main Romance languages (French, Italian, Portuguese and Spanish) recorded in free situations, roughly 300,000 words for each language (informal speech 50%, formal speech 50%, including media and telephone conversations). The corpus design ensures both the representation of spontaneous speech in each language and comparability across the four Romance corpora.
Final target of each corpus in the C-ORAL-ROM resource
Informal: 150,000 words – at least 74 texts
| Context | Features | Total words | Monologues | Dialogues/Conversations |
| Family/Private context | - public; - partially scripted | 124,500 | 42,000 | 82,500* |
| Public context | + public; - public - partially scripted + partially scripted | 25,500 | 6,000 | 19,500 |
* At least 23,000 words from conversations with more than two participants, distributed along the lines of the following matrix:
· Long texts: 10 texts of around 4,500 words each (around 30 minutes each)
· Short texts: at least 64 texts of around 1,500 words each (around 10 minutes each)
· Very short texts: in dialogues or conversations in public contexts, up to 7,500 words taken from collections of very short texts (from 2 to 5 minutes each)
· 5% variation allowed (that is, texts of about 1,425 and 4,275 words are allowed). No upper limit.
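The figures above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the names (`QUOTAS`, `minimum_length`, `is_acceptable`) are hypothetical and not part of the C-ORAL-ROM resource, and the 5% variation is read here as a lower bound only, since the design sets no upper limit.

```python
# Word quotas for the informal half of one corpus, as listed in the design.
INFORMAL_TARGET = 150_000
QUOTAS = {
    ("family/private", "monologues"): 42_000,
    ("family/private", "dialogues/conversations"): 82_500,
    ("public", "monologues"): 6_000,
    ("public", "dialogues/conversations"): 19_500,
}

# Minimum number of texts and nominal length (in words) per text type.
MIN_TEXTS = {"long": 10, "short": 64}
NOMINAL_LENGTH = {"long": 4_500, "short": 1_500}
TOLERANCE_PERCENT = 5  # allowed shortfall; there is no upper limit


def context_total(context: str) -> int:
    """Sum the word quotas of every cell belonging to one context."""
    return sum(words for (ctx, _), words in QUOTAS.items() if ctx == context)


def minimum_length(kind: str) -> int:
    """Shortest admissible text of the given kind, in words."""
    return NOMINAL_LENGTH[kind] * (100 - TOLERANCE_PERCENT) // 100


def is_acceptable(kind: str, word_count: int) -> bool:
    """A text is acceptable when it is not shorter than the tolerated minimum."""
    return word_count >= minimum_length(kind)


if __name__ == "__main__":
    print(context_total("family/private"), context_total("public"))
    print(sum(QUOTAS.values()) == INFORMAL_TARGET)
    print(minimum_length("short"), minimum_length("long"))
    print(sum(MIN_TEXTS.values()))
```

The two context totals reproduce the 124,500 and 25,500 figures, the four cells add up to the 150,000-word target, and the long and short texts together give the minimum of 74 texts.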
Formal: 150,000 words
| Formal in natural context | Media | Telephone |
| + public; + scripted or partially scripted | + public; + scripted or partially scripted | |
| 65,000 words (2 or 3 samples for each genre, 3,000 words on average) | 60,000 words (2 or 3 samples for each genre, 3,000 words on average) | 25,000 words; text length not defined in the decisions (suggestion: 1,500 words upper limit, no lower limit) |
| political speech | news (small sample) | private conversation |
| political debate | weather forecast (small sample) | phone calls to services (man-machine interaction) |
| preaching | interviews | |
| teaching | reportage | |
| professional explanation | scientific press | |
| conference | sport | |
| business | talk shows: political debate | |
| law (through media) | talk shows: thematic discussions | |
| | talk shows: culture | |
| | talk shows: science | |
Differences among speakers are not variation parameters used for corpus comparison, but they are always recorded as meta-textual information regarding:
1. Age (A: 18-25; B: 25-40; C: 40-50; D: >60);
2. Sex (M-F);
3. Education (1: illiterate and/or elementary school; 2: secondary school - high school; 3: B.A. – university);
4. Geographical origin.
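As an illustration, the four speaker parameters could be held in a small validated record. This is a minimal sketch, not the format used in the corpus headers: the class name, the field names and the free-text `origin` field are assumptions made for the example; only the codes (A-D, M/F, 1-3) come from the list above.

```python
from dataclasses import dataclass

# Codes taken from the list above.
AGE_CLASSES = {"A", "B", "C", "D"}
SEXES = {"M", "F"}
EDUCATION_LEVELS = {1, 2, 3}


@dataclass(frozen=True)
class Speaker:
    """Hypothetical container for the meta-textual speaker information."""

    name: str        # speaker label (assumed field)
    age_class: str   # one of A, B, C, D
    sex: str         # M or F
    education: int   # 1, 2 or 3
    origin: str      # geographical origin, free text (assumed representation)

    def __post_init__(self) -> None:
        # Reject any code that is not part of the documented sets.
        if self.age_class not in AGE_CLASSES:
            raise ValueError(f"unknown age class: {self.age_class!r}")
        if self.sex not in SEXES:
            raise ValueError(f"unknown sex code: {self.sex!r}")
        if self.education not in EDUCATION_LEVELS:
            raise ValueError(f"unknown education level: {self.education!r}")


if __name__ == "__main__":
    print(Speaker("ANN", "B", "F", 3, "Firenze"))
```

Keeping the codes in explicit sets means a mistyped value fails when the record is created rather than later, during corpus comparison.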
Textual and sound format
The resource has been recorded with various types of analogue or digital equipment and meets the objective of representing different levels of acoustic quality in spontaneous speech. All audio files have been selected and evaluated on the basis of the possibility of a meaningful F0 analysis. The acoustic source is stored in standard uncompressed .wav files, 22050 Hz, 16 bit.
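A file can be checked against the stated audio format with Python's standard `wave` module. The sketch below is an assumption about how such a check might look, not a tool shipped with the resource; the function names are hypothetical, and `write_silence` exists only to produce test files for the demonstration.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit = 2 bytes per sample


def matches_corpus_format(path: str) -> bool:
    """Return True for an uncompressed 22050 Hz, 16-bit .wav file."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getcomptype() == "NONE"
            and wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


def write_silence(path: str, rate_hz: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit file of silence at the given sample rate."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.wav")
        bad = os.path.join(tmp, "bad.wav")
        write_silence(good, 22050)
        write_silence(bad, 44100)
        print(matches_corpus_format(good), matches_corpus_format(bad))
```

The check does not test the number of channels, since the text above does not specify it.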
The C-ORAL-ROM transcription format is a variant of the CHAT format (MacWhinney, 1994) enriched with prosodic segmentation of the text into utterances and tone units (Cresti, 2000). Prosodic segmentation is systematically determined in each Romance corpus through perceptual judgements, as a function of perceptually relevant F0 movements (’t Hart et al., 1992). The text format allows a suitable utterance-based text-sound alignment through the speech software WinPitch Corpus (Deliverable 2.1), to be performed in year 2.
Texts are divided into:
a) Heading containing a definite set of meta-textual information
b) Text lines in orthographic transcription divided:
· horizontally, by prosodic parsing and utterance limits, representing terminal and non-terminal prosodic breaks of the speech continuum.
c) Dependent tiers for context information and possible tagging levels.
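The three-part layout can be sketched as a small parser. The line prefixes follow general CHAT conventions (`@` for header lines, `*` for speaker text lines, `%` for dependent tiers). The break symbols are an assumption that is not stated in this text: `//` is taken here to mark a terminal break and `/` a non-terminal one. The sample lines and all names are hypothetical.

```python
from dataclasses import dataclass, field

TERMINAL_BREAK = "//"     # assumed symbol for a terminal prosodic break
NON_TERMINAL_BREAK = "/"  # assumed symbol for a non-terminal prosodic break


@dataclass
class Transcript:
    header: dict[str, str] = field(default_factory=dict)
    # Each turn: (speaker, list of utterances, each a list of tone units).
    turns: list[tuple[str, list[list[str]]]] = field(default_factory=list)
    tiers: list[tuple[str, str]] = field(default_factory=list)


def split_utterances(text: str) -> list[list[str]]:
    """Split a text line into utterances, then each utterance into tone units."""
    utterances = []
    for chunk in text.split(TERMINAL_BREAK):
        units = [u.strip() for u in chunk.split(NON_TERMINAL_BREAK) if u.strip()]
        if units:
            utterances.append(units)
    return utterances


def parse(lines: list[str]) -> Transcript:
    """Sort lines into header, text lines and dependent tiers by their prefix."""
    transcript = Transcript()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        marker, _, rest = line.partition(":")
        rest = rest.strip()
        if line.startswith("@"):
            transcript.header[marker[1:]] = rest
        elif line.startswith("*"):
            transcript.turns.append((marker[1:], split_utterances(rest)))
        elif line.startswith("%"):
            transcript.tiers.append((marker[1:], rest))
    return transcript


if __name__ == "__main__":
    sample = [
        "@Title: sample",
        "*ANN: ciao / come stai // bene //",
        "%sit: laughs",
    ]
    print(parse(sample))
```

Splitting on the terminal symbol first and the non-terminal symbol second mirrors the hierarchy described above, in which utterances contain tone units.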