SPEX / Dept. of Language and Speech
University of Nijmegen
Erasmusplein 1
NL-6525 HT Nijmegen
The Netherlands

SUBJECT: Validation Portuguese FDB4000 SpeechDat corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.0
DATE   : 17/08/1998

The speech databases made within the SpeechDat project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat format and content specifications, as documented in Deliverables 1.3.1, 1.3.2 and 1.3.3 of the project. The validation results of the Portuguese Fixed Network SpeechDat database (4000 speakers) are contained in this document. This database was validated and approved by the SpeechDat Consortium.

In the validation procedure we systematically check a list of validation criteria for a range of subjects. In the following sections we will evaluate these criteria one by one. Validation results that call for attention because of deviations from the SpeechDat specifications are marked by =>. They can be easily extracted/'grepped' in this way.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

====================================================================
1.
DOCUMENTATION

- File DESIGN.DOC; deliverables SD131 and SD132 can also be handy
  OK
- Language of doc file: English
  OK
- Contact person: name, address, affiliation
  OK
- Number of CDs
  OK, section 1
- Contents of each CD
  OK, section 1.3
- The directory structure of the CDs
  OK, section 1.3
- Description of all the items in the corpus
  OK, section 3
  It is recommended to include the lists of credit card numbers (section 3.3.3), PIN codes (section 3.3.4), relative date expressions (section 3.4.3), carrier sentences for application words (section 3.5), city names (section 3.9.3), company names (section 3.9.4), and forename/surname combinations (section 3.9.5) as appendices in DESIGN.DOC.
- Prompting
  . linguistic specification (and motivation) for the prompting material (in case of additional optional items)
  . connection of sheet items to item numbers on CD
  . sheet example
  . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after one another are not allowed)
  OK, sections 2.3 (strategy), 8 (example sheets), 1.2 (link to database items)
- Naming conventions for directories and files
  OK, sections 1.2 and 1.3
- Speaker recruitment
  OK, section 2.2
- Speaker demographics
  . which regions, how many of each
  . motivation for selection of regions
  . which age groups, how many of each
  . sexes: males, females, also children?; how many of each
  . each call is made by a unique speaker
  OK, section 4
=> A row of totals in the table in section 4.2 is missing.
=> Information on how many speakers did multiple calls
=> (and how many) is missing.
- Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones)
  OK, section 5
- Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich words (either of phones, biphones, triphones)
  OK, section 5
- Recording platform and telephone link description
  OK, section 2.1
- Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures)
  OK, section 1.1
- The format of the speech files (A-law, 8 bit, 8 kHz, uncompressed)
  OK, section 1.1
- The format of the annotation files (SAM label files)
  OK, section 1.4
- Annotation
  . procedure
  . quality assurance
  . character set used for annotation (transcription) (ISO-8859)
  . annotation symbols for non-speech acoustic events must be mentioned at least for Filled Pause, Speaker Noise, Stationary Noise, Intermittent Noise
  . list of symbols used to denote word truncations, mispronunciations and not understandable speech
  . case sensitivity of transcriptions
  OK, section 2.4
- Lexicon information
  . procedures to obtain phonemic forms from orthographic input (lexicon generation and layout)
  . (reference to) SAMPA symbols used
  . case sensitivity of entries (matching the transcriptions)
  OK, section 5
- Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included (SPELLALT.DOC); otherwise, a statement of why such a list is not necessary.
  OK, section 2.4
- Information on test (set) specification
=> No information on test set
- Indication of how many of the files were double checked by the producer together with percentage of detected errors
  OK, section 2.4
- The validation report made by SPEX (VALREP.TXT) is referred to
=> There is no reference to the validation report

==========================================================================
2.
DATABASE STRUCTURE CONTENTS AND FILE NAMES - Directory / subdirectory conventions Format of directory tree should be \\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They correspond to the first two digits of below. . session: defined as SES where is the session code also appearing in file name OK - All text files should be in MS-DOS format ( at line ends OK - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. => A1TRNPT.SES and A1TSTPT.SES are not listed in the README.TXT - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED1EN_01. OK - A copyright statement should be present in the file COPYRIGH.TXT (root) OK - Documentation should be in \\DOC . DESIGN.DOC . TRANSCRIP.DOC (optional) . SPELLALT.DOC (optional) . SAMPALEX.PS . ISO8859<1,2,7>.PS . SUMMARY.TXT . SAMPSTAT.TXT OK Extra files are: MANUAL.HTM = manual for transcribers TRANSCR.HTM = transcription conventions TABLE.HTM = explanation of file nomenclature => MANUAL.HTM contains obsolete non-speech symbols for [fil], [spk], [sta], => [int] - The contents list (CONTENTS.LST) is in \\INDEX OK - Tables should be in \\TABLE . SPEAKER.TBL . LEXICON.TBL . REC_COND.TBL (optional) . SESSION.TBL (optional) OK, SPEAKER.TBL, SESSION.TBL and LEXICON.TBL are delivered - Index files (optional) should be in \\INDEX OK - The index files (if present) obey the nomenclature .LST where e.g. 
A1ENN3.LST (see below for item_code) Not used - Prompt sheet files (optional) should be in \\PROMPT Not present - All sessions indicated in the documentation SUMMARY.TXT are present on the CDs OK - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type A is for A-law O is for Orthographic label file OK - Correct item codes should be used: A1-3/6: common application words B1 : sequence of isolated digits C1 : prompt sheet number C2 : telephone number C3 : credit card number C4 : PIN code D1-3 : dates E1 : application word phrase I1 : isolated digit L1-3 : spelled words M1 : money amount N1 : natural number O1 : spontaneous name O2 : city of call/birth O3 : most frequent city name O5 : most frequent company/agency name O7 : forename & surname Q1-2 : yes/no questions S1-9 : phonetically rich sentences T1 : time of day T2 : time phrase W1-4 : phonetically rich words OK, see also section 3 - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. zero-length) files are not permitted OK - Missing items per speaker Check with documentation (SUMMARY.TXT) OK - File match: For each label file there must be one speech file and vice versa. OK - Part of the corpus is designed for training and a smaller part for testing. OK, in the INDEX directory A1TRNPT.SES and the test set A1TSTPT.SES are present. The 500 calls selected for testing are all from existing calls The remaining 3482 speakers for training exist and do not overlap with the testset. => Speaker codes were used in the .SES files instead of session numbers. 
=> The line "SCD" in A1TRNPT.SES is illegal. - All table files, and index files should report the field names as the first row in the files using tabs as in the data records following. OK - The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full pathname (DIR:) . speech file name (SRC:) . corpus code (CCD:) . corpus repetition (CRP:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . speaker accent (ACC:) . orthographic transcription of uttered item (LBO:) The first line should be a header specifying the information in each record. This file must be supplied as an ASCII TAB delimited file. OK - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically N codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all these fields are separated by spaces . Note: The contents of the SUMMARY.TXT file are not CD-dependent OK, except that => The recording time is not of the first item but of a later one. => The format of the time field is hh:mm instead of hh:mm:ss ====================================================================== 3. ITEMS - 1 isolated digit (code I1) . read or prompted OK The prompt contains the orthographic instead of the numeric representation of the word. - 1 sequence of 10 isolated digit (code B1) . each sequence must include all digits . optional are hash and star OK, no hash or star used - 4 connected digits (code C1-4) - 4-6 digit number to identify the prompt sheet . read - ~10 digit telephone number . read . local numbers . inclusion of GSM numbers recommended - 14-16 digit credit card number . read . set of 150 . if there is a checksum then formula must be provided - 6 digit PIN code . read . set of 150 . ~30 digits per call are required . 
digits must appear numerically on the sheet, not as words OK C2 does not include GSM numbers. - 1 natural number (code N1) . read . provided as numbers (numerically) . numbers must be < 1,000,000 . decimal numbers only allowed for additional natural numbers OK Also numbers larger than 1,000,000 were used - 1 money amount (code M1) . read . currency words should be included . mixture of small amount including decimals and large amounts not including decimals OK Decimals in currencies were not used - 3 spelled words (code L1-3) . L1 is spontaneous name spelling linked to O1 . others are read . equal balance of all vocabulary letters artificial words can be used to enforce this balance . average length at least 7 letters . may include names, cities and other frequently spelled items . should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN OK The average length of the spelt words is 6.9 characters. L2 is linked to O2 - 1 time of day (code T1) . spontaneous OK - 1 time phrase (code T2) . read . analogue form . equal balance of all words . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW OK - 1 date (code D1) . spontaneous OK - 1 date (code D2) . read, wordstyle . analogue form . covering all weekdays and months, ordinals and year expressions (also exceeding 2000) OK Each weekday is represented by > 500 samples Each month is represented by > 300 samples Each year is represented by > 150 samples - 1 relative date (code D3) . read . analogue . should include forms such as TODAY, TOMORROW, THE DAY AFTER TOMORROW, THE NEXT DAY, THE DAY AFTER THAT, NEXT WEEK, GOOD FRIDAY, EASTER MONDAY, etc. OK - 2 yes/no questions (code Q1-2) . spontaneous, not prompted . one question should elicit (predominantly) 'no' answers; the other (predominantly) 'yes' answers . 
also fuzzy answers should be envisaged OK - 3/6 common application words (code A1-3/6) . read . set of 30 should be used, 25 of which are fixed for all . minimum number of examples of each word = #speakers/10 . 6 are needed, but only 3 for 4000+ FDBs OK, all 25 fixed words are recorded (recall and redial are assumed to be the same functions). All 30 words have about 400 realisations. - 1 application word phrase (code E1) . application word is embedded in phrase . read or spontaneous OK - 9 phonetically rich sentences (code S1-9) . read . minimum number of phone examples = #speakers/10 OK, The frequency of /a~/ is 9, but this is not a regular SAMPA symbol (see section 6). - 4 phonetically rich words (code W1-4) . read . minimum number of phone examples = #speakers/5 OK, => The frequency of /u~/ is 330, which is somewhat less than 400. - 5 directory assistance names (code O1-7) . 1 spontaneous name (e.g. forename) . 1 spontaneous city name . 1 read city name (from list of 500 most frequent) . 1 read company/agency name (from list of 500 most frequent) . 1 read proper name, fore- and surname (from list of 150 SDB names) OK The following completeness checks are performed on obligatory SpeechDat items only: 1. Structurally missing items All obligatory items are recorded. There are no additional, optional items. 2. Incidentally missing items a. files that are not there No missing files were found b. files with empty transcriptions in the LBO label field (effectively missing files) We found 490 files that have only noise symbols and/or ** in their transcriptions. The distribution of these files over the items is: 15 A1 12 A2 5 A3 4 B1 4 C1 14 C2 13 C3 9 C4 10 D1 4 D2 2 D3 4 E1 12 I1 48 L1 11 L2 14 L3 10 M1 7 N1 36 O1 35 O2 3 O3 41 O5 2 O7 17 Q1 27 Q2 5 S1 10 S2 4 S3 7 S4 4 S5 5 S6 3 S7 13 S8 12 S9 52 T1 7 T2 7 W1 8 W2 10 W3 10 W4 c. 
corrupted speech files
If we regard utterances which have only truncated or mispronounced words as corrupted files, and merge these with the effectively missing files under b., then the following distribution emerges:

    63 A1    63 A2    46 A3     4 B1     4 C1
    14 C2    13 C3     9 C4    10 D1     4 D2
     9 D3     4 E1    56 I1    50 L1    11 L2
    14 L3    11 M1     7 N1   136 O1    64 O2
    61 O3   169 O5     2 O7    46 Q1    59 Q2
     6 S1    10 S2     4 S3     7 S4     4 S5
     5 S6     4 S7    13 S8    12 S9    52 T1
     8 T2   140 W1   133 W2   104 W3   114 W4

(This will not be used to reject or approve a database but it will be supplied as supplementary information.)

d. files containing truncation and mispronunciation marks
(*, ** and ~ are counted in the transcriptions of the individual items to get an idea of distorted speech data. This will not be used to reject or approve a database but it will be supplied as supplementary information.)
We found 7804 transcriptions with at least one *, or **, or ~, according to the following distribution over the items:

   A1: 111   A2: 147   A3:  85   B1:  56   C1:  27
   C2: 114   C3: 167   C4:  62   D1: 130   D2: 189
   D3: 102   E1: 173   I1: 102   L1: 157   L2:  71
   L3:  98   M1: 106   N1: 165   O1: 382   O2: 147
   O3: 110   O5: 226   O7: 105   Q1: 103   Q2: 116
   S1: 308   S2: 410   S3: 265   S4: 567   S5: 181
   S6: 488   S7: 324   S8: 418   S9: 673   T1: 164
   T2:  81   W1: 198   W2: 182   W3: 141   W4: 153

3. Overall conclusion
SpeechDat has the following criteria for missing items:
. At least 95% of the files of each mandatory item (corpus code) must be present.
. As missing files are counted: absent files, and files containing non-speech events only.
. There will be no further comparison of prompt and transcription text in order to decide if a file is effectively missing.
As a consequence: If there is some speech in the transcription, then the file will NOT be considered missing, even if it is in fact useless.
For a database of 4000 calls a maximum of 5% * 4000 = 200 files per item may be missing. For the decision of completeness of an item the distribution given in 2b above should be used.
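The 5% criterion can be checked mechanically. The following is a minimal sketch and not part of the SPEX validation software: it parses the "count item" distributions of 2b and 2c as printed in this report and lists any item whose count exceeds 5% of 4000 calls. The function names and the input layout are assumptions made for illustration.

```python
# Illustrative check of the SpeechDat 5% completeness criterion.
# Counts are copied from sections 2b and 2c of this report;
# all helper names are hypothetical.

N_CALLS = 4000
MAX_MISSING = N_CALLS * 5 // 100  # 5% * 4000 = 200 files per item

MISSING_2B = """15 A1 12 A2 5 A3 4 B1 4 C1 14 C2 13 C3 9 C4 10 D1 4 D2
2 D3 4 E1 12 I1 48 L1 11 L2 14 L3 10 M1 7 N1 36 O1 35 O2
3 O3 41 O5 2 O7 17 Q1 27 Q2 5 S1 10 S2 4 S3 7 S4 4 S5
5 S6 3 S7 13 S8 12 S9 52 T1 7 T2 7 W1 8 W2 10 W3 10 W4"""

MERGED_2C = """63 A1 63 A2 46 A3 4 B1 4 C1 14 C2 13 C3 9 C4 10 D1 4 D2
9 D3 4 E1 56 I1 50 L1 11 L2 14 L3 11 M1 7 N1 136 O1 64 O2
61 O3 169 O5 2 O7 46 Q1 59 Q2 6 S1 10 S2 4 S3 7 S4 4 S5
5 S6 4 S7 13 S8 12 S9 52 T1 8 T2 140 W1 133 W2 104 W3 114 W4"""


def parse_counts(text):
    """Turn 'count item count item ...' into {item: count}."""
    tokens = text.split()
    return {item: int(count) for count, item in zip(tokens[0::2], tokens[1::2])}


def items_over_limit(counts, limit=MAX_MISSING):
    """Return the items whose count exceeds the allowed number of missing files."""
    return sorted(item for item, n in counts.items() if n > limit)


if __name__ == "__main__":
    for label, text in (("2b", MISSING_2B), ("2c", MERGED_2C)):
        counts = parse_counts(text)
        worst = max(counts, key=counts.get)
        print(label, "worst item:", worst, counts[worst],
              "over limit:", items_over_limit(counts))
```

With the counts above, the largest value is 52 (T1) in 2b and 169 (O5) in 2c, so no item exceeds the limit of 200 in either distribution.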
It is clear from this distribution that none of the items effectively misses 5% or more out of all realisations. This even holds if the recordings with corrupted speech only are included. Thus, it appears that all items are sufficiently complete.

===========================================================================
4. SAMPLED DATA FILES

1 Coding
  . A-law, 8 bit, 8 kHz, no compression
  OK

2 Sample distribution
Several sample statistics are generated: file length, clipping rate, mean sample value, Signal-to-Noise Ratio (SNR). Statistics were generated on file level by the producer of the database, using SPEX software. The results were delivered to SPEX. SPEX compiled histograms on the basis of these results. These histograms are presented below, both on file level and on directory (call) level. The histograms are presented as they are and not further interpreted by SPEX. On the basis of these data the user of the database should be able to decide which acoustic quality is still acceptable for the application at hand.

Statistics on the acoustics of individual speech files can be retrieved from file \DOC\SAMPSTAT.TXT. The columns in SAMPSTAT.TXT have the following meaning:

file max min #samples cliprate mean snr
A11001C2.ENA:16384:-13056:80000: 0.00: -4.28: 35.89

2.1 File length
We calculated the length of the files in seconds in order to trace spurious recordings if files were of extraordinary length.
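As an illustration of how a file length follows from a SAMPSTAT.TXT record, the sketch below parses one colon-separated record in the format shown above and divides the sample count by the 8 kHz sampling rate. It is not the SPEX software itself: the function names, the field names and the bin edge convention (bin k covers k <= length < k+1) are assumptions.

```python
# Illustrative only: derive file duration from a SAMPSTAT.TXT record.
from collections import Counter

SAMPLE_RATE_HZ = 8000  # A-law, 8 bit, 8 kHz as stated in this report
FIELDS = ("file", "max", "min", "samples", "cliprate", "mean", "snr")


def parse_record(line):
    """Split 'file:max:min:#samples:cliprate:mean:snr' into a dict."""
    parts = [p.strip() for p in line.split(":")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    rec = dict(zip(FIELDS, parts))
    for key in ("max", "min", "samples"):
        rec[key] = int(rec[key])
    for key in ("cliprate", "mean", "snr"):
        rec[key] = float(rec[key])
    return rec


def duration_seconds(rec):
    """File length in seconds: number of samples / sampling rate."""
    return rec["samples"] / SAMPLE_RATE_HZ


def duration_histogram(records):
    """Count files per 1-second bin (assumed convention: k <= length < k+1)."""
    return Counter(int(duration_seconds(r)) for r in records)


if __name__ == "__main__":
    rec = parse_record("A11001C2.ENA:16384:-13056:80000: 0.00: -4.28: 35.89")
    print(rec["file"], duration_seconds(rec), "s")
```

For the example record, 80000 samples at 8 kHz correspond to a file length of 10 seconds.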
Duration distribution over all items:

Length (s)    #Occurrences
  0 -  1   :    6396
  1 -  2   :   43574
  2 -  3   :   22439
  3 -  4   :   21634
  4 -  5   :   19320
  5 -  6   :   13838
  6 -  7   :    9802
  7 -  8   :    6981
  8 -  9   :    4885
  9 - 10   :    3532
 10 - 11   :    2170
 11 - 12   :    1239
 12 - 13   :     882
 13 - 14   :     723
 14 - 15   :    2089
 15 - 16   :     163
 16 - 17   :     119
 17 - 18   :     147
 18 - 19   :     180
 19 - 20   :     957
 20 - 21   :       8

Duration distribution over calls/directories:

Length (s)    #Occurrences
  2 -  3   :     349
  3 -  4   :    2261
  4 -  5   :     956
  5 -  6   :     232
  6 -  7   :      66
  7 -  8   :      33
  8 -  9   :      19
  9 - 10   :      13
 10 - 11   :      12
 11 - 12   :      13
 12 - 13   :       5
 13 - 14   :       6
 14 - 15   :       8
 15 - 16   :      54

The number of calls with an average length between 15-16 secs (54 sessions) is striking. Probably these have a high background noise level.

2.2 min-max samples
We provide a histogram with clipping ratios. The clipping ratio is defined as the number of samples in a file that are equal to the maximum/minimum value, divided by the total number of samples in the file. The histogram, then, is an overview of how many files were found in a set of clipping rate intervals.

Clip distribution for all items:

Clipping rate (in %)    Occurrences
 0.0 - 0.1   :   55000
 0.1 - 0.2   :   30545
 0.2 - 0.3   :   19374
 0.3 - 0.4   :    9348
 0.4 - 0.5   :    6467
 0.5 - 0.6   :    3805
 0.6 - 0.7   :    1929
 0.7 - 0.8   :     943
 0.8 - 0.9   :     572
 0.9 - 1.0   :     258
 1.0 - 1.1   :     125
 1.1 - 1.2   :      56
 1.2 - 1.3   :      30
 1.3 - 1.4   :      11
 1.4 - 1.5   :       8
 1.5 - 1.6   :       1
 1.6 - 1.7   :       2
 2.1 - 2.2   :       1

Number of files with absolute maximum < 32256: 32603

Clip distribution over calls/directories:

Clipping rate (in %)    Occurrences
 0.0 - 0.1   :    1741
 0.1 - 0.2   :    1202
 0.2 - 0.3   :     626
 0.3 - 0.4   :     279
 0.4 - 0.5   :      85
 0.5 - 0.6   :      25
 0.6 - 0.7   :       6
 0.7 - 0.8   :       2

Number of directories with absolute maximum < 32256: 61
There is no call with an average clipping rate above 1.0%.

2.3 Mean values
We computed the mean sample value of each item in each call. We provide a histogram with mean values below. The histogram, then, is an overview of how many files were found in a set of mean sample value intervals.
This overview can be used to trace files with large DC-offsets. Mean distribution over all items: Mean Occurrences -2250 - -2240 : 1 -2210 - -2200 : 1 -2130 - -2120 : 1 -1940 - -1930 : 2 -1930 - -1920 : 9 -1920 - -1910 : 11 -1910 - -1900 : 4 -1900 - -1890 : 3 -1890 - -1880 : 5 -1880 - -1870 : 1 -1870 - -1860 : 3 -1850 - -1840 : 2 -1820 - -1810 : 1 -1560 - -1550 : 1 -1520 - -1510 : 1 -1480 - -1470 : 1 -1330 - -1320 : 1 -1210 - -1200 : 1 -1100 - -1090 : 2 -910 - -900 : 2 -890 - -880 : 1 -880 - -870 : 1 -850 - -840 : 1 -810 - -800 : 1 -800 - -790 : 1 -730 - -720 : 1 -680 - -670 : 1 -670 - -660 : 74 -660 - -650 : 190 -650 - -640 : 16 -640 - -630 : 56 -630 - -620 : 37 -620 - -610 : 31 -610 - -600 : 12 -600 - -590 : 5 -590 - -580 : 2 -580 - -570 : 10 -570 - -560 : 39 -560 - -550 : 4 -550 - -540 : 2 -540 - -530 : 70 -530 - -520 : 44 -520 - -510 : 39 -510 - -500 : 31 -500 - -490 : 26 -490 - -480 : 122 -480 - -470 : 96 -470 - -460 : 99 -460 - -450 : 110 -450 - -440 : 110 -440 - -430 : 140 -430 - -420 : 86 -420 - -410 : 116 -410 - -400 : 128 -400 - -390 : 170 -390 - -380 : 128 -380 - -370 : 90 -370 - -360 : 86 -360 - -350 : 48 -350 - -340 : 84 -340 - -330 : 115 -330 - -320 : 138 -320 - -310 : 70 -310 - -300 : 87 -300 - -290 : 179 -290 - -280 : 161 -280 - -270 : 117 -270 - -260 : 188 -260 - -250 : 206 -250 - -240 : 198 -240 - -230 : 184 -230 - -220 : 129 -220 - -210 : 136 -210 - -200 : 138 -200 - -190 : 190 -190 - -180 : 236 -180 - -170 : 264 -170 - -160 : 240 -160 - -150 : 267 -150 - -140 : 266 -140 - -130 : 362 -130 - -120 : 352 -120 - -110 : 529 -110 - -100 : 498 -100 - -90 : 521 -90 - -80 : 658 -80 - -70 : 817 -70 - -60 : 908 -60 - -50 : 1014 -50 - -40 : 1206 -40 - -30 : 1697 -30 - -20 : 2183 -20 - -10 : 3339 -10 - 0 : 5217 0 - 10 : 8258 10 - 20 : 17157 20 - 30 : 26636 30 - 40 : 45032 40 - 50 : 28038 50 - 60 : 6613 60 - 70 : 1814 70 - 80 : 764 80 - 90 : 397 90 - 100 : 267 100 - 110 : 200 110 - 120 : 136 120 - 130 : 102 130 - 140 : 81 140 - 150 : 71 150 - 160 : 50 160 - 
170 : 51 170 - 180 : 28 180 - 190 : 30 190 - 200 : 26 200 - 210 : 17 210 - 220 : 21 220 - 230 : 16 230 - 240 : 14 240 - 250 : 11 250 - 260 : 9 260 - 270 : 6 270 - 280 : 9 280 - 290 : 12 290 - 300 : 10 310 - 320 : 7 320 - 330 : 3 330 - 340 : 3 340 - 350 : 1 350 - 360 : 2 360 - 370 : 1 370 - 380 : 2 400 - 410 : 1 420 - 430 : 3 460 - 470 : 4 470 - 480 : 2 490 - 500 : 1 530 - 540 : 1 Mean distribution over calls/directories: Mean Occurrences -1910 - -1900 : 1 -670 - -660 : 1 -660 - -650 : 6 -640 - -630 : 1 -630 - -620 : 1 -620 - -610 : 1 -580 - -570 : 1 -570 - -560 : 1 -540 - -530 : 2 -520 - -510 : 2 -490 - -480 : 3 -480 - -470 : 4 -460 - -450 : 2 -450 - -440 : 7 -440 - -430 : 3 -430 - -420 : 1 -420 - -410 : 3 -410 - -400 : 2 -400 - -390 : 4 -390 - -380 : 3 -380 - -370 : 3 -370 - -360 : 2 -350 - -340 : 1 -340 - -330 : 2 -330 - -320 : 5 -300 - -290 : 5 -290 - -280 : 4 -280 - -270 : 1 -270 - -260 : 5 -260 - -250 : 9 -250 - -240 : 3 -240 - -230 : 5 -230 - -220 : 2 -220 - -210 : 4 -210 - -200 : 4 -200 - -190 : 5 -190 - -180 : 6 -180 - -170 : 6 -170 - -160 : 3 -160 - -150 : 7 -150 - -140 : 11 -140 - -130 : 10 -130 - -120 : 8 -120 - -110 : 10 -110 - -100 : 14 -100 - -90 : 13 -90 - -80 : 26 -80 - -70 : 18 -70 - -60 : 17 -60 - -50 : 30 -50 - -40 : 30 -40 - -30 : 34 -30 - -20 : 31 -20 - -10 : 82 -10 - 0 : 128 0 - 10 : 198 10 - 20 : 408 20 - 30 : 737 30 - 40 : 1274 40 - 50 : 660 50 - 60 : 103 60 - 70 : 23 70 - 80 : 9 80 - 90 : 5 90 - 100 : 2 100 - 110 : 4 110 - 120 : 4 120 - 130 : 2 130 - 140 : 1 150 - 160 : 2 160 - 170 : 1 180 - 190 : 1 There is a bias for the average sample to be positive instead of 0. The call with the very low average sample value (-1903.8) is session 3095. But this call is OK. 2.4 Signal to Noise Ratio We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before MS was computed. 
5% of the windows that contained the lowest energy were assumed to contain line noise. In this way the signal to noise ratio could be calculated for each file by dividing the mean energy over all windows by the mean energy of the 5% of windows mentioned above. The logarithm of this ratio was multiplied by 10 (10*log10) to express the SNR in dB.

SNR distribution over all items:

SNR (dB)      Occurrences
  0 -  5   :      93
  5 - 10   :     361
 10 - 15   :    1023
 15 - 20   :    2054
 20 - 25   :    4835
 25 - 30   :   14624
 30 - 35   :   33868
 35 - 40   :   56149
 40 - 45   :   41975
 45 - 50   :    5591
 50 - 55   :     401
 55 - 60   :      66
 60 - 65   :      18
 65 - 70   :      12
 70 - 75   :       3
 75 - 80   :       2
 80 - 85   :       2
 85 - 90   :       1

SNR distribution over calls/directories:

SNR (dB)      Occurrences
  0 -  5   :       1
  5 - 10   :       7
 10 - 15   :      20
 15 - 20   :      49
 20 - 25   :      92
 25 - 30   :     350
 30 - 35   :     847
 35 - 40   :    1566
 40 - 45   :    1018
 45 - 50   :      74
 50 - 55   :       3

The call with the very low average SNR of 2.5 dB was session 2910. This call is OK, but has enormous silent portions.
The calls with average SNR between 5-10 dB are sessions:
   0062: OK, but heavy buzz and long files
   0571: OK, but heavy buzz and long files
=> 2486: OK, but background noise is very heavy. This call is unsuited for
=> training purposes
   3095: OK, long files
=> 4729: OK, but background noise is very heavy. This call is unsuited for
=> training purposes
=> 4758: OK, but background noise is very heavy. This call is unsuited for
=> training purposes
=> 4771: OK, but background noise is very heavy. This call is unsuited for
=> training purposes

===========================================================================
5. ANNOTATION FILE

- Each line must be delimited by
  OK
- Mandatory (SAM) mnemonics:
  LHD: SAM, 5.10
  DBN: SPEECHDAT__Fixed_Network
  VOL: FIXED1_
  SES:
  DIR:
  SRC:
  CCD:
  CRP: < = corpus repetition, empty>
  REP:
  RED:
  RET:
  SAM: 8000 < = sampling freq.>
  BEG:
  END:
  SNB: 1 < = number of bytes per sample>
  SBF: < = sample byte order, meaningless with single bytes>
  SSB: 8 < = number of significant bits per sample>
  QNT: A-LAW < = quantisation>
  SCD:
  SEX: M/F/UNKNOWN
  AGE: !
mnemo is not SAM
  ACC: ! mnemo is not SAM
  REG:
  ENV:
  LBD:
  LBR: , , [gain], [minimum value], [maximum value],
  LBO: , [centre sample], ,
  EXT: 80 chars on one line>
  ELF:
- Optional (SAM) mnemonics (may be omitted or left empty)
  TYP: orthographic
  TXF:
  CMT:
  NCH: 1 < = number of channels recorded>
  ARC: ! mnemo is not SAM
  SHT: ! mnemo is not SAM
  CMP:
  EXP:
  SYS:
  DAT:
  SPA:
  PHM: ! mnemo is not SAM
  NET: PSTN < = network> ! mnemo is not SAM
  DSC: < = discontinuity marker>
  EDU: ! mnemo is not SAM
  SOC: ! mnemo is not SAM
  HLT:
  TRD:
  RCC:
  ASS: ! mnemo is not SAM
- Order restrictions:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  OK
- All mnemonics should be SAM mnemonics or explicitly defined in documentation
  OK
=> one minor error was found: A11421O2.PTO has an empty line at file end
  The following optional (additional) mnemonics are used: SHT, ASS
- No illegal mnemonics used
  OK
- There are no mnemonics missing
  OK
- All files must contain the same mnemonics. This holds as well for the optional mnemonics.
  OK
- No illegal field values should appear
=> For the files in session 4000 "9/Feb/1998" is used for RED,
=> instead of "09/Feb/1998"
  For ENV the value "HOME/OFFICE" was used, which should have been separate qualifications. But this is not a problem.
=> Session 1820 mentions two ages: "13" and "unknown"
- No line may exceed 80 chars
=> In one file the LBO line was 85 characters long: A10598S7.PTO
- Each lowest subdirectory does not refer to multiple sheet ids.
  OK
- For spontaneous speech LBR should contain a mnemonic word.
  D1 :
  L1 :
  O1 :
  O2 :
  Q1 : or
  Q2 : or
  T1 :