TC-STAR 2007 ASR EVALUATION PLAN (08-Jan-07; updated on 15-Jan-2007)

1. Introduction

This document defines the ASR tasks, the evaluation conditions, the development and test corpora, the scoring procedures, and the agenda for the 2007 TC-STAR ASR evaluation. The 2007 ASR evaluation is carried out in conjunction with the 2007 TC-STAR SLT evaluation and takes the SLT requirements into account.

The 2007 evaluation plan supports ASR evaluation for three languages and two tasks:
- Parliament Sessions for English and Spanish
- Broadcast News for Mandarin Chinese

ASR English systems will be evaluated on recordings of the European Parliament Plenary Sessions (EPPS). ASR Spanish systems will be evaluated on EPPS data as well as on recordings of the Spanish National Parliament (CORTES). ASR Mandarin Chinese systems will be evaluated on broadcast emissions of the Voice of America radio.

Participants must submit a complete result for at least one of the three tasks (EPPS English, EPPS/CORTES Spanish, and BN Mandarin). All the segments of a given test set (there are three test sets) must be processed in their entirety using the same system, without any manual intervention, i.e. the same system and the same tuning must be used for the Spanish EPPS data and the CORTES data. Participants can submit as many contrastive results as they like. For the EPPS tasks, participants are strongly encouraged to submit systems trained on the restricted set of corpora.

For English and Spanish, word error rates will be computed under the following conditions:
- case sensitive with punctuation
- case sensitive without punctuation
- case insensitive without punctuation (primary scoring)

The case-sensitive and punctuation conditions are not mandatory but are recommended. For Mandarin Chinese, the character error rate will be computed with and without punctuation.
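To make the three conditions concrete, the sketch below shows one possible normalization of hypothesis tokens. This is an illustration only, not the official scoring pipeline (which uses the NIST tools and GLM filtering described in the scoring section); the function and parameter names are ours. It assumes punctuation marks appear as standalone tokens, as in the CTM convention described later.

    import string

    # Punctuation marks emitted as standalone tokens (cf. the CTM example below).
    PUNCT = set(string.punctuation)

    def normalize(tokens, case_sensitive=False, keep_punctuation=False):
        """Map hypothesis tokens to one of the three scoring conditions."""
        if not keep_punctuation:
            # Drop standalone punctuation tokens such as "." or ","
            tokens = [t for t in tokens if t not in PUNCT]
        if not case_sensitive:
            tokens = [t.lower() for t in tokens]
        return tokens

    # Primary scoring condition: case insensitive, without punctuation.
    hyp = ["Who", "is", "with", "me", "today", "."]
    print(normalize(hyp))   # ['who', 'is', 'with', 'me', 'today']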
2. The ASR tasks and languages

The TC-STAR'07 evaluation will include data sets in English, Spanish and Mandarin. Two tasks are supported for this evaluation, namely the Parliament Session task for English (EPPS) and Spanish (EPPS and Spanish Parliament) and the Broadcast News (BN) task for Mandarin. The following test sets will be used for the TC-STAR'07 evaluation:
- European Parliament Plenary Sessions in English (EPPS_EN): a 3h test set from June to September 2006, containing only original speeches (i.e. not the interpreters' speech).
- European Parliament Plenary Sessions in Spanish (EPPS_SP), provided by ELDA: a 3h test set from June to September 2006, containing only original speeches.
- National Parliament Sessions in Spanish (PARL_SP), provided by ELDA: a 3h test set from June 2006.
- Broadcast News in Mandarin (BN_MAN), with automatic segmentation: about 3h of VOA material taken from the LDC TDT3 corpus, from December 1998.

Participants can build systems for any processing speed; there are no specific speed categories. However, participants must report the total time (wall-clock elapsed time) needed to process the data for each submitted system. The real-time factor will be included in the tabulated results along with the word/character error rates.

3. Processing rules

The evaluated systems must be fully automatic, requiring no manual intervention that has an impact on the system output. Systems will be provided with audio files (16 kHz sampling rate, 16-bit samples, mono) in a standard format. Unless specified in this document, no other information about the test data can be used. Supervised model adaptation on the test data is not allowed. Data material (audio, texts, etc.) generated after the training cut-off date (or during the blackout period) cannot be used for system training or development (see the evaluation schedule), with the exception of the official development data. For broadcast news, the show identity and the broadcast date are allowable side information that systems may use. For the English and Spanish tasks, all training material must predate May 31st, 2006. For the Mandarin broadcast news task, developers cannot use any training data from the month of December 1998.

No manual segmentations will be provided. A NIST Unpartitioned Evaluation Map (UEM) file will be provided for EPPS English and EPPS/CORTES Spanish.

Participants can use data collected within TC-STAR and any publicly available data (essentially from LDC and ELDA) predating the training cut-off date (see the schedule below) for system development. They can also use any other data they may find useful, under the condition that this data predates the cut-off date (or does not fall in the blackout period for BN in Mandarin) and that they also submit results for a system using only TC-STAR and/or public data. In all cases, the training data should be fully and unambiguously documented in the system description. It follows that for the EPPS tasks three training conditions can be distinguished:
1) a restricted training condition, for which systems must be trained only on data collected within the TC-STAR project and listed in the next section;
2) a public data condition, for which systems can be trained on any publicly available data;
3) an open condition, where the only constraint concerns the cut-off date of the training data.

Participants must submit a complete result for at least one of the three languages (English, Spanish and Mandarin). They can submit as many contrastive results as they like. For the Parliament tasks, participants are strongly encouraged to submit systems trained on the restricted set of corpora.
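Since no manual segmentation is provided, the UEM file is the official description of the regions to be processed. For illustration, assuming the usual NIST UEM layout (one scoreable region per line: file identifier, channel, start and end times in seconds, with ";;" introducing comment lines), such a file can be read as follows (the function name is ours):

    def read_uem(path):
        """Return a list of (file_id, channel, start, end) scoreable regions."""
        regions = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith(";;"):  # skip blanks and comments
                    continue
                file_id, channel, start, end = line.split()
                regions.append((file_id, int(channel), float(start), float(end)))
        return regions

    # Each region is to be processed automatically; anything outside
    # these bounds is excluded from scoring.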
4. Data for the 2007 evaluation

There is no new development data for 2007. Therefore previous development and evaluation data (dev06, eval06, dev05, eval05) can be used for system development. The following table summarizes the main attributes of the development and evaluation data for the TC-STAR 2007 evaluation.

+==========+==========+========+===============+========+===========+
| Language | DataType | Domain | Epoch         | Amount | Delivery  |
+==========+==========+========+===============+========+===========+
| English  | Dev      | EPPS   | Oct04;Nov04;  | 12h    | available |
|          |          |        | Jun05;Sept05  |        |           |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Dev      | EPPS   | Oct04;Nov04;  | 12h    | available |
|          |          |        | Jun05;Sept05; |        |           |
|          |          |        | Oct05;Nov05   |        |           |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Dev      | PARL   | Dec04;Nov05   | 6h     | available |
+----------+----------+--------+---------------+--------+-----------+
| Mandarin | Dev      | BN     | Dec 1998      | 12h    | available |
+==========+==========+========+===============+========+===========+
| English  | Eval     | EPPS   | Jun06-Sept06  | 3h     | 21 Jan 07 |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Eval     | EPPS   | Jun06-Sept06  | 3h     | 21 Jan 07 |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Eval     | PARL   | Jun06-Sept06  | 3h     | 21 Jan 07 |
+----------+----------+--------+---------------+--------+-----------+
| Mandarin | Eval     | BN     | Dec 98        | 3h     | 21 Jan 07 |
+==========+==========+========+===============+========+===========+

The 2006 EPPS development and test data comprise ONLY original speeches and NOT the translated speech from interpreters. The 2005 EPPS development and test data include both interpreters' and politicians' speeches. The 2007 test data is of the same kind as the 2006 data: it includes ONLY the politicians' original speeches.

The following is a non-exhaustive list of corpora that participants may want to use to train their acoustic models:
- English TC-STAR EPPS, about 100h of transcribed data (May 2004 - Jan 2005)
- English TC-STAR EPPS, about 200h of untranscribed data (Jan 2005 - May 2006)
- English LDC 1995 (CSR-IV Hub 4 Marketplace LDC96S31), 1996, 1997, official NIST Hub4 training sets, LDC97S44 and LDC98S71, USC Marketplace Broadcast News Speech (LDC99S82)
- English LDC TDT2 and TDT3 data with closed captions, about 2000h, LDC99S84 and LDC2001S94
- English LDC Switchboard 1, 2-I, 2-II, 2-III, LDC97S62, LDC98S75, LDC99S79
- English LDC CallHome, LDC97S42
- English LDC Meeting corpora: ICSI LDC2004S02, ISL LDC2004S05, NIST LDC2004S09
- Spanish TC-STAR EPPS, about 60h of transcribed data (May 2004 - Jan 2005)
- Spanish TC-STAR EPPS, about 200h of untranscribed data (Jan 2005 - May 2006)
- Spanish TC-STAR CORTES, about 40h of transcribed data
- Spanish LDC 1997, BN speech (Hub4-NE), LDC98S74
- Spanish LDC CallHome, LDC96S35
- Mandarin LDC 1997, BN speech (Hub4-NE), about 30h of transcribed data, LDC98S73
- Mandarin TDT2 and TDT3 data with quick transcriptions, LDC2001S93 and LDC2001S95

The following corpora can be used for language model development:
- All transcriptions (detailed, quick or closed-caption) of the above-mentioned audio corpora
- English EPPS final transcriptions, about 36M words (from parallel texts)
- English UK Parliament text corpus, about 40M words (HANSARD)
- English LDC NAB text corpus
- English LDC Gigaword (over 1 billion words)
- Spanish EPPS final transcriptions, about 36M words (from parallel texts)
- Mandarin LDC news text, about 250 million GB-encoded text characters
- Mandarin LDC Gigaword, about 1.1 billion words

For more complete listings of possible corpora, participants are referred to the LDC and ELRA catalogs.

For the EPPS tasks, participants are encouraged to submit systems trained on only a restricted set of training corpora including:
- English TC-STAR EPPS, about 100h of transcribed data and 200h of untranscribed data
- English EPPS final transcriptions, about 36M words (from parallel texts)
- Spanish TC-STAR EPPS, about 60h of transcribed data and 200h of untranscribed data
- Spanish TC-STAR CORTES, about 40h of transcribed data
- Spanish EPPS final transcriptions, about 36M words (from parallel texts)
- Spanish Parliament transcriptions from 1979 to October 15th, 2004

More information is available at: http://www.elda.org/en/proj/tcstar-wp4/tcs-asr-run3.htm

5. System outputs

For each input audio file, the ASR hypotheses are to be formatted as a NIST CTM file, i.e. the concatenation of time-mark records, one per line (Unix text file), for each hypothesized word or punctuation mark. System outputs should be case sensitive, include punctuation marks and must use the UTF-8 encoding scheme. If punctuation marks are provided, each should appear on a separate line with the time code of the end of the previous word and with a duration of 0. Here is an example of a CTM file with a punctuation mark:

20050907_0900_1235_OR_SAT 1 322.768 0.120 who
20050907_0900_1235_OR_SAT 1 322.889 0.118 is
20050907_0900_1235_OR_SAT 1 323.011 0.194 with
20050907_0900_1235_OR_SAT 1 323.207 0.140 me
20050907_0900_1235_OR_SAT 1 323.353 0.470 today
20050907_0900_1235_OR_SAT 1 323.823 0.00 .

Systems are expected to use a single standardized spelling for each language. However, some filtering and mapping will be applied to the system output prior to scoring in order to take into account acceptable common alternate forms. Both American English and British English spellings will be allowed. In addition to reference dictionaries, the Internet may be searched to find the most common form of a word (usually a proper name). If no form is dominant, then more than one form will be allowed (cf. the GLM table in the scoring section). The system may use an optional hyphen to indicate the missing (unspoken) part of a word token. Filled-pause markers and non-speech markers should not be included in the system output for scoring; however, participants are encouraged to provide this information, which may be useful for the SLT systems.
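The zero-duration convention for punctuation is easy to get wrong, so here is a small illustrative sketch of one way to emit conforming records (the helper name and argument layout are ours, not part of the CTM specification):

    def write_ctm(path, file_id, channel, words):
        """words: list of (start_time, duration, token) tuples in time order.
        Punctuation tokens are passed with start_time=None and attached to
        the end time of the preceding word, with a duration of 0."""
        with open(path, "w", encoding="utf-8") as out:  # CTM files must be UTF-8
            prev_end = 0.0
            for start, dur, token in words:
                if start is None:                       # punctuation mark
                    start, dur = prev_end, 0.0
                else:
                    prev_end = start + dur
                out.write(f"{file_id} {channel} {start:.3f} {dur:.3f} {token}\n")

    # Reproduces the example above:
    write_ctm("example.ctm", "20050907_0900_1235_OR_SAT", 1,
              [(322.768, 0.120, "who"), (322.889, 0.118, "is"),
               (323.011, 0.194, "with"), (323.207, 0.140, "me"),
               (323.353, 0.470, "today"), (None, None, ".")])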
6. Scoring

A NIST Segment Time Marked (STM) reference file will be provided for the development and test sets (after the system submission). Following NIST practice, contractions will be expanded in the STM file, i.e. the annotator will choose the single most likely expansion for each contraction. Non-scoreable regions (such as untranscribed areas and overlapping speech) will be explicitly tagged in the STM file for exclusion from scoring.

Prior to scoring, a global mapping will be performed on both the reference and the system outputs via a set of rules specified in a global map (GLM) file. The GLM rules expand contractions and split compound words in the system output into all possible expanded forms. Following NIST practice, optionally deletable tokens in the STM file may be omitted by the speech recognizer. These tokens contribute to the count of reference tokens whether or not the system outputs them.

The CTM and STM files will be aligned (using dynamic programming) so as to minimize the word/character error rate. Scoring will be done using the NIST speech recognition scoring toolkit available at http://www.nist.gov/speech/tools. Specific filtering tables and GLM files will be developed for TC-STAR (one set per language). The primary scoring will be case insensitive. A hyphen within a token will be treated as a token separator.
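The alignment criterion can be illustrated with the classical dynamic-programming edit distance below. This is only a sketch of the principle, not a replacement for the NIST toolkit (which also applies GLM filtering and handles optionally deletable tokens); it operates on already-normalized token sequences:

    def word_error_rate(ref, hyp):
        """Minimum (substitutions + deletions + insertions) / len(ref),
        computed by dynamic programming over the two token sequences."""
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                            # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                            # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("who is with me today".split(),
                          "who was with me".split()))  # 2 errors / 5 words = 0.4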
7. Enriched system output

Participants should also provide (this is not required) a confidence score for each hypothesized word in the CTM file. This confidence score represents the system's estimate of the probability that the output token is correct. The quality of the confidence scores will be evaluated using the normalized cross entropy (NCE) score as reported by the NIST sclite tool. The confidence error rate (CER) will also be computed and reported.

In addition to the CTM file, participants are encouraged to provide n-best hypotheses and/or word graphs to be used by the SLT systems. Sites that plan to provide n-best hypotheses or word lattices must also provide these outputs for the development set in order to resolve issues related to file formats, vocabulary compatibility, segmentation, and decoding parameters. As these issues are not expected to be resolved for all provider-user pairs, ASR and SLT participants should team up to resolve them. Interfaces between ASR and SLT are not limited to n-best hypotheses and word lattices, so participants may consider alternative solutions for within-site and cross-site integration.

8. Processing time

Even though processing speed is not a major issue for the 2007 evaluation, participants must provide information about the processing time and the resources (memory, CPU type, clock frequency) used to run the ASR systems. This should be included in each system description. Participants should report elapsed time (i.e. not the real-time factor) for all steps if possible. ELDA will compute the processing speed as the ratio of the processing time to the official duration of the recorded audio data. The processing time is the total amount of elapsed time used to process the data on a single CPU, including I/O and all operations performed after first accessing the test data.

9. Result submission

The recorded waveform files to be processed will be distributed on CD-ROM along with the unpartitioned evaluation map (UEM) segmentation files for the EPPS tasks. For each submission, participants should send a compressed tar file to ELDA including the CTM files and the associated system descriptions. Word error rates will be tabulated separately for the three languages (Mandarin, English, Spanish) and for the two tasks (EPPS and BN). In addition to word/character error rates, the NCE/CER measures and the real-time factor will also be tabulated.

10. Schedule

21-Jan-07: ELDA sends audio test data to participants (with UEM files for EPPS)
28-Jan-07: Deadline for submitting results to ELDA
31-Jan-07: ELDA sends preliminary results to participants with reference STM files
14-Feb-07: ELDA sends final results to participants with final reference STM files
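As a complement to section 7, the sketch below illustrates the usual NIST definition of the normalized cross entropy on a toy example. The function name is ours and the official figures are those reported by sclite; the sketch assumes a mix of correct and incorrect words and confidences strictly between 0 and 1 (the degenerate cases make the logarithms blow up):

    import math

    def nce(confidences, correct):
        """Normalized cross entropy of per-word confidence scores.
        confidences: predicted probabilities that each output word is correct;
        correct: matching booleans obtained from the scoring alignment."""
        n = len(confidences)
        n_c = sum(correct)
        p_c = n_c / n                          # baseline: average correctness
        h_max = -(n_c * math.log2(p_c) + (n - n_c) * math.log2(1 - p_c))
        h_sys = -sum(math.log2(p) if ok else math.log2(1 - p)
                     for p, ok in zip(confidences, correct))
        return (h_max - h_sys) / h_max         # 1 is perfect; <= 0 is no better
                                               # than always predicting p_c

    # A reasonably well-calibrated system scores above 0:
    print(nce([0.9, 0.8, 0.2, 0.95], [True, True, False, True]))  # about 0.73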