TC-STAR 2007 ASR EVALUATION PLAN (08-Jan-07; updated on 15-Jan-2007)

1. Introduction

This document defines the ASR tasks, the evaluation conditions, the development and test corpora, the scoring procedures, and the agenda for the 2007 TC-STAR ASR evaluation. The 2007 ASR evaluation is carried out in conjunction with the 2007 TC-STAR SLT evaluation and takes the SLT requirements into account.

The 2007 evaluation plan supports ASR evaluation for three languages and two tasks:
- Parliament Sessions for English and Spanish
- Broadcast News for Mandarin Chinese

ASR English systems will be evaluated on recordings of the European Parliament Plenary Sessions (EPPS). ASR Spanish systems will be evaluated on EPPS data as well as on recordings of the Spanish National Parliament (CORTES). ASR Mandarin Chinese systems will be evaluated on broadcast emissions of the Voice of America radio.

Participants must submit a complete result for at least one of the three tasks (EPPS English, EPPS/CORTES Spanish, and BN Mandarin). All the segments of a given test set (there are three test sets) must be processed in their entirety using the same system, without any manual intervention, i.e. the same system and the same tuning must be used for the Spanish EPPS data and the CORTES data. Participants can submit as many contrastive results as they like. For the EPPS tasks, participants are strongly encouraged to submit systems trained on the restricted set of corpora.

For English and Spanish, word error rates will be computed under the following conditions:
- case sensitive with punctuation
- case sensitive without punctuation
- case insensitive without punctuation (primary scoring)

The case-sensitive and punctuation conditions are not mandatory but are recommended. For Mandarin Chinese, the character error rate will be computed with and without punctuation.
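To make the three conditions concrete, the sketch below shows one possible normalization of hypothesis tokens. This is an illustration only, not the official scoring pipeline (which uses the NIST tools and GLM filtering described in the scoring section); the function and parameter names are ours. It assumes punctuation marks appear as standalone tokens, as in the CTM convention described later.

    import string

    # Punctuation marks emitted as standalone tokens (cf. the CTM example below).
    PUNCT = set(string.punctuation)

    def normalize(tokens, case_sensitive=False, keep_punctuation=False):
        """Map hypothesis tokens to one of the three scoring conditions."""
        if not keep_punctuation:
            # Drop standalone punctuation tokens such as "." or ","
            tokens = [t for t in tokens if t not in PUNCT]
        if not case_sensitive:
            tokens = [t.lower() for t in tokens]
        return tokens

    # Primary scoring condition: case insensitive, without punctuation.
    hyp = ["Who", "is", "with", "me", "today", "."]
    print(normalize(hyp))   # ['who', 'is', 'with', 'me', 'today']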
2. The ASR tasks and languages

The TC-STAR'07 evaluation will include data sets in English, Spanish and Mandarin. Two tasks are supported for this evaluation, namely the Parliament Session task for English (EPPS) and Spanish (EPPS and Spanish Parliament) and the Broadcast News (BN) task for Mandarin. The following test sets will be used for the TC-STAR'07 evaluation:
- European Parliament Plenary Sessions in English (EPPS_EN): a 3h test set from June to September 2006, containing only original speeches (i.e. not the interpreters' speech).
- European Parliament Plenary Sessions in Spanish (EPPS_SP), provided by ELDA: a 3h test set from June to September 2006, containing only original speeches.
- National Parliament Sessions in Spanish (PARL_SP), provided by ELDA: a 3h test set from June 2006.
- Broadcast News in Mandarin (BN_MAN), with automatic segmentation: about 3h of VOA material taken from the LDC TDT3 corpus, from December 1998.

Participants can build systems for any processing speed; there are no specific speed categories. However, participants must report the total time (wall-clock elapsed time) needed to process the data for each submitted system. The real-time factor will be included in the tabulated results along with the word/character error rates.

3. Processing rules

The evaluated systems must be fully automatic, requiring no manual intervention that has an impact on the system output. Systems will be provided with audio files (16 kHz sampling rate, 16-bit samples, mono) in a standard format. Unless specified in this document, no other information about the test data can be used. Supervised model adaptation on the test data is not allowed. Data material (audio, texts, etc.) generated after the training cut-off date (or during the blackout period) cannot be used for system training or development (see the evaluation schedule), with the exception of the official development data. For broadcast news, the show identity and the broadcast date are allowable side information that systems may use. For the English and Spanish tasks, all training material must predate May 31st, 2006. For the Mandarin broadcast news task, developers cannot use any training data from the month of December 1998.

No manual segmentations will be provided. A NIST Unpartitioned Evaluation Map (UEM) file will be provided for EPPS English and EPPS/CORTES Spanish.

Participants can use data collected within TC-STAR and any publicly available data (essentially from LDC and ELDA) predating the training cut-off date (see the schedule below) for system development. They can also use any other data they may find useful, under the condition that this data predates the cut-off date (or does not fall in the blackout period for BN in Mandarin) and that they also submit results for a system using only TC-STAR and/or public data. In all cases, the training data should be fully and unambiguously documented in the system description. It follows that for the EPPS tasks three training conditions can be distinguished:
1) a restricted training condition, for which systems must be trained only on data collected within the TC-STAR project and listed in the next section;
2) a public data condition, for which systems can be trained on any publicly available data;
3) an open condition, where the only constraint concerns the cut-off date of the training data.

Participants must submit a complete result for at least one of the three languages (English, Spanish and Mandarin). They can submit as many contrastive results as they like. For the Parliament tasks, participants are strongly encouraged to submit systems trained on the restricted set of corpora.
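Since no manual segmentation is provided, the UEM file is the official description of the regions to be processed. For illustration, assuming the usual NIST UEM layout (one scoreable region per line: file identifier, channel, start and end times in seconds, with ";;" introducing comment lines), such a file can be read as follows (the function name is ours):

    def read_uem(path):
        """Return a list of (file_id, channel, start, end) scoreable regions."""
        regions = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith(";;"):  # skip blanks and comments
                    continue
                file_id, channel, start, end = line.split()
                regions.append((file_id, int(channel), float(start), float(end)))
        return regions

    # Each region is to be processed automatically; anything outside
    # these bounds is excluded from scoring.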
4. Data for the 2007 evaluation

There is no new development data for 2007. Therefore previous development and evaluation data (dev06, eval06, dev05, eval05) can be used for system development. The following table summarizes the main attributes of the development and evaluation data for the TC-STAR 2007 evaluation.

+==========+==========+========+===============+========+===========+
| Language | DataType | Domain | Epoch         | Amount | Delivery  |
+==========+==========+========+===============+========+===========+
| English  | Dev      | EPPS   | Oct04;Nov04;  | 12h    | available |
|          |          |        | Jun05;Sept05  |        |           |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Dev      | EPPS   | Oct04;Nov04;  | 12h    | available |
|          |          |        | Jun05;Sept05; |        |           |
|          |          |        | Oct05;Nov05   |        |           |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Dev      | PARL   | Dec04;Nov05   | 6h     | available |
+----------+----------+--------+---------------+--------+-----------+
| Mandarin | Dev      | BN     | Dec 1998      | 12h    | available |
+==========+==========+========+===============+========+===========+
| English  | Eval     | EPPS   | Jun06-Sept06  | 3h     | 21 Jan 07 |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Eval     | EPPS   | Jun06-Sept06  | 3h     | 21 Jan 07 |
+----------+----------+--------+---------------+--------+-----------+
| Spanish  | Eval     | PARL   | Jun06-Sept06  | 3h     | 21 Jan 07 |
+----------+----------+--------+---------------+--------+-----------+
| Mandarin | Eval     | BN     | Dec 98        | 3h     | 21 Jan 07 |
+==========+==========+========+===============+========+===========+

The 2006 EPPS development and test data comprise ONLY original speeches and NOT the translated speech from interpreters. The 2005 EPPS development and test data include both interpreters' and politicians' speeches. The 2007 test data is of the same kind as the 2006 data: it includes ONLY the politicians' original speeches.

The following is a non-exhaustive list of corpora that participants may want to use to train their acoustic models:
- English TC-STAR EPPS, about 100h of transcribed data (May 2004 - Jan 2005)
- English TC-STAR EPPS, about 200h of untranscribed data (Jan 2005 - May 2006)
- English LDC 1995 (CSR-IV Hub 4 Marketplace LDC96S31), 1996, 1997, official NIST Hub4 training sets, LDC97S44 and LDC98S71, USC Marketplace Broadcast News Speech (LDC99S82)
- English LDC TDT2 and TDT3 data with closed captions, about 2000h, LDC99S84 and LDC2001S94
- English LDC Switchboard 1, 2-I, 2-II, 2-III, LDC97S62, LDC98S75, LDC99S79
- English LDC CallHome, LDC97S42
- English LDC Meeting corpora: ICSI LDC2004S02, ISL LDC2004S05, NIST LDC2004S09
- Spanish TC-STAR EPPS, about 60h of transcribed data (May 2004 - Jan 2005)
- Spanish TC-STAR EPPS, about 200h of untranscribed data (Jan 2005 - May 2006)
- Spanish TC-STAR CORTES, about 40h of transcribed data
- Spanish LDC 1997, BN speech (Hub4-NE), LDC98S74
- Spanish LDC CallHome, LDC96S35
- Mandarin LDC 1997, BN speech (Hub4-NE), about 30h of transcribed data, LDC98S73
- Mandarin TDT2 and TDT3 data with quick transcriptions, LDC2001S93 and LDC2001S95

The following corpora can be used for language model development:
- All transcriptions (detailed, quick or closed-caption) of the above-mentioned audio corpora
- English EPPS final transcriptions, about 36M words (from parallel texts)
- English UK Parliament text corpus, about 40M words (HANSARD)
- English LDC NAB text corpus
- English LDC Gigaword (over 1 billion words)
- Spanish EPPS final transcriptions, about 36M words (from parallel texts)
- Mandarin LDC news text, about 250 million GB-encoded text characters
- Mandarin LDC Gigaword, about 1.1 billion words

For more complete listings of possible corpora, participants are referred to the LDC and ELRA catalogs.

For the EPPS tasks, participants are encouraged to submit systems trained on only a restricted set of training corpora including:
- English TC-STAR EPPS, about 100h of transcribed data and 200h of untranscribed data
- English EPPS final transcriptions, about 36M words (from parallel texts)
- Spanish TC-STAR EPPS, about 60h of transcribed data and 200h of untranscribed data
- Spanish TC-STAR CORTES, about 40h of transcribed data
- Spanish EPPS final transcriptions, about 36M words (from parallel texts)
- Spanish Parliament transcriptions from 1979 to October 15th, 2004

More information is available at: http://www.elda.org/en/proj/tcstar-wp4/tcs-asr-run3.htm

5. System outputs

For each input audio file, the ASR hypotheses are to be formatted as a NIST CTM file, i.e. the concatenation of time-mark records, one per line (Unix text file), for each hypothesized word or punctuation mark. System outputs should be case sensitive, include punctuation marks and must use the UTF-8 encoding scheme. If punctuation marks are provided, each should appear on a separate line with the time code of the end of the previous word and with a duration of 0. Here is an example of a CTM file with a punctuation mark:

20050907_0900_1235_OR_SAT 1 322.768 0.120 who
20050907_0900_1235_OR_SAT 1 322.889 0.118 is
20050907_0900_1235_OR_SAT 1 323.011 0.194 with
20050907_0900_1235_OR_SAT 1 323.207 0.140 me
20050907_0900_1235_OR_SAT 1 323.353 0.470 today
20050907_0900_1235_OR_SAT 1 323.823 0.00 .

Systems are expected to use a single standardized spelling for each language. However, some filtering and mapping will be applied to the system output prior to scoring in order to take into account acceptable common alternate forms. Both American English and British English spellings will be allowed. In addition to reference dictionaries, the Internet may be searched to find the most common form of a word (usually a proper name). If no form is dominant, then more than one form will be allowed (cf. the GLM table in the scoring section). The system may use an optional hyphen to indicate the missing (unspoken) part of a word token. Filled-pause markers and non-speech markers should not be included in the system output for scoring; however, participants are encouraged to provide this information, which may be useful for the SLT systems.
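The zero-duration convention for punctuation is easy to get wrong, so here is a small illustrative sketch of one way to emit conforming records (the helper name and argument layout are ours, not part of the CTM specification):

    def write_ctm(path, file_id, channel, words):
        """words: list of (start_time, duration, token) tuples in time order.
        Punctuation tokens are passed with start_time=None and attached to
        the end time of the preceding word, with a duration of 0."""
        with open(path, "w", encoding="utf-8") as out:  # CTM files must be UTF-8
            prev_end = 0.0
            for start, dur, token in words:
                if start is None:                       # punctuation mark
                    start, dur = prev_end, 0.0
                else:
                    prev_end = start + dur
                out.write(f"{file_id} {channel} {start:.3f} {dur:.3f} {token}\n")

    # Reproduces the example above:
    write_ctm("example.ctm", "20050907_0900_1235_OR_SAT", 1,
              [(322.768, 0.120, "who"), (322.889, 0.118, "is"),
               (323.011, 0.194, "with"), (323.207, 0.140, "me"),
               (323.353, 0.470, "today"), (None, None, ".")])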
6. Scoring

A NIST Segment Time Marked (STM) reference file will be provided for the development and test sets (after the system submission). Following NIST practice, contractions will be expanded in the STM file, i.e. the annotator will choose the single most likely expansion for each contraction. Non-scoreable regions (such as untranscribed areas and overlapping speech) will be explicitly tagged in the STM file for exclusion from scoring.

Prior to scoring, a global mapping will be performed on both the reference and the system outputs via a set of rules specified in a global map (GLM) file. The GLM rules expand contractions and split compound words in the system output into all possible expanded forms. Following NIST practice, optionally deletable tokens in the STM file may be omitted by the speech recognizer. These tokens contribute to the count of reference tokens whether or not the system outputs them.

The CTM and STM files will be aligned (using dynamic programming) so as to minimize the word/character error rate. Scoring will be done using the NIST speech recognition scoring toolkit available at http://www.nist.gov/speech/tools. Specific filtering tables and GLM files will be developed for TC-STAR (one set per language). The primary scoring will be case insensitive. A hyphen within a token will be treated as a token separator.
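The alignment criterion can be illustrated with the classical dynamic-programming edit distance below. This is only a sketch of the principle, not a replacement for the NIST toolkit (which also applies GLM filtering and handles optionally deletable tokens); it operates on already-normalized token sequences:

    def word_error_rate(ref, hyp):
        """Minimum (substitutions + deletions + insertions) / len(ref),
        computed by dynamic programming over the two token sequences."""
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                            # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                            # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("who is with me today".split(),
                          "who was with me".split()))  # 2 errors / 5 words = 0.4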
7. Enriched system output

Participants should also provide (this is not required) a confidence score for each hypothesized word in the CTM file. This confidence score represents the system's estimate of the probability that the output token is correct. The quality of the confidence scores will be evaluated using the normalized cross entropy (NCE) score as reported by the NIST sclite tool. The confidence error rate (CER) will also be computed and reported.

In addition to the CTM file, participants are encouraged to provide n-best hypotheses and/or word graphs to be used by the SLT systems. Sites that plan to provide n-best hypotheses or word lattices must also provide these outputs for the development set in order to resolve issues related to file formats, vocabulary compatibility, segmentation, and decoding parameters. As these issues are not expected to be resolved for all provider-user pairs, ASR and SLT participants should team up to resolve them. Interfaces between ASR and SLT are not limited to n-best hypotheses and word lattices, so participants may consider alternative solutions for within-site and cross-site integration.

8. Processing time

Even though processing speed is not a major issue for the 2007 evaluation, participants must provide information about the processing time and the resources (memory, CPU type, clock frequency) used to run the ASR systems. This should be included in each system description. Participants should report elapsed time (i.e. not the real-time factor) for all steps if possible. ELDA will compute the processing speed as the ratio of the processing time to the official duration of the recorded audio data. The processing time is the total amount of elapsed time used to process the data on a single CPU, including I/O and all operations performed after first accessing the test data.

9. Result submission

The recorded waveform files to be processed will be distributed on CD-ROM along with the unpartitioned evaluation map (UEM) segmentation files for the EPPS tasks. For each submission, participants should send a compressed tar file to ELDA including the CTM files and the associated system descriptions. Word error rates will be tabulated separately for the three languages (Mandarin, English, Spanish) and for the two tasks (EPPS and BN). In addition to word/character error rates, the NCE/CER measures and the real-time factor will also be tabulated.

10. Schedule

21-Jan-07: ELDA sends audio test data to participants (with UEM files for EPPS)
28-Jan-07: Deadline for submitting results to ELDA
31-Jan-07: ELDA sends preliminary results to participants with reference STM files
14-Feb-07: ELDA sends final results to participants with final reference STM files
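As a complement to section 7, the sketch below illustrates the usual NIST definition of the normalized cross entropy on a toy example. The function name is ours and the official figures are those reported by sclite; the sketch assumes a mix of correct and incorrect words and confidences strictly between 0 and 1 (the degenerate cases make the logarithms blow up):

    import math

    def nce(confidences, correct):
        """Normalized cross entropy of per-word confidence scores.
        confidences: predicted probabilities that each output word is correct;
        correct: matching booleans obtained from the scoring alignment."""
        n = len(confidences)
        n_c = sum(correct)
        p_c = n_c / n                          # baseline: average correctness
        h_max = -(n_c * math.log2(p_c) + (n - n_c) * math.log2(1 - p_c))
        h_sys = -sum(math.log2(p) if ok else math.log2(1 - p)
                     for p, ok in zip(confidences, correct))
        return (h_max - h_sys) / h_max         # 1 is perfect; <= 0 is no better
                                               # than always predicting p_c

    # A reasonably well-calibrated system scores above 0:
    print(nce([0.9, 0.8, 0.2, 0.95], [True, True, False, True]))  # about 0.73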