TC-Star Evaluation Information (WP4)


ASR Evaluation - Run #3

 

Update History

2006-09-04 ASR Run#3 web page released
2006-12-06 List of audio files for training in restricted condition added (English)
2007-01-03 List of audio files for training in restricted condition added (Spanish)
2007-01-10 Official evaluation plan and submission protocol updated
2007-01-10 HANSARD text corpus available
2007-01-17 Official evaluation plan and submission protocol updated
2007-01-28 Submissions table added
2007-01-31 Preliminary results added
2007-02-19 Final results for English
2007-02-20 English Results updated
2007-03-01 Final results for En, Es, Zh
2007-03-21 2005/2006 Systems results added

TC-STAR Evaluation Run #3 for ASR will take place from Jan. 21, 2007 to Jan. 28, 2007. The development data is already available (see the Development Data section). The complete schedule can be seen here; the key dates for ASR are:

  • Jan. 21, 2007: ELDA sends the test data to participants
  • Jan. 28, 2007 (23h59 Central European Time): deadline for submitting results to ELDA
  • Jan. 31, 2007: ELDA sends preliminary results to participants with the reference
  • Feb. 14, 2007: ELDA sends FINAL results to participants with the final reference

The 2007 evaluation protocol is now available.

Here is the submission protocol for sending your results.

Before the evaluation run proper, participants have access to training data and development data consisting of audio files and transcriptions. See the Training data and Development Data sections below.

The ASR evaluation will be run in three languages: English, Spanish and Mandarin Chinese.

English is run on recordings from the European Parliament Plenary Sessions (EPPS).

Spanish is run on recordings from the European Parliament Plenary Sessions (EPPS) and from the Spanish Parliament (CORTES).

Mandarin Chinese is run on recordings from Voice of America.


ASR Participants

 

  EPPS ENGLISH EPPS SPANISH CORTES SPANISH Broadcast news Mandarin Punctuation
LIMSI X X X X Yes
UKA X X ?
IBM X X X ?
IRST X X X Yes
UPC X X ?
RWTH X X X Yes

This table is subject to confirmation by the TC-STAR partners.

External Participants

  EPPS ENGLISH EPPS SPANISH CORTES SPANISH Broadcast news Mandarin Punctuation
ATR X X ?
LIUM X X X
DAEDALUS X X X
UPV/EHU X X

ATR: Advanced Telecommunications Research Institute International, Japan

LIUM: Laboratoire d'Informatique de l'Université du Maine, France

DAEDALUS: Data, Decisions and Language, S.A., Spain

UPV/EHU: Universidad del País Vasco, Spain


 

Submissions

 

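Site | English | Spanish | Mandarin
ATR | - | - | -
DAEDALUS | - | 2P | -
IBM | 1O+1P+1R | 1O+1R | -
ITC-irst | 4P+1R | 2P+2R | -
LIMSI | 1P* | 1R* | 1O
LIUM | 1P+1R+1P* | 1R | -
RWTH | 2P+2R | 2R | -
UKA | 6P | - | 1O
UPC | - | 1R | -
UPV | - | - | -
TC-STAR | 1P* | 2P* | -

O = Open; P = Public; R = Restricted

* = late submission

- = no submission listed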
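The submission codes above combine a count, a condition letter and an optional asterisk (for example `1P+1R+1P*`). As a minimal sketch of how such a cell could be read programmatically, the following helper splits a cell into on-time and late counts per condition. The function name `parse_submissions` and the treatment of an empty or `-` cell as "no submission" are assumptions made for illustration; they are not part of the evaluation protocol.

```python
import re
from collections import Counter

# Condition letters as given in the table legend.
CONDITIONS = {"O": "Open", "P": "Public", "R": "Restricted"}


def parse_submissions(cell):
    """Split a cell such as '1P+1R+1P*' into (on_time, late) counters.

    Hypothetical helper: an empty cell or '-' is treated as no submission.
    """
    on_time, late = Counter(), Counter()
    cell = cell.strip()
    if not cell or cell == "-":
        return on_time, late
    for term in cell.split("+"):
        match = re.fullmatch(r"(\d+)([OPR])(\*?)", term.strip())
        if match is None:
            raise ValueError(f"unrecognised term: {term!r}")
        count = int(match.group(1))
        condition = CONDITIONS[match.group(2)]
        # A trailing asterisk marks a late submission.
        target = late if match.group(3) else on_time
        target[condition] += count
    return on_time, late


if __name__ == "__main__":
    on_time, late = parse_submissions("1P+1R+1P*")
    print(dict(on_time), dict(late))
```

Keeping late submissions in a separate counter makes it easy to total only the systems that met the Jan. 28 deadline.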

Results

 

Results
System descriptions
Reference files
eval07en_system_descriptions.tgz
eval07es_system_descriptions.tgz
eval07zh_system_descriptions.tgz

 

 

ASR Resources

Training data

The TC-STAR 2007 audio training corpus consists of transcribed and untranscribed data. The total amount of data is about 300 hours for English and 330 hours for Spanish. For Mandarin, no specific training data was produced within TC-STAR.

For the RESTRICTED CONDITION, the audio material that can be used for training is listed here:

For Spanish: EPPS07ES_TRAIN

For English: EPPS07EN_TRAIN

The HANSARD text corpus consists of debates of the U.K. Parliament from Nov 1999 to May 2006 and can be used for language models (see README).

 

 

 

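Corpus | Language | Transcribed (Politicians) | Transcribed (Interpreters) | Total transcribed | Untranscribed | Total
EPPS | English | 21 h | 70 h | 101 h | 200 h | 301 h
EPPS | Spanish | 10 h | 51 h | 100 h | 230 h | 330 h
PARL | Spanish | 38 h | 0 | | |

The Spanish totals are given once, for EPPS and PARL combined.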

The following table gives a list of other resources helpful for training purposes.

Language | Reference | Amount | IPR-owner | IPR-distrib | IPR-granted use | IPR-royalty | Actors / comments
Training
Zh | Mandarin 1997 BN (Hub4-NE) LDC98S73 (audio) & LDC98T24 (transcr) | ~30h | ? | LDC | research | LDC membership 98 required | -
Zh | Mandarin 2001 Call (Hub5) LDC98S69, LDC98T26 (transcr) | ~40h | ? | LDC | research | LDC membership 98 required | -
Zh | Mandarin TDT2 LDC2001S93 & LDC2001T57 (transcr) | - | ? | LDC | research | LDC membership 01 required | -
Zh | Mandarin TDT3 LDC2001S95 & LDC2001T58 | - | ? | LDC | research | LDC membership 01 required | BLACKOUT ON DEC 98!!!
Zh | Mandarin Chinese News Text LDC95T13 | 250M words | ? | LDC | research | LDC membership 95 required | -
Zh | Mandarin CALLHOME LDC96S34, LDC96T16 (transcr) | - | ? | LDC | research | LDC membership 96 required | -
Zh | Chinese Gigaword LDC2003T09 | 1.1G words | ? | LDC | research | LDC membership 03 required | -
Zh | Hong Kong News Parallel Text LDC2000T46 (Zh/En) | 18147 articles | ? | LDC | research | LDC membership 00 required | -
Es | EPPS_SP (text): Apr 1996 - May 2006 | >36M words | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH
Es | EPPS Verbatim transcriptions May 2004 - January 2005 | 102h | - | - | - | - | Transcribed by UPC
Es | EPPS untranscribed data February 2005 - May 2006 | 160h | - | - | - | - | -
Es | TC-STAR_P Spanish BN | 10h transcribed | ? | UPC | research | free in TCSTAR | Provided to TCSTAR by UPC
Es | Spanish LDC 1997, BN speech (Hub4-NE), LDC98S74 | - | ? | LDC | research | LDC 98 membership required | -
Es | Spanish LDC CallHome, LDC96S35 | - | ? | LDC | research | LDC 96 membership required | -
En | EPPS_EN (text): Apr 1996 - May 2006 | >36M words | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH
En | EPPS Verbatim transcriptions May 2004 - January 2005 | 100h | - | - | - | - | Transcribed by RWTH
En | EPPS untranscribed data February 2005 - May 2006 | 215h | - | - | - | - | -
En | TC-STAR_P English BN | 10h transcribed | RFI | ELRA | research | free in TCSTAR | Distributed by ELDA
En | English LDC 1995 (CSR-IV Hub 4 Marketplace LDC96S31), 1996, 1997, official NIST Hub4 training sets, LDC97S44 and LDC98S71, USC Marketplace Broadcast News Speech (LDC99S82) | - | ? | LDC | research | LDC 96, 98 and 99 membership required | -
En | English LDC TDT2 and TDT3 data with closed-captions, about 2000h, LDC99S84 and LDC2001S94 | - | ? | LDC | research | LDC 99 and 01 membership required | -
En | English LDC Switchboard 1, 2-I, 2-II, 2-III, LDC97S62, LDC98S75, LDC99S79 | - | ? | LDC | research | LDC 98, 98 and 99 membership required | -
En | English LDC Callhome, LDC97S42, LDC2004S05, LDC2004S09 | - | ? | LDC | research | LDC 97 and 04 membership required | -
En | English LDC Meeting corpora, ICSI LDC2004S02, ISL LDC2004S05, NIST LDC2004S09 | - | ? | LDC | research | LDC 04 membership required | -
En | HANSARD TEXT CORPUS | 48 M words | - | ELRA | research | - | -


Development Data

To get the corresponding audio files on DVD, contact Djamel Mostefa.

The EPPS verbatim transcriptions are shared with the SLT evaluation.

Data Set Files

EPPS English Verbatim Transcriptions:

20050606_1700_1915_OR_SAT
20050607_0900_1215_OR_SAT
20050607_1500_1840_OR_SAT
20050608_0900_1300_OR_SAT
20050608_1505_1815_OR_SAT
20050609_1000_1205_OR_SAT
20050609_1500_1650_OR_SAT

English development package version 3

Validation report from SPEX

Statistics

EPPS Spanish Verbatim Transcriptions:

20050606_1700_1915_OR_SAT
20050607_0900_1215_OR_SAT
20050607_1500_1840_OR_SAT
20050608_0900_1300_OR_SAT
20050608_1505_1815_OR_SAT
20050609_1000_1205_OR_SAT
20050609_1500_1650_OR_SAT
20050704_1705_1915_OR_SAT
20050705_0900_1130_OR_SAT
20050705_1505_1920_OR_SAT
20050706_0900_1230_OR_SAT
20050706_1500_1755_OR_SAT
20050707_1000_1215_OR_SAT
20050707_1545_1750_OR_SAT

CORTES Spanish Parliament:

PARL_041201_01_ES
PARL_041201_02_ES
PARL_041201_03_ES
PARL_041201_04_ES
PARL_041201_05_ES
PARL_041202_01_ES

Spanish development package version 4

 

Validation report from SPEX

Statistics
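The EPPS file names listed above appear to follow a `date_start_end_suffix` pattern, for example `20050606_1700_1915_OR_SAT`. As a rough sketch, and assuming that the first field is a date (YYYYMMDD) and the next two are the start and end times of the session slot (HHMM), the names can be decoded as below. This reading of the fields is an assumption, not something stated on this page; the remaining suffix (`OR_SAT`) is kept as an opaque string, and `parse_epps_name` is a hypothetical helper.

```python
from datetime import datetime


def parse_epps_name(name):
    """Decode an EPPS file name such as '20050606_1700_1915_OR_SAT'.

    Assumes (not confirmed by the page) the layout
    YYYYMMDD_HHMM_HHMM_<suffix>, with both times on the same day.
    """
    date_part, start_part, end_part, *suffix = name.split("_")
    start = datetime.strptime(date_part + start_part, "%Y%m%d%H%M")
    end = datetime.strptime(date_part + end_part, "%Y%m%d%H%M")
    return {
        "start": start,
        "end": end,
        # Nominal slot length in minutes, not the transcribed audio duration.
        "minutes": int((end - start).total_seconds() // 60),
        "suffix": "_".join(suffix),
    }


if __name__ == "__main__":
    info = parse_epps_name("20050606_1700_1915_OR_SAT")
    print(info["start"].date(), info["minutes"], info["suffix"])
```

The slot length derived this way is only the nominal session window; the amount of transcribed speech per file is given in the statistics, not in the file name.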


 

Data statistics

Here are some statistics about the development and test sets for English and Spanish.

Statistics on the English dev set

Measure | TOTAL | MALE NATIVE | MALE NONNATIVE | FEMALE NATIVE | FEMALE NONNATIVE
# Speakers | 41 | 26 | 6 | 6 | 3
Duration | 3h | 31.34% | 43.92% | 18.75% | 5.99%
Perplexity | 20.51 | | | |

Statistics on the Spanish dev set

Measure | TOTAL | MALE SPEAKERS | FEMALE SPEAKERS
# Speakers | 61 | 44 | 17
Duration | 5.8h | 79.69% | 20.31%
Perplexity | 34.12 | |
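The duration rows above give a total in hours and per-group shares in percent. As a small worked example, the shares can be converted into approximate hours per group. The group labels below follow the column order in which the values appear on this page, and the totals (3h and 5.8h) are rounded figures, so the results are estimates only; `shares_to_hours` is a hypothetical helper.

```python
def shares_to_hours(total_hours, shares_percent):
    """Convert percentage shares of a total duration into hours per group."""
    return {
        group: total_hours * percent / 100.0
        for group, percent in shares_percent.items()
    }


# Values copied from the dev-set statistics; labels follow the column order.
ENGLISH_SHARES = {
    "male native": 31.34,
    "male nonnative": 43.92,
    "female native": 18.75,
    "female nonnative": 5.99,
}
SPANISH_SHARES = {"male": 79.69, "female": 20.31}

if __name__ == "__main__":
    for label, total, shares in (
        ("English", 3.0, ENGLISH_SHARES),
        ("Spanish", 5.8, SPANISH_SHARES),
    ):
        for group, hours in shares_to_hours(total, shares).items():
            print(f"{label} {group}: {hours:.2f} h")
```

In both tables the shares sum to 100%, so the per-group hours add back up to the stated total.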
