TC-Star Evaluation Information (WP4)


End-to-End Evaluation - Run #2

 

Protocol

- 20 speeches of 3 minutes each

- for each speech: 1 TC-STAR version, 1 interpreter version

- 20 assessors, who evaluate Adequacy (comprehension test) and Fluency (subjective test)

The complete protocol can be found here.

Subjective test (questions translated from Spanish):

Criterion | Question | Scale
Understanding | Do you think you understood the message? | 1: No, not at all => 5: Yes, completely
Fluency | Is the system output fluent? | 1: No, it is very bad! => 5: Yes, it is perfect Spanish!
Effort | Rate the effort required while listening | 1: very high => 5: very low, it is natural speech
Overall Quality | Rate the overall quality of the translation system | 1: Very bad, unusable => 5: Very useful
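To make the protocol concrete, here is a minimal sketch (in Python) of how one assessor's subjective judgment could be recorded. It is our illustration of the scales above, not part of the TC-STAR tooling, and the type and field names are our own:

```python
from dataclasses import dataclass

# Illustrative record of one assessor's subjective judgment of one audio
# file; all four criteria use the 1-5 scales from the table above.
@dataclass
class SubjectiveJudgment:
    system: str           # "ITP" (interpreter) or "TC-STAR" (automatic)
    audio: int            # audio file identifier, 1..20
    understanding: int    # 1: not at all .. 5: yes, completely
    fluency: int          # 1: very bad .. 5: perfect Spanish
    effort: int           # 1: very high .. 5: very low, natural speech
    overall_quality: int  # 1: very bad, unusable .. 5: very useful

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale.
        for name in ("understanding", "fluency", "effort", "overall_quality"):
            score = getattr(self, name)
            if not 1 <= score <= 5:
                raise ValueError(f"{name} must be on the 1-5 scale, got {score}")
```

For example, `SubjectiveJudgment("ITP", 1, 5, 5, 4, 4)` corresponds to the first row of the subjective results table below.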

Data

Test data

Component | Input
ASR | ROVER
SLT | RWTH
TTS | ITP, UPC


Results

Preliminary results are available (access is restricted to participants only):

Subjective evaluation

All scores are on a 1-5 scale (1: very bad; 5: perfect). Some audio files received two sets of scores; each set is shown on its own row, and the means are taken over all rows.

System | Audio | Understanding | Fluency | Effort | Overall Quality
ITP | Audio 1 | 5 | 5 | 4 | 4
ITP | Audio 2 | 4 | 3 | 2 | 4
ITP | Audio 3 | 5 | 5 | 5 | 4
ITP | Audio 3 | 4 | 5 | 4 | 5
ITP | Audio 4 | 4 | 5 | 4 | 5
ITP | Audio 5 | 3 | 3 | 3 | 3
ITP | Audio 5 | 3 | 5 | 3 | 4
ITP | Audio 6 | 2 | 1 | 1 | 1
ITP | Audio 6 | 1 | 1 | 1 | 1
ITP | Audio 7 | 2 | 3 | 3 | 2
ITP | Audio 7 | 3 | 3 | 2 | 4
ITP | Audio 8 | 4 | 4 | 4 | 5
ITP | Audio 9 | 2 | 2 | 2 | 2
ITP | Audio 10 | 5 | 5 | 4 | 5
ITP | Audio 11 | 3 | 4 | 2 | 3
ITP | Audio 12 | 2 | 1 | 5 | 1
ITP | Audio 12 | 3 | 3 | 4 | 4
ITP | Audio 13 | 3 | 1 | 3 | 2
ITP | Audio 13 | 2 | 4 | 2 | 3
ITP | Audio 14 | 3 | 3 | 3 | 3
ITP | Audio 14 | 3 | 2 | 1 | 2
ITP | Audio 15 | 4 | 4 | 4 | 5
ITP | Audio 15 | 5 | 5 | 5 | 5
ITP | Audio 16 | 3 | 1 | 2 | 2
ITP | Audio 16 | 4 | 4 | 3 | 4
ITP | Audio 17 | 4 | 4 | 4 | 4
ITP | Audio 17 | 5 | 5 | 5 | 5
ITP | Audio 18 | 3 | 4 | 4 | 4
ITP | Audio 19 | 4 | 4 | 3 | 4
ITP | Audio 20 | 5 | 5 | 4 | 5
ITP | Audio 20 | 4 | 4 | 3 | 4
ITP | mean | 3.45 | 3.48 | 3.19 | 3.52
TC-STAR | Audio 1 | 3 | 1 | 2 | 2
TC-STAR | Audio 2 | 3 | 5 | 3 | 4
TC-STAR | Audio 2 | 1 | 1 | 1 | 1
TC-STAR | Audio 3 | 1 | 2 | 1 | 1
TC-STAR | Audio 4 | 1 | 2 | 1 | 2
TC-STAR | Audio 4 | 2 | 1 | 1 | 1
TC-STAR | Audio 5 | 3 | 2 | 1 | 2
TC-STAR | Audio 5 | 3 | 2 | 3 | 3
TC-STAR | Audio 6 | 3 | 1 | 2 | 1
TC-STAR | Audio 7 | 4 | 4 | 3 | 4
TC-STAR | Audio 8 | 4 | 3 | 2 | 2
TC-STAR | Audio 9 | 1 | 2 | 1 | 1
TC-STAR | Audio 9 | 2 | 1 | 1 | 1
TC-STAR | Audio 10 | 2 | 3 | 2 | 2
TC-STAR | Audio 11 | 4 | 3 | 2 | 4
TC-STAR | Audio 12 | 2 | 1 | 1 | 2
TC-STAR | Audio 13 | 3 | 1 | 1 | 1
TC-STAR | Audio 14 | 2 | 2 | 1 | 1
TC-STAR | Audio 14 | 1 | 1 | 1 | 1
TC-STAR | Audio 15 | 2 | 1 | 1 | 2
TC-STAR | Audio 16 | 3 | 2 | 3 | 2
TC-STAR | Audio 16 | 2 | 2 | 1 | 2
TC-STAR | Audio 17 | 2 | 1 | 1 | 1
TC-STAR | Audio 17 | 1 | 1 | 1 | 1
TC-STAR | Audio 18 | 3 | 2 | 2 | 3
TC-STAR | Audio 18 | 2 | 2 | 1 | 2
TC-STAR | Audio 19 | 2 | 2 | 1 | 2
TC-STAR | Audio 19 | 3 | 3 | 3 | 3
TC-STAR | Audio 20 | 3 | 2 | 1 | 2
TC-STAR | mean | 2.34 | 1.93 | 1.55 | 1.93
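As a sanity check on the reported means, here is a small Python sketch (ours, not part of the evaluation toolkit) that recomputes the ITP column means from the rows above; every score row counts once, including the second rows of doubly judged audio files:

```python
# ITP score rows from the table above, in order, as
# (understanding, fluency, effort, overall_quality) tuples;
# audio files with two score rows contribute two tuples.
itp_rows = [
    (5, 5, 4, 4), (4, 3, 2, 4), (5, 5, 5, 4), (4, 5, 4, 5), (4, 5, 4, 5),
    (3, 3, 3, 3), (3, 5, 3, 4), (2, 1, 1, 1), (1, 1, 1, 1), (2, 3, 3, 2),
    (3, 3, 2, 4), (4, 4, 4, 5), (2, 2, 2, 2), (5, 5, 4, 5), (3, 4, 2, 3),
    (2, 1, 5, 1), (3, 3, 4, 4), (3, 1, 3, 2), (2, 4, 2, 3), (3, 3, 3, 3),
    (3, 2, 1, 2), (4, 4, 4, 5), (5, 5, 5, 5), (3, 1, 2, 2), (4, 4, 3, 4),
    (4, 4, 4, 4), (5, 5, 5, 5), (3, 4, 4, 4), (4, 4, 3, 4), (5, 5, 4, 5),
    (4, 4, 3, 4),
]

criteria = ("Understanding", "Fluency", "Effort", "Overall Quality")
for name, column in zip(criteria, zip(*itp_rows)):
    print(f"{name}: {sum(column) / len(column):.2f}")
# Prints 3.45, 3.48, 3.19, 3.52 -- matching the ITP mean row above.
```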

Comprehension evaluation

 

All values are mean comprehension rates per audio file (0: bad; 1: good); "--" marks columns that were not evaluated for the interpreter condition.

System | Audio | E2E Evaluation | ITP / TTS | SLT | ASR | Only ITP
ITP | Audio 1 | 0.70 | 0.90 | -- | -- | 1.00
ITP | Audio 2 | 0.20 | 0.40 | -- | -- | 1.00
ITP | Audio 3 | 0.70 | 0.70 | -- | -- | 1.00
ITP | Audio 4 | 0.60 | 0.80 | -- | -- | 1.00
ITP | Audio 5 | 0.35 | 0.60 | -- | -- | 1.00
ITP | Audio 6 | 0.30 | 0.50 | -- | -- | 1.00
ITP | Audio 7 | 0.20 | 0.60 | -- | -- | 1.00
ITP | Audio 8 | 0.40 | 0.70 | -- | -- | 1.00
ITP | Audio 9 | 0.30 | 0.80 | -- | -- | 1.00
ITP | Audio 10 | 0.70 | 0.90 | -- | -- | 1.00
ITP | Audio 11 | 0.40 | 0.50 | -- | -- | 1.00
ITP | Audio 12 | 0.30 | 0.90 | -- | -- | 1.00
ITP | Audio 13 | 0.25 | 0.70 | -- | -- | 1.00
ITP | Audio 14 | 0.35 | 0.60 | -- | -- | 1.00
ITP | Audio 15 | 0.75 | 0.80 | -- | -- | 1.00
ITP | Audio 16 | 0.65 | 0.80 | -- | -- | 1.00
ITP | Audio 17 | 0.75 | 0.80 | -- | -- | 1.00
ITP | Audio 18 | 0.80 | 0.80 | -- | -- | 1.00
ITP | Audio 19 | 0.40 | 0.50 | -- | -- | 1.00
ITP | Audio 20 | 0.75 | 1.00 | -- | -- | 1.00
ITP | mean | 0.50 | 0.72 | -- | -- | 1.00
TC-STAR | Audio 1 | 0.80 | 1.00 | 1.00 | 1.00 | 1.00
TC-STAR | Audio 2 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00
TC-STAR | Audio 3 | 0.50 | 0.90 | 0.90 | 1.00 | 0.86
TC-STAR | Audio 4 | 0.55 | 0.90 | 0.90 | 0.90 | 0.88
TC-STAR | Audio 5 | 0.70 | 0.90 | 0.90 | 1.00 | 1.00
TC-STAR | Audio 6 | 0.70 | 0.90 | 0.90 | 0.90 | 1.00
TC-STAR | Audio 7 | 0.50 | 0.80 | 0.90 | 0.90 | 0.83
TC-STAR | Audio 8 | 0.80 | 0.90 | 0.90 | 1.00 | 0.88
TC-STAR | Audio 9 | 0.30 | 0.90 | 0.90 | 1.00 | 0.88
TC-STAR | Audio 10 | 0.50 | 0.50 | 0.60 | 0.60 | 0.56
TC-STAR | Audio 11 | 0.35 | 0.90 | 0.90 | 0.90 | 1.00
TC-STAR | Audio 12 | 0.50 | 0.90 | 0.90 | 0.90 | 1.00
TC-STAR | Audio 13 | 0.60 | 0.60 | 0.60 | 0.60 | 0.88
TC-STAR | Audio 14 | 0.55 | 0.60 | 0.60 | 0.70 | 0.67
TC-STAR | Audio 15 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
TC-STAR | Audio 16 | 0.60 | 0.70 | 1.00 | 1.00 | 1.00
TC-STAR | Audio 17 | 0.25 | 0.70 | 0.70 | 0.80 | 0.88
TC-STAR | Audio 18 | 0.65 | 0.80 | 0.90 | 0.90 | 1.00
TC-STAR | Audio 19 | 0.60 | 0.70 | 0.80 | 1.00 | 0.80
TC-STAR | Audio 20 | 0.40 | 0.90 | 0.90 | 1.00 | 0.90
TC-STAR | mean | 0.58 | 0.83 | 0.86 | 0.91 | 0.90
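Before the column-by-column explanation below, a minimal sketch (ours, not the project's scoring tool) of how a comprehension rate in this table arises: an assessor answers a set of questions about an audio file, and the score is the fraction answered correctly. The ten-question count in the example is a hypothetical illustration; the actual number of questions per audio is not stated on this page.

```python
# A comprehension score is the fraction of questions answered correctly
# for one audio file, so it lies between 0 (bad) and 1 (good).
def comprehension_score(answers_correct: list[bool]) -> float:
    return sum(answers_correct) / len(answers_correct)

# Hypothetical example: 7 of 10 questions answered correctly -> 0.70
print(comprehension_score([True] * 7 + [False] * 3))  # 0.7
```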

The columns show the following information:

- 2 evaluated systems: ITP for the interpreter version and TC-STAR for the automatic speech-to-speech translation system

- the identifier of the audio file (the interpreter and TC-STAR data correspond to the same source speech)

- E2E Evaluation: this evaluation was done by the same assessors who did the subjective evaluation.

- ITP / TTS: as it was not foreseen that results would be better for TC-STAR than for ITP, the audio files were validated to check whether they actually contained the answers to the questions. Two first conclusions can be drawn from this: it was difficult for the assessors to find the answers (were the questions too hard?), and since the interpreter selects and reformulates the information, omitting some details, some questions became too specific and therefore inappropriate.

- TTS, SLT, ASR: in order to determine where the information was lost in the TC-STAR pipeline, the files produced by each component (recognized files for ASR, translated files for SLT, synthesized files for TTS) were checked. Overall, 15% of the information was lost, about 5% at each step (see the sketch after this list).

- Only ITP: in the end, we kept only the questions whose answers were included in the interpreter files. Measured this way, the TC-STAR system lost 10% of the information relative to the ITP evaluation (instead of 15%).
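To make the loss figures concrete, here is a small sketch of one way to read them (our interpretation, not from the original report): taking the TC-STAR comprehension column means above (ASR 0.91, SLT 0.86, TTS 0.83) and assuming a perfect-source baseline of 1.00 (our assumption), the successive differences give per-step losses on the order of 5 points each and roughly 15% in total.

```python
# Read the per-step information loss off the TC-STAR comprehension
# column means above; the perfect-source baseline of 1.00 is assumed.
means = {"source": 1.00, "ASR": 0.91, "SLT": 0.86, "TTS": 0.83}

stages = list(means)
for prev, curr in zip(stages, stages[1:]):
    print(f"{prev} -> {curr}: {means[prev] - means[curr]:.0%} lost")
# source -> ASR: 9% lost; ASR -> SLT: 5% lost; SLT -> TTS: 3% lost.
# Total: 17 points, roughly the reported 15% overall at ~5% per step.
```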