RSS twitter Login
Home Contact Login

LREC 2020 Paper Dissemination (9/10)

Share this page!
twitter google-plus linkedin share

LREC 2020 was not held in Marseille this year and only the Proceedings were published.

The ELRA Board and the LREC 2020 Programme Committee now feel that those papers should be disseminated again, in a thematic-oriented way, shedding light on specific “topics/sessions”.

Packages with several sessions will be disseminated every Tuesday for 10 weeks, from Nov 10, 2020 until the end of January 2021.

Each session displays papers’ title and authors, with corresponding abstract (for ease of reading) and url, in like manner as the Book of Abstracts we used to print and distribute at LRECs.

We hope that you discover interesting, even exciting, work that may be useful for your own research.

Group of papers sent on January 19, 2021

Links to each session


Sign Language Recognition and Generation

What Comes First: Combining Motion Capture and Eye Tracking Data to Study the Order of Articulators in Constructed Action in Sign Language Narratives

Tommi Jantunen, Anna Puupponen and Birgitta Burger

We use synchronized 120 fps motion capture and 50 fps eye tracking data from two native signers to investigate the temporal order in which the dominant hand, the head, the chest and the eyes start producing overt constructed action from regular narration in seven short Finnish Sign Language stories. From the material, we derive a sample of ten instances of regular narration to overt constructed action transfers in ELAN which we then further process and analyze in Matlab. The results indicate that the temporal order of articulators shows both contextual and individual variation but that there are also repeated patterns which are similar across all the analyzed sequences and signers. Most notably, when the discourse strategy changes from regular narration to overt constructed action, the head and the eyes tend to take the leading role, and the chest and the dominant hand tend to start acting last. Consequences of the findings are discussed.


LSF-ANIMAL: A Motion Capture Corpus in French Sign Language Designed for the Animation of Signing Avatars

Lucie Naert, Caroline Larboulette and Sylvie Gibet

Signing avatars allow deaf people to access information in their preferred language using an interactive visualization of the sign language spatio-temporal content. However, avatars are often procedurally animated, resulting in robotic and unnatural movements, which are therefore rejected by the community for which they are intended. To overcome this lack of authenticity, solutions in which the avatar is animated from motion capture data are promising. Yet, the initial data set drastically limits the range of signs that the avatar can produce. Therefore, it can be interesting to enrich the initial corpus with new content by editing the captured motions. For this purpose, we collected the LSF-ANIMAL corpus, a French Sign Language (LSF) corpus composed of captured isolated signs and full sentences that can be used both to study LSF features and to generate new signs and utterances. This paper presents the precise definition and content of this corpus, technical considerations relative to the motion capture process (including the marker set definition), the post-processing steps required to obtain data in a standard motion format and the annotation scheme used to label the data. The quality of the corpus with respect to intelligibility, accuracy and realism is perceptually evaluated by 41 participants including native LSF signers.


Sign Language Recognition with Transformer Networks

Mathieu De Coster, Mieke Van Herreweghe and Joni Dambre

Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction using OpenPose for human keypoint estimation and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art of sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation.


Annotating a Fable in Italian Sign Language (LIS)

Serena Trolvi and Rodolfo Delmonte

This paper introduces work carried out for the automatic generation of a written text in Italian starting from glosses of a fable in Italian Sign Language (LIS). The paper gives a brief overview of sign languages (SLs) and some peculiarities of SL fables such as the use of space, the strategy of Role Shift and classifiers. It also presents the annotation of the fable “The Tortoise and the Hare” - signed in LIS and made available by Alba Cooperativa Sociale -, which was annotated manually by first author for her master’s thesis. The annotation was the starting point of a generation process that allowed us to automatically generate a text in Italian starting from LIS glosses. LIS sentences have been transcribed with Italian words into tables on simultaneous layers, each of which contains specific linguistic or non-linguistic pieces of information. In addition, the present work discusses problems encountered in the annotation and generation process.


HamNoSyS2SiGML: Translating HamNoSys Into SiGML

Carolina Neves, Luísa Coheur and Hugo Nicolau

Sign Languages are visual languages and the main means of communication used by Deaf people. However, the majority of the information available online is presented through  written form. Hence, it is not of easy access to the Deaf community. Avatars that can animate sign languages have gained an increase of interest in this area due to their flexibility in the process of generation and edition. Synthetic animation of conversational agents can be achieved through the use of notation systems. HamNoSys is one of these systems, which describes movements of the body through symbols. Its XML-compliant, SiGML, is a machine-readable input of HamNoSys able to animate avatars. Nevertheless, current tools have no freely available open source libraries that allow the conversion from HamNoSys to SiGML. Our goal is to develop a tool of open access, which can perform this conversion independently from other platforms. This system represents a crucial intermediate step in the bigger pipeline of animating signing avatars. Two cases studies are described in order to illustrate different applications of our tool.


Dicta-Sign-LSF-v2: Remake of a Continuous French Sign Language Dialogue Corpus and a First Baseline for Automatic Sign Language Processing

Valentin Belissen, Annelies Braffort and Michèle Gouiffès

While the research in automatic Sign Language Processing (SLP) is growing, it has been almost exclusively focused on recognizing lexical signs, whether isolated or within continuous SL production. However, Sign Languages include many other gestural units like iconic structures, which need to be recognized in order to go towards a true SL understanding. In this paper, we propose a newer version of the publicly available SL corpus Dicta-Sign, limited to its French Sign Language part. Involving 16 different signers, this dialogue corpus was produced with very few constraints on the style and content. It includes lexical and non-lexical annotations over 11 hours of video recording, with 35000 manual units. With the aim of stimulating research in SL understanding, we also provide a baseline for the recognition of lexical signs and non-lexical structures on this corpus. A very compact modeling of a signer is built and a Convolutional-Recurrent Neural Network is trained and tested on Dicta-Sign-LSF-v2, with state-of-the-art results, including the ability to detect iconicity in SL production.


An HMM Approach with Inherent Model Selection for Sign Language and Gesture Recognition

Sandrine Tornay, Oya Aran and Mathew Magimai Doss

HMMs have been the one of the first models to be applied for sign recognition and have become the baseline models due to their success in modeling sequential and multivariate data. Despite the extensive use of HMMs for sign recognition, determining the HMM structure has still remained as a challenge, especially when the number of signs to be modeled is high. In this work, we present a continuous HMM framework for modeling and recognizing isolated signs, which inherently performs model selection to optimize the number of states for each sign separately during recognition. Our experiments on three different datasets, namely, German sign language DGS dataset, Turkish sign language HospiSign dataset and Chalearn14 dataset show that the proposed approach achieves better sign language or gesture recognition systems in comparison to the approach of selecting or presetting the number of HMM states based on k-means, and yields systems that perform competitive to the case where the number of states are determined based on the test set performance.


VROAV: Using Iconicity to Visually Represent Abstract Verbs

Simone Scicluna and Carlo Strapparava

For a long time, philosophers, linguists and scientists have been keen on finding an answer to the mind-bending question “what does abstract language look like?", which has also sprung from the phenomenon of mental imagery and how this emerges in the mind. One way of approaching the matter of word representations is by exploring the common semantic elements that link words to each other. Visual languages like sign languages have been found to reveal enlightening patterns across signs of similar meanings, pointing towards the possibility of identifying clusters of iconic meanings. With this insight, merged with an understanding of verb predicates achieved from VerbNet, this study presents a novel verb classification system based on visual shapes, using graphic animation to visually represent 20 classes of abstract verbs. Considerable agreement between participants who judged the graphic animations based on representativeness suggests a positive way forward for this proposal, which may be developed as a language learning aid in educational contexts or as a multimodal language comprehension tool for digital text.


MEDIAPI-SKEL - A 2D-Skeleton Video Database of French Sign Language With Aligned French Subtitles

Hannah Bull, Annelies Braffort and Michèle Gouiffès

This paper presents MEDIAPI-SKEL, a 2D-skeleton database of French Sign Language videos aligned with French subtitles. The corpus contains 27 hours of video of body, face and hand keypoints, aligned to subtitles with a vocabulary size of 17k tokens. In contrast to existing sign language corpora such as videos produced under laboratory conditions or translations of TV programs into sign language, this database is constructed using original sign language content largely produced by deaf journalists at the media company Média-Pi. Moreover, the videos are accurately synchronized with French subtitles. We propose three challenges appropriate for this corpus that are related to processing units of signs in context: automatic alignment of text and video, semantic segmentation of sign language, and production of video-text embeddings for cross-modal retrieval. These challenges deviate from the classic task of identifying a limited number of lexical signs in a video stream.


Alignment Data base for a Sign Language Concordancer

Marion Kaczmarek and Michael Filhol

This article deals with elaborating a data base of alignments of parallel Franch-LSF segments. This data base is meant to be searched using a concordancer which we are also designing. We wish to equip Sign Language translators with tools similar to those used in text-to-text translation. To do so, we need language resources to feed them. Already existing Sign Language corpora can be found, but do not match our needs: working around a Sign Language concordancer, the corpus must be a parallel one and provide various examples of vocabulary and grammatical construction. We started with a parallel corpus of 40 short news and 120 SL videos , which we aligned manually by segments of various length. We described the methodology we used, how we define our segments and alignments. The last part concerns how we hope to allow the data base to keep growing in a near future.


Evaluation of Manual and Non-manual Components for Sign Language Recognition

Medet Mukushev, Arman Sabyrov, Alfarabi Imashev, Kenessary Koishybay, Vadim Kimmelman and Anara Sandygulova

The motivation behind this work lies in the need to differentiate between similar signs that differ in non-manual components present in any sign. To this end, we recorded full sentences signed by five native signers and extracted 5200 isolated sign samples of twenty frequently used signs in Kazakh-Russian Sign Language (K-RSL), which have similar manual components but differ in non-manual components (i.e. facial expressions, eyebrow height, mouth, and head orientation). We conducted a series of evaluations in order to investigate whether non-manual components would improve sign's recognition accuracy. Among standard machine learning approaches, Logistic Regression produced the best results, 78.2% of accuracy for dataset with 20 signs and 77.9% of accuracy for dataset with 2 classes (statement vs question). Dataset can be downloaded from the following website:


TheRuSLan: Database of Russian Sign Language

Ildar Kagirov, Denis Ivanko, Dmitry Ryumin, Alexander Axyonov and Alexey Karpov

In this paper, a new Russian sign language multimedia database TheRuSLan is presented. The database includes lexical units (single words and phrases) from Russian sign language within one subject area, namely, "food products at the supermarket", and was collected using MS Kinect 2.0 device including both FullHD video and the depth map modes, which provides new opportunities for the lexicographical description of the Russian sign language vocabulary and enhances research in the field of automatic gesture recognition. Russian sign language has an official status in Russia, and over 120,000 deaf people in Russia and its neighboring countries use it as their first language. Russian sign language has no writing system, is poorly described and belongs to the low-resource languages.

The authors formulate the basic principles of annotation of sign words, based on the collected data, and reveal the content of the collected database. In the future, the database will be expanded and comprise more lexical units. The database is explicitly made for the task of creating an automatic system for Russian sign language recognition.


Social Media Processing

Back to Top

A Survey on Natural Language Processing for Fake News Detection

Ray Oshikawa, Jing Qian and William Yang Wang

Fake news detection is a critical yet challenging problem in Natural Language Processing (NLP). The rapid rise of social networking platforms has not only yielded a vast increase in information accessibility but has also accelerated the spread of fake news. Thus, the effect of fake news has been growing, sometimes extending to the offline world and threatening public safety. Given the massive amount of Web content, automatic fake news detection is a practical NLP problem useful to all online content providers, in order to reduce the human time and effort to detect and prevent the spread of fake news. In this paper, we describe the challenges involved in fake news detection and also describe related tasks. We systematically review and compare the task formulations, datasets and NLP solutions that have been developed for this task, and also discuss the potentials and limitations of them. Based on our insights, we outline promising research directions, including more fine-grained, detailed, fair, and practical detection models. We also highlight the difference between fake news detection and other related tasks, and the importance of NLP solutions for fake news detection.


RP-DNN: A Tweet Level Propagation Context Based Deep Neural Networks for Early Rumor Detection in Social Media

Jie Gao, Sooji Han, Xingyi Song and Fabio Ciravegna

Early rumor detection (ERD) on social media platform is very challenging when limited, incomplete and noisy information is available. Most of the existing methods have largely worked on event-level detection that requires the collection of posts relevant to a specific event and relied only on user-generated content. They are not appropriate to detect rumor sources in the very early stages, before an event unfolds and becomes widespread. In this paper, we address the task of ERD at the message level. We present a novel hybrid neural network architecture, which combines a task-specific character-based bidirectional language model and stacked Long Short-Term Memory (LSTM) networks to represent textual contents and social-temporal contexts of input source tweets, for modelling propagation patterns of rumors in the early stages of their development. We apply multi-layered attention models to jointly learn attentive context embeddings over multiple context inputs. Our experiments employ a stringent leave-one-out cross-validation (LOO-CV) evaluation setup on seven publicly available real-life rumor event data sets. Our models achieve state-of-the-art(SoA) performance for detecting unseen rumors on large augmented data which covers more than 12 events and 2,967 rumors. An ablation study is conducted to understand the relative contribution of each component of our proposed model.


Issues and Perspectives from 10,000 Annotated Financial Social Media Data

Chung-Chi Chen, Hen-Hsen Huang and Hsin-Hsi Chen

In this paper, we investigate the annotation of financial social media data from several angles. We present Fin-SoMe, a dataset with 10,000 labeled financial tweets annotated by experts from both the front desk and the middle desk in a bank's treasury. These annotated results reveal that (1) writer-labeled market sentiment may be a misleading label; (2) writer's sentiment and market sentiment of an investor may be different; (3) most financial tweets provide unfounded analysis results; and (4) almost no investors write down the gain/loss results for their positions, which would otherwise greatly facilitate detailed evaluation of their performance. Based on these results, we address various open problems and suggest possible directions for future work on financial social media data. We also provide an experiment on the key snippet extraction task to compare the performance of using a general sentiment dictionary and using the domain-specific dictionary. The results echo our findings from the experts' annotations.


Searching Brazilian Twitter for Signs of Mental Health Issues

Wesley Santos, Amanda Funabashi and Ivandré Paraboni

Depression and related mental health issues are often reflected in the language employed by the individuals who suffer from these conditions and, accordingly, research in Natural Language Processing (NLP) and related fields have developed an increasing number of studies devoted to their recognition in social media text. Some of these studies have also attempted to go beyond recognition by focusing on the early signs of these illnesses, and by analysing the users' publication history over time to potentially prevent further harm. The two kinds of study are of course overlapping, and often make use of supervised machine learning methods based on annotated corpora. However, as in many other fields, existing resources are largely devoted to English NLP, and there is little support for these studies in under resourced languages.To bridge this gap, in this paper we describe the initial steps towards building a novel resource of this kind - a corpus intended to support both the recognition of mental health issues and the temporal analysis of these illnesses - in the Brazilian Portuguese language, and initial results of a number of experiments in text classification addressing both tasks.


RedDust: a Large Reusable Dataset of Reddit User Traits

Anna Tigunova, Paramita Mirza, Andrew Yates and Gerhard Weikum

Social media is a rich source of assertions about personal traits, such as "I am a doctor" or "my hobby is playing tennis". Precisely identifying explicit assertions is difficult, though, because of the users’ highly varied vocabulary and language expressions. Identifying personal traits from implicit assertions like I’ve been at work treating patients all day is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling for over 300k Reddit users across five attributes: profession, hobby, family status, age,and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models to deal with implicit assertions. RedDust consists of users’ personal traits, which are (attribute, value) pairs,  along with users’ post ids, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and show interesting statistics and insights into the data. We also compare different classifiers, which can be learned from RedDust. To the best of our knowledge, RedDust is the first annotated language resource about Reddit users at large scale. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.


An Annotated Social Media Corpus for German

Eckhard Bick

This paper presents the German Twitter section of a large (2 billion word) bilingual Social Media corpus for Hate Speech research, discussing the compilation, pseudonymization and grammatical annotation of the corpus, as well as special linguistic features and peculiarities encountered in the data. Among other things, compounding, accidental and intentional orthographic variation, gendering and the use of emoticons/emojis are addressed in a genre-specific fashion. We present the different layers of linguistic annotation (morphosyntactic, dependencies and semantic types) and explain how a general parser (GerGram) can be made to work on Social Media data, pointing out necessary adaptations and extensions. In an evaluation run on a random cross-section of tweets, the modified parser achieved F-scores of 97% for morphology (fine-grained POS) and 92% for syntax (labeled attachment score). Predictably, performance was twice as good in tweets with standard orthography than in tweets with spelling/casing irregularities or lack of sentence separation, the effect being more marked for morphology than for syntax.


The rJokes Dataset: a Large Scale Humor Collection

Orion Weller and Kevin Seppi

Humor is a complicated language phenomenon that depends upon many factors, including topic, date, and recipient.  Because of this variation, it can be hard to determine what exactly makes a joke humorous, leading to difficulties in joke identification and related tasks.  Furthermore, current humor datasets are lacking in both joke variety and size, with almost all current datasets having less than 100k jokes.  In order to alleviate this issue we compile a collection of over 550,000 jokes posted over an 11 year period on the Reddit r/Jokes subreddit (an online forum), providing a large scale humor dataset that can easily be used for a myriad of tasks.  This dataset also provides quantitative metrics for the level of humor in each joke, as determined by subreddit user feedback.  We explore this dataset through the years, examining basic statistics, most mentioned entities, and sentiment proportions.  We also introduce this dataset as a task for future work, where models learn to predict the level of humor in a joke.  On that task we provide strong state-of-the-art baseline models and show room for future improvement.  We hope that this dataset will not only help those researching computational humor, but also help social scientists who seek to understand popular culture through humor.


EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus

Thomas Proisl, Natalie Dykes, Philipp Heinrich, Besim Kabashi, Andreas Blombach and Stefan Evert

The EmpiriST corpus (Beißwenger et al., 2016) is a manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC (computer-mediated communication) data. We extend the corpus with manually created annotation layers for word form normalization, lemmatization and lexical semantics. All annotations have been independently performed by multiple human annotators. We report inter-annotator agreements and results of baseline systems and state-of-the-art off-the-shelf tools.


Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection

Kai Nakamura, Sharon Levy and William Yang Wang

Fake news has altered society in negative ways in politics and culture. It has adversely affected both online social network systems as well as offline communities and conversations. Using automatic machine learning classification models is an efficient way to combat the widespread dissemination of fake news. However, a lack of effective, comprehensive datasets has been a problem for fake news research and detection model development. Prior fake news datasets do not provide multimodal text and image data, metadata, comment data, and fine-grained fake news categorization at the scale and breadth of our dataset. We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. We construct hybrid text+image models and perform extensive experiments for multiple variations of classification, demonstrating the importance of the novel aspect of multimodality and fine-grained classification unique to Fakeddit.


Optimising Twitter-based Political Election Prediction with Relevance andSentiment Filters

Eric Sanders and Antal van den Bosch

We study the relation between the number of mentions of political parties in the last weeks before the elections and the election results.In this paper we focus on the Dutch elections of the parliament in 2012 and for the provinces (and the senate) in 2011 and 2015.  With raw counts, without adaptations, we achieve a mean absolute error (MAE) of 2.71% for 2011, 2.02% for 2012 and 2.89% for 2015. A set of over 17,000 tweets containing political party names were annotated by at least three annotators per tweet on ten features denoting communicative intent (including the presence of sarcasm,  the message’s polarity,  the presence of an explicit voting endorsement or explicit voting advice, etc.). The annotations were used to create oracle (gold-standard) filters. Tweets with or without a certain majority annotation are held out from the tweet counts, with the goal of attaining lower MAEs. With a grid search we tested all combinations of filters and their responding MAE to find the best filter ensemble.  It appeared that the filters show markedly different behaviour for the three elections and only a small MAE improvement is possible when optimizing on all three elections.  Larger improvements for one election are possible, but result in deterioration of the MAE for the other elections.


A Real-Time System for Credibility on Twitter

Adrian Iftene, Daniela Gifu, Andrei-Remus Miron and Mihai-Stefan Dudu

Nowadays, social media credibility is a pressing issue for each of us who are living in an altered online landscape. The speed of news diffusion is striking. Given the popularity of social networks, more and more users began posting pictures, information, and news about personal life. At the same time, they started to use all this information to get informed about what their friends do or what is happening in the world, many of them arousing much suspicion. The problem we are currently experiencing is that we do not currently have an automatic method of figuring out in real-time which news or which users are credible and which are not, what is false or what is true on the Internet. The goal of this is to analyze Twitter in real-time using neural networks in order to provide us key elements about both the credibility of tweets and users who posted them. Thus, we make a real-time heatmap using information gathered from users to create overall images of the areas from which this fake news comes.


A Corpus of Turkish Offensive Language on Social Media

Çağrı Çöltekin

This paper introduces a corpus of Turkish offensive language. To our knowledge, this is the first corpus of offensive language for Turkish. The corpus consists of randomly sampled micro-blog posts from Twitter. The annotation guidelines are based on a careful review of the annotation practices of recent efforts for other languages. The corpus contains 36 232 tweets sampled randomly from the Twitter stream during a period of 18 months between Apr 2018 to Sept 2019. We found approximately 19 % of the tweets in the data contain some type of offensive language, which is further subcategorized based on the target of the offense. We describe the annotation process, discuss some interesting aspects of the data, and present results of automatically classifying the corpus using state-of-the-art text classification methods. The classifiers achieve 77.3 % F1 score on identifying offensive tweets, 77.9 % F1 score on determining whether a given offensive document is targeted or not, and 53.0 % F1 score on classifying the targeted offensive documents into three subcategories.


From Witch’s Shot to Music Making Bones - Resources for Medical Laymen to Technical Language and Vice Versa

Laura Seiffe, Oliver Marten, Michael Mikhailov, Sven Schmeier, Sebastian Möller and Roland Roller

Many people share information in social media or forums, like food they eat, sports activities they do or events which have been visited. Information we share online unveil directly or indirectly information about our lifestyle and health situation. Particularly when text input is getting longer or multiple messages can be linked to each other. Those information can be then used to detect possible risk factors of diseases or adverse drug reactions of medications.  However, as most people are not medical experts, language used might be more descriptive rather than the precise medical expression as medics do. To detect and use those relevant information, laymen language has to be translated and/or linked against the corresponding medical concept. This work presents baseline data sources in order to address this challenge for German language. We introduce a new dataset which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.


I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language

Tommaso Caselli, Valerio Basile, Jelena Mitrović, Inga Kartoziya and Michael Granitzer

Abusive language detection is an unsolved and challenging problem for the NLP community. Recent literature suggests various approaches to distinguish between different language phenomena (e.g., hate speech vs. cyberbullying vs. offensive language) and factors (degree of explicitness and target) that may help to classify different abusive language phenomena. There are data sets that annotate the target of abusive messages (i.e.OLID/OffensEval (Zampieri et al., 2019a)). However, there is a lack of data sets that take into account the degree of explicitness. In this paper, we propose annotation guidelines to distinguish between explicit and implicit abuse in English and apply them to OLID/OffensEval. The outcome is a newly created resource, AbuseEval v1.0, which aims to address some of the existing issues in the annotation of offensive and abusive language (e.g., explicitness of the message, presence of a target, need of context, and interaction across different phenomena).


A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection

Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed Abdelali, Soon-gyo Jung, Bernard J Jansen and Joni Salminen

Access to social media often enables users to engage in conversation with limited accountability. This allows a user to share their opinions and ideology, especially regarding public content, occasionally adopting offensive language. This may encourage hate crimes or cause mental harm to targeted individuals or groups. Hence, it is important to detect offensive comments in social media platforms. Typically, most studies focus on offensive commenting in one platform only, even though the problem of offensive language is observed across multiple platforms. Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment dataset, collected from multiple social media platforms, including Twitter, Facebook, and YouTube. We follow two-step crowd-annotator selection criteria for low-representative language annotation task in a crowdsourcing platform. Furthermore, we analyze the distinctive lexical content along with the use of emojis in offensive comments. We train and evaluate the classifiers using the annotated multi-platform dataset along with other publicly available data. Our results highlight the importance of multiple platform dataset for (a) cross-platform, (b) cross-domain, and (c) cross-dialect generalization of classifier performance.


Twitter Trend Extraction: A Graph-based Approach for Tweet and Hashtag Ranking, Utilizing No-Hashtag Tweets

Zahra Majdabadi, Behnam Sabeti, Preni Golazizian, Seyed Arad Ashrafi Asli, Omid Momenzadeh and Reza Fahmi

Twitter has become a major platform for users to express their opinions on any topic and engage in debates. User debates and interactions usually lead to massive content regarding a specific topic which is called a Trend. Twitter trend extraction aims at finding these relevant groups of content that are generated in a short period. The most straightforward approach for this problem is using Hashtags, however, tweets without hashtags are not considered this way. In order to overcome this issue and extract trends using all tweets, we propose a graph-based approach where graph nodes represent tweets as well as words and hashtags. More specifically, we propose a modified version of RankClus algorithm to extract trends from the constructed tweets graph. The proposed approach is also capable of ranking tweets, words and hashtags in each trend with respect to their importance and relevance to the topic. The proposed algorithm is used to extract trends from several twitter datasets, where it produced consistent and coherent results.


A French Corpus for Event Detection on Twitter

Béatrice Mazoyer, Julia Cagé, Nicolas Hervé and Céline Hudelot

We present Event2018, a corpus annotated for event detection tasks, consisting of 38 million tweets in French (retweets excluded) including more than 130,000 tweets manually annotated by three annotators as related or unrelated to a given event. The 243 events were selected both from press articles and from subjects trending on Twitter during the annotation period (July to August 2018). In total, more than 95,000 tweets were annotated as related to one of the selected events. We also provide the titles and URLs of 15,500  news articles automatically detected as related to these events. In addition to this corpus, we detail the results of our event detection experiments on both this dataset and another publicly available dataset of tweets in English. We ran extensive tests with different types of text embeddings and a standard Topic Detection and Tracking algorithm, and detail our evaluation method. We show that tf-idf vectors allow the best performance for this task on both corpora. These results are intended to serve as a baseline for researchers wishing to test their own event detection systems on our corpus.


Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling

Arindam Chatterjere, Vineeth Guptha, Parul Chopra and Amitava Das

Code-Mixing (CM) or language mixing is a social norm in multilingual societies. CM is quite prevalent in social media conversations in multilingual regions like - India, Europe, Canada and Mexico. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish text. In recent times, there have been several success stories with neural language modeling like Generative Pre-trained Transformer (GPT) (Radford et al., 2019), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) etc.. Hence, neural language models have become the new holy grail of modern NLP, although LM for CM is an unexplored area altogether. To better understand the problem of LM for CM, we initially experimented with several statistical language modeling techniques and consequently experimented with contemporary neural language models. Analysis shows switching-points are the main challenge for the LMCM performance drop, therefore in this paper we introduce the idea of minority positive sampling to selectively induce more sample to achieve better performance. On the contrary, all neural language models demand a huge corpus to train on for better performance. Finally, we are reporting a perplexity of 139 for Hinglish (Hindi-English language pair) LMCM using statistical bi-directional techniques.


Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media

Endang Wahyu Pamungkas, Valerio Basile and Viviana Patti

Swearing plays an ubiquitous role in everyday conversations among humans, both in oral and textual communication, and occurs frequently in social media texts, typically featured by informal language and spontaneous writing. Such occurrences can be linked to an abusive context, when they contribute to the expression of hatred and to the abusive effect, causing harm and offense. However, swearing is multifaceted and is often used in casual contexts, also with positive social functions. In this study, we explore the phenomenon of swearing in Twitter conversations, taking the possibility of predicting the abusiveness of a swear word in a tweet context as the main investigation perspective.  We developed the Twitter English corpus SWAD (Swear Words Abusiveness Dataset), where abusive swearing is manually annotated at the word level.  Our collection consists of 1,511 unique swear words from 1,320 tweets. We developed models to automatically predict abusive swearing, to provide an intrinsic evaluation of SWAD and  confirm the robustness of the resource. We also present the results of a glass box ablation study in order to investigate which lexical, syntactic, and affective features are more informative towards the automatic prediction of the function of swearing.


Detecting Troll Tweets in a Bilingual Corpus

Lin Miao, Mark Last and Marina Litvak

During the past several years, a large amount of troll accounts has emerged with efforts to manipulate public opinion on social network sites. They are often involved in spreading misinformation, fake news, and propaganda with the intent of distracting and sowing discord. This paper aims to detect troll tweets in both English and Russian assuming that the tweets are generated by some "troll farm." We reduce this task to the authorship verification problem of determining whether a single tweet is authored by a "troll farm" account or not. We evaluate a supervised classification approach with monolingual, cross-lingual, and bilingual training scenarios, using several machine learning algorithms, including deep learning. The best results are attained by the bilingual learning, showing the area under the ROC curve (AUC) of 0.875 and 0.828, for tweet classification in English and Russian test sets, respectively. It is noteworthy that these results are obtained using only raw text features, which do not require manual feature engineering efforts. In this paper, we introduce a resource of English and Russian troll tweets containing original tweets and translation from English to Russian, Russian to English. It is available for academic purposes.


Collecting Tweets to Investigate Regional Variation in Canadian English

Filip Miletic, Anne Przewozny-Desriaux and Ludovic Tanguy

We present a 78.8-million-tweet, 1.3-billion-word corpus aimed at studying regional variation in Canadian English with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. Our data collection and filtering pipeline reflects complex design criteria, which aim to allow for both data-intensive modeling methods and user-level variationist sociolinguistic analysis. It specifically consists in identifying Twitter users from the three cities, crawling their entire timelines, filtering the collected data in terms of user location and tweet language, and automatically excluding near-duplicate content. The resulting corpus mirrors national and regional specificities of Canadian English, it provides sufficient aggregate and user-level data, and it maintains a reasonably balanced distribution of content across regions and users. The utility of this dataset is illustrated by two example applications: the detection of regional lexical and topical variation, and the identification of contact-induced semantic shifts using vector space models. In accordance with Twitter’s developer policy, the corpus will be publicly released in the form of tweet IDs.


DAICT: A Dialectal Arabic Irony Corpus Extracted from Twitter

Ines Abbes, Wajdi Zaghouani, Omaima El-Hardlo and Faten Ashour

Identifying irony in user-generated social media content has a wide range of applications; however to date Arabic content has received limited attention. To bridge this gap, this study builds a new open domain Arabic corpus annotated for irony detection. We query Twitter using irony-related hashtags to collect ironic messages, which are then manually annotated by two linguists according to our working definition of irony. Challenges which we have encountered during the annotation process reflect the inherent limitations of Twitter messages interpretation, as well as the complexity of Arabic and its dialects. Once published, our corpus will be a valuable free resource for developing open domain systems for automatic irony recognition in Arabic language and its dialects in social media text.


Norm It! Lexical Normalization for Italian and Its Downstream Effects for Dependency Parsing

Rob van der Goot, Alan Ramponi, Tommaso Caselli, Michele Cafagna and Lorenzo De Mattei

Lexical  normalization  is  the  task  of  translating  non-standard  social  media  data  to  a  standard  form.   Previous  work  has  shown  that this is beneficial for many downstream tasks in multiple languages.  However, for Italian, there is no benchmark available for lexical normalization,  despite the presence of many benchmarks for other tasks involving social media data.   In this paper,  we discuss the creation of a lexical normalization dataset for Italian. After two rounds of annotation, a Cohen’s kappa score of 78.64 is obtained. During this process, we also analyze the inter-annotator agreement for this task, which is only rarely done on datasets for lexical normalization,and when it is reported, the analysis usually remains shallow. Furthermore, we utilize this dataset to train a lexical normalization model and show that it can be used to improve dependency parsing of social media data. All annotated data and the code to reproduce the results are available at:


TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus

Elisa Gugliotta and Marco Dinarelli

This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and "arithmographs" (numbers used as letters). This code-system was developed by Arabic-speaking users of social media in order to facilitate the writing in the Computer-Mediated Communication (CMC) and text messaging informal frameworks. Arabish differs for each Arabic dialect and each Arabish code-system is under-resourced, in the same way as most of the Arabic dialects. In the last few years, the attention of NLP studies on Arabic dialects has considerably increased. Taking this into consideration, TArC will be a useful support for different types of analyses, computational and linguistic, as well as for NLP tools training. In this article we will describe preliminary work on the TArC semi-automatic construction process and some of the first analyses we developed on TArC. In addition, in order to provide a complete overview of the challenges faced during the building process, we will present the main Tunisian dialect characteristics and its encoding in Tunisian Arabish.


Small Town or Metropolis? Analyzing the Relationship between Population Size and Language

Amy Rechkemmer, Steven Wilson and Rada Mihalcea

The variance in language used by different cultures has been a topic of study for researchers in linguistics and psychology, but often times, language is compared across multiple countries in order to show a difference in culture. As a geographically large country that is diverse in population in terms of the background and experiences of its citizens, the U.S. also contains cultural differences within its own borders. Using a set of over 2 million posts from distinct Twitter users around the country dating back as far as 2014, we ask the following question: is there a difference in how Americans express themselves online depending on whether they reside in an urban or rural area? We categorize Twitter users as either urban or rural and identify ideas and language that are more commonly expressed in tweets written by one population over the other. We take this further by analyzing how the language from specific cities of the U.S. compares to the language of other cities and by training predictive models to predict whether a user is from an urban or rural area. We publicly release the tweet and user IDs that can be used to reconstruct the dataset for future studies in this direction.


Inferring Social Media Users' Mental Health Status from Multimodal Information

Zhentao Xu, Verónica Pérez-Rosas and Rada Mihalcea

Worldwide, an increasing number of people are suffering from mental health disorders such as depression and anxiety. In the United States alone, one in every four adults suffers from a mental health condition, which makes mental health a pressing concern. In this paper, we explore the use of multimodal cues present in social media posts to predict users' mental health status. Specifically, we focus on identifying social media activity that either indicates a mental health condition or its onset. We collect posts from Flickr and apply a multimodal approach that consists of jointly analyzing language, visual, and metadata cues and their relation to mental health. We conduct several classification experiments aiming to discriminate between (1) healthy users and users affected by a mental health illness; and (2) healthy users and users prone to mental illness. Our experimental results indicate that using multiple modalities can improve the performance of this classification task as compared to the use of one modality at a time, and can provide important cues into a user's mental status.


Synthetic Data for English Lexical Normalization: How Close Can We Get to Manually Annotated Data?

Kelly Dekker and Rob van der Goot

Social media is a valuable data resource for various natural language processing (NLP) tasks. However, standard NLP tools were often designed with standard texts in mind, and their performance decreases heavily when applied to social media data. One solution to this problem is to adapt the input text to a more standard form, a task also referred to as normalization.  Automatic approaches to normalization have shown that they can be used to improve performance on a variety of NLP tasks. However, all of these systems are supervised, thereby being heavily dependent on the availability of training data for the correct language and domain. In this work, we attempt to overcome this dependence by automatically generating training data for lexical normalization.  Starting with raw tweets, we attempt two directions, to insert non-standardness (noise) and to automatically normalize in an unsupervised setting. Our best results are achieved by automatically inserting noise. We evaluate our approaches by using an existing lexical normalization system; our best scores are achieved by custom error generation system, which makes use of some manually created datasets. With this system, we score 94.29 accuracy on the test data, compared to 95.22 when it is trained on human-annotated data.  Our best system which does not depend on any type of annotation is based on word embeddings and scores 92.04 accuracy.  Finally, we perform an experiment in which we asked humans to predict whether a sentence was written by a human or generated by our best model.  This experiment showed that in most cases it is hard for a human to detect automatically generated sentences.


A Corpus of German Reddit Exchanges (GeRedE)

Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi and Thomas Proisl

GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.


French Tweet Corpus for Automatic Stance Detection

Marc Evrard, Rémi Uro, Nicolas Hervé and Béatrice Mazoyer

The automatic stance detection task consists in determining the attitude expressed in a text toward a target (text, claim, or entity). This is a typical intermediate task for the fake news detection or analysis, which is a considerably widespread and a particularly difficult issue to overcome. This work aims at the creation of a human-annotated corpus for the automatic stance detection of tweets written in French. It exploits a corpus of tweets collected during July and August 2018. To the best of our knowledge, this is the first freely available stance annotated tweet corpus in the French language. The four classes broadly adopted by the community were chosen for the annotation: support, deny, query, and comment with the addition of the ignore class. This paper presents the corpus along with the tools used to build it, its construction, an analysis of the inter-rater reliability, as well as the challenges and questions that were raised during the building process.


LSCP: Enhanced Large Scale Colloquial Persian Language Understanding

Hadi Abdi Khojasteh, Ebrahim Ansari and Mahdi Bohlouli

Language recognition has been significantly advanced in recent years by means of modern machine learning methods such as deep learning and benchmarks with rich annotations. However, research is still limited in low-resource formal languages. This consists of a significant gap in describing the colloquial language especially for low-resourced ones such as Persian. In order to target this gap for low resource languages, we propose a "Large Scale Colloquial Persian Dataset" (LSCP). LSCP is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in the human-level sentences, which naturally captures from the real-world sentences. We believe that further investigations and processing, as well as the application of novel algorithms and methods, can strengthen enriching computerized understanding and processing of low resource languages. The proposed corpus consists of 120M sentences resulted from 27M tweets annotated with parsing tree, part-of-speech tags, sentiment polarity and translation in five different languages.

Speech Recognition and Synthesis

Back to Top


Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech

Yin May Oo, Theeraphol Wattanavekin, Chenfang Li, Pasindu De Silva, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin Jansche, Oddur Kjartansson and Alexander Gutkin

This paper introduces an open-source crowd-sourced multi-speaker speech corpus along with the comprehensive set of finite-state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce the open-source finite-state grammars for performing grapheme-to-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high-quality text-to-speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite state-based approach to Burmese text normalization and G2P. Our experiments involve building a multi-speaker TTS system based on long short term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build reasonably high-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be low-resource due to the limited availability of free linguistic data.


Evaluating and Improving Child-Directed Automatic Speech Recognition

Eric Booth, Jake Carns, Casey Kennington and Nader Rafla

Speech recognition has seen dramatic improvements in the last decade, though those improvements have focused primarily on adult speech. In this paper, we assess child-directed speech recognition and leverage a transfer learning approach to improve child-directed speech recognition by training the recent DeepSpeech2 model on adult data, then apply additional tuning to varied amounts of child speech data. We evaluate our model using the CMU Kids dataset as well as our own recordings of child-directed prompts. The results from our experiment show that even a small amount of child audio data improves significantly over a baseline of adult-only or child-only trained models. We report a final general Word-Error-Rate of 29% over a baseline of 62% that uses the adult-trained model. Our analyses show that our model adapts quickly using a small amount of data and that the general child model works better than school grade-specific models. We make available our trained model and our data collection tool.


Parallel Corpus for Japanese Spoken-to-Written Style Conversion

Mana Ihori, Akihiko Takashima and Ryo Masumura

With the increase of automatic speech recognition (ASR) applications, spoken-to-written style conversion that transforms spoken-style text into written-style text is becoming an important technology to increase the readability of ASR transcriptions. To establish such conversion technology, a parallel corpus of spoken-style text and written-style text is beneficial because it can be utilized for building end-to-end neural sequence transformation models. Spoken-to-written style conversion involves multiple conversion problems including punctuation restoration, disfluency detection, and simplification. However, most existing corpora tend to be made for just one of these conversion problems. In addition, in Japanese, we have to consider not only general spoken-to-written style conversion problems but also Japanese-specific ones, such as language style unification (e.g., polite, frank, and direct styles) and omitted postpositional particle expressions restoration. Therefore, we created a new Japanese parallel corpus of spoken-style text and written-style text that can simultaneously handle general problems and Japanese-specific ones. To make this corpus, we prepared four types of spoken-style text and utilized a crowdsourcing service for manually converting them into written-style text. This paper describes the building setup of this corpus and reports the baseline results of spoken-to-written style conversion using the latest neural sequence transformation models.


Multi-Staged Cross-Lingual Acoustic Model Adaption for Robust Speech Recognition in Real-World Applications - A Case Study on German Oral History Interviews

Michael Gref, Oliver Walter, Christoph Schmidt, Sven Behnke and Joachim Köhler

While recent automatic speech recognition systems achieve remarkable performance when large amounts of adequate, high quality annotated speech data is used for training, the same systems often only achieve an unsatisfactory result for tasks in domains that greatly deviate from the conditions represented by the training data. For many real-world applications, there is a lack of sufficient data that can be directly used for training robust speech recognition systems. To address this issue, we propose and investigate an approach that performs a robust acoustic model adaption to a target domain in a cross-lingual, multi-staged manner. Our approach enables the exploitation of large-scale training data from other domains in both the same and other languages. We evaluate our approach using the challenging task of German oral history interviews, where we achieve a relative reduction of the word error rate by more than 30% compared to a model trained from scratch only on the target domain, and 6-7% relative compared to a model trained robustly on 1000 hours of same-language out-of-domain training data.


Large Corpus of Czech Parliament Plenary Hearings

Jonas Kratochvil, Peter Polak and Ondrej Bojar

We present a large corpus of Czech parliament plenary sessions.  The corpus consists of approximately 1200 hours of speech data and corresponding text transcriptions.  The whole corpus has been segmented to short audio segments making it suitable for both training and evaluation of automatic speech recognition (ASR) systems.  The source language of the corpus is Czech, which makes it a valuable resource for future research as only a few public datasets are available in the Czech language. We complement the data release with experiments of two baseline ASR systems trained on the presented data:  the more traditional approach implemented in the Kaldi ASRtoolkit which combines hidden Markov models and deep neural networks (NN) and a modern ASR architecture implemented in Jaspertoolkit which uses deep NNs in an end-to-end fashion.


Augmented Prompt Selection for Evaluation of Spontaneous Speech Synthesis

Eva Szekely, Jens Edlund and joakim gustafson

By definition, spontaneous speech is unscripted and created on the fly by the speaker. It is dramatically different from read speech, where the words are authored as text before they are spoken. Spontaneous speech is emergent and transient, whereas text read out loud is pre-planned. For this reason, it is unsuitable to evaluate the usability and appropriateness of spontaneous speech synthesis by having it read out written texts sampled from for example newspapers or books. Instead, we need to use transcriptions of speech as the target - something that is much less readily available. In this paper, we introduce Starmap, a tool allowing developers to select a varied, representative set of utterances from a spoken genre, to be used for evaluation of TTS for a given domain. The selection can be done from any speech recording, without the need for transcription. The tool uses interactive visualisation of prosodic features with t-SNE, along with a tree-based algorithm to guide the user through thousands of utterances and ensure coverage of a variety of prompts. A listening test has shown that with a selection of genre-specific utterances, it is possible to show significant differences across genres between two synthetic voices built from spontaneous speech.


ATC-ANNO: Semantic Annotation for Air Traffic Control with Assistive Auto-Annotation

Marc Schulder, Johannah O’Mahony, Yury Bakanouski and Dietrich Klakow

In air traffic control, assistant systems support air traffic controllers in their work. To improve the reactivity and accuracy of the assistant, automatic speech recognition can monitor the commands uttered by the controller. However, to provide sufficient training data for the speech recognition system, many hours of air traffic communications have to be transcribed and semantically annotated. For this purpose we developed the annotation tool ATC-ANNO. It provides a number of features to support the annotator in their task, such as auto-complete suggestions for semantic tags, access to preliminary speech recognition predictions, syntax highlighting and consistency indicators. Its core assistive feature, however, is its ability to automatically generate semantic annotations. Although it is based on a simple hand-written finite state grammar, it is also able to annotate sentences that deviate from this grammar. We evaluate the impact of different features on annotator efficiency and find that automatic annotation allows annotators to cover four times as many utterances in the same time.


MASRI-HEADSET: A Maltese Corpus for Speech Recognition

Carlos Daniel Hernandez Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat and Ian Padovani

Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI HEADSET Corpus is publicly available for research/academic purposes.


Automatic Period Segmentation of Oral French

Natalia Kalashnikova, Loïc Grobol, Iris Eshkol-Taravella and François Delafontaine

Natural Language Processing in oral speech segmentation is still looking for a minimal unit to analyze. In this work, we present a comparison of two automatic segmentation methods of macro-syntactic periods which allows to take into account syntactic and prosodic components of speech. We compare the performances of an existing tool Analor (Avanzi, Lacheret-Dujour, Victorri, 2008) developed for automatic segmentation of prosodic periods and of CRF models relying on syntactic and / or prosodic features. We find that Analor tends to divide speech into smaller segments and that CRF models detect larger segments rather than macro-syntactic periods. However, in general CRF models perform better results than Analor in terms of F-measure.


Corpus Generation for Voice Command in Smart Home and the Effect of Speech Synthesis on End-to-End SLU

Thierry Desot, François Portet and Michel Vacher

Massive amounts of annotated data greatly contributed to the advance of the machine learning field.  However such large data sets are often unavailable for novel tasks performed in realistic environments such as smart homes. In this domain, semantically annotated large voice command corpora for Spoken Language Understanding (SLU) are scarce, especially for non-English languages.  We present the automatic generation process of a synthetic semantically-annotated corpus of French commands for smart-home to train pipeline and End-to-End  (E2E)  SLU  models.   SLU  is  typically  performed  through  Automatic  Speech  Recognition  (ASR)  and  Natural  Language Understanding (NLU) in a pipeline. Since errors at the ASR stage reduce the NLU performance, an alternative approach is End-to-End (E2E) SLU to jointly perform ASR and NLU. To that end, the artificial corpus was fed to a text-to-speech (TTS) system to generate synthetic speech data. All models were evaluated on voice commands acquired in a real smart home. We show that artificial data can be combined with real data within the same training set or used as a stand-alone training corpus. The synthetic speech quality was assessedby comparing it to real data using dynamic time warping (DTW).


Text and Speech-based Tunisian Arabic Sub-Dialects Identification

Najla Ben Abdallah, Saméh Kchaou and Fethi Bougares

Dialect IDentification (DID) is a challenging task, and it becomes more complicated when it is about the identification of dialects that belong to the same country. Indeed, dialects of the same country are closely related and exhibit a significant overlapping at the phonetic and lexical levels. In this paper, we present our first results on a dialect classification task covering four sub-dialects spoken in Tunisia. We use the term ’sub-dialect’ to refer to the dialects belonging to the same country. We conducted our experiments aiming to discriminate between Tunisian sub-dialects belonging to four different cities: namely Tunis, Sfax, Sousse and Tataouine. A spoken corpus of 1673 utterances is collected, transcribed and freely distributed. We used this corpus to build several speech- and text-based DID systems. Our results confirm that, at this level of granularity, dialects are much better distinguishable using the speech modality. Indeed, we were able to reach an F-1 score of 93.75% using our best speech-based identification system while the F-1 score is limited to 54.16% using text-based DID on the same test set.


Urdu Pitch Accents and Intonation Patterns in Spontaneous Conversational Speech

Luca Rognoni, Judith Bishop, Miriam Corris, Jessica Fernando and Rosanna Smith

An intonational inventory of Urdu for spontaneous conversational speech is determined based on the analysis of a hand-labelled data set of telephone conversations. An inventory of Urdu pitch accents and the basic Urdu intonation patterns observed in the data are summarised and presented using a simplified version of the Rhythm and Pitch (RaP) labelling system. The relation between pitch accents and parts of speech (PoS) is also explored. The data confirm the important role played by low pitch accents in Urdu spontaneous speech, in line with previous studies on Urdu/Hindi scripted speech. Typical pitch contours such as falling tone in statements and WH-questions, and rising tone for yes/no questions are also exhibited. Pitch accent distribution is quite free in Urdu, but the data indicate a stronger association of pitch accent with some PoS categories of content word (e.g. Nouns) when compared with function words and semantically lighter PoS categories (such as Light Verbs).  Contrastive focus is realised by an L*+H accent with a relatively large pitch excursion for the +H tone, and longer duration of the stressed syllable. The data suggest that post-focus compression (PFC) is used in Urdu as a focus-marking strategy.


IndicSpeech: Text-to-Speech Corpus for Indian Languages

Nimisha Srivastava, Rudrabha Mukhopadhyay, Prajwal K R and C V Jawahar

India is a country where several tens of languages are spoken by over a billion strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a $24$ hour text-to-speech corpus for $3$ major Indian languages namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances. The collected corpus, code, and trained models are made publicly available.


Using Automatic Speech Recognition in Spoken Corpus Curation

Jan Gorisch, Michael Gref and Thomas Schmidt

The newest generation of speech technology caused a huge increase of audio-visual data nowadays being enhanced with orthographic transcripts such as in automatic subtitling in online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option of making that data accessible. This paper tests the usability of a state-of-the-art ASR-System on a historical (from the 1960s), but regionally balanced corpus of spoken German, and a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR-System with higher recognition scores for the north of Germany vs. lower scores for the south. A detailed analysis of the narrow region data revealed -- despite relatively high ASR-confidence -- some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. correcting transcripts or transcribing from scratch. Such geography-dependent analyses can also have the potential for ASR-development to make targeted data selection for training/adaptation and to increase the sensitivity towards varieties of pluricentric languages.


Integrating Disfluency-based and Prosodic Features with Acoustics in Automatic Fluency Evaluation of Spontaneous Speech

Huaijin Deng, Youchao Lin, Takehito Utsuro, Akio Kobayashi, Hiromitsu Nishizaki and Junichi Hoshino

This paper describes an automatic fluency evaluation of spontaneous speech. In the task of automatic fluency evaluation, we integrate diverse features of acoustics, prosody, and disfluency-based ones. Then, we attempt to reveal the contribution of each of those diverse features to the task of automatic fluency evaluation. Although a variety of different disfluencies are observed regularly in spontaneous speech, we focus on two types of phenomena, i.e., filled pauses and word fragments.  The experimental results demonstrate that the disfluency-based features derived from word fragments and filled pauses are effective relative to evaluating fluent/disfluent speech, especially when combined with prosodic features, e.g., such as speech rate and pauses/silence.  Next, we employed an LSTM based framework in order to integrate the disfluency-based and prosodic features with time sequential acoustic features.  The experimental evaluation results of those integrated diverse features indicate that time sequential acoustic features contribute to improving the model with disfluency-based and prosodic features when detecting fluent speech, but not when detecting disfluent speech.  Furthermore, when detecting disfluent speech, the model without time sequential acoustic features performs best even without word fragments features, but only with filled pauses and prosodic features.


DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus

Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura and Hiroshi Saruwatari

In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis. DNN-based frameworks typically use linguistic information as input features called context instead of directly using text. In such frameworks, we can synthesize not only reading-style speech but also speech with paralinguistic and nonlinguistic features by adding such information to the context. However, it is not clear what kind of information is crucial for reproducing paralinguistic and nonlinguistic features. Therefore, we investigate the effectiveness of rich tags in DNN-based speech synthesis according to the Corpus of Spontaneous Japanese (CSJ), which has a large amount of annotations on paralinguistic features such as prosody, disfluency, and morphological features. Experimental evaluation results shows that the reproducibility of paralinguistic features of synthetic speech was enhanced by adding such information as context.


Automatic Speech Recognition for Uyghur through Multilingual Acoustic Modeling

Ayimunishagu Abulimiti and Tanja Schultz

Low-resource languages suffer from lower performance of Automatic Speech Recognition (ASR) system due to the lack of data. As a common approach, multilingual training has been applied to achieve more context coverage and has shown better performance over the monolingual training (Heigold et al., 2013). However, the difference between the donor language and the target language may distort the acoustic model trained with multilingual data, especially when much larger amount of data from donor languages is used for training the models of low-resource language. This paper presents our effort towards improving the performance of ASR system for the under-resourced Uyghur language with multilingual acoustic training. For the developing of multilingual speech recognition system for Uyghur, we used Turkish as donor language, which we selected from GlobalPhone corpus as the most similar language to Uyghur. By generating subsets of Uyghur training data, we explored the performance of multilingual speech recognition systems trained with different sizes of Uyghur and Turkish data. The best speech recognition system for Uyghur is achieved by multilingual training using all Uyghur data (10hours) and 17 hours of Turkish data and the WER is 19.17%, which corresponds to 4.95% relative improvement over monolingual training.


The SAFE-T Corpus: A New Resource for Simulated Public Safety Communications

Dana Delgado, Kevin Walker, Stephanie Strassel, Karen Jones, Christopher Caruso and David Graff

We introduce a new resource, the SAFE-T (Speech Analysis for Emergency Response Technology) Corpus, designed to simulate first-responder communications by inducing high vocal effort and urgent speech with situational background noise in a game-based collection protocol. Linguistic Data Consortium developed the SAFE-T Corpus to support the NIST (National Institute of Standards and Technology) OpenSAT (Speech Analytic Technologies) evaluation series, whose goal is to advance speech analytic technologies including automatic speech recognition, speech activity detection and keyword search in multiple domains including simulated public safety communications data. The corpus comprises over 300 hours of audio from 115 unique speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types and noise levels. Portions of the corpus have been used in the OpenSAT 2019 evaluation and the full corpus will be published in the LDC catalog. We describe the design and implementation of the SAFE-T Corpus collection, discuss the approach of capturing spontaneous speech from study participants through game-based speech collection, and report on the collection results including several challenges associated with the collection.


Lexical Tone Recognition in Mizo using Acoustic-Prosodic Features

Parismita Gogoi, Abhishek Dey, Wendy Lalhminghlui, Priyankoo Sarmah and S R Mahadeva Prasanna

Mizo is an under-studied Tibeto-Burman tonal language of the North-East India.   Preliminary research findings have confirmed that four distinct tones of Mizo (High, Low, Rising and Falling) appear in the language.  In this work, an attempt is made to automatically recognize four phonological tones in Mizo distinctively using acoustic-prosodic parameters as features.  Six features computed from Fundamental  Frequency  (F0)  contours  are  considered  and  two  classifier  models  based  on  Support  Vector  Machine  (SVM)  &  Deep Neural  Network  (DNN) are implemented for automatic tonerecognition task respectively.   The  Mizo  database  consists  of 31950 iterations of the four Mizo tones, collected from 19 speakers using trisyllabic phrases.  A four-way classification of tones is attempted with a balanced (equal number of iterations per tone category) dataset for each tone of Mizo. it is observed that the DNN based classifier shows comparable performance in correctly recognizing four phonological Mizo tones as of the SVM based classifier.


Artie Bias Corpus: An Open Dataset for Detecting Demographic Bias in Speech Applications

Josh Meyer, Lindy Rauchenstein, Joshua D. Eisenberg and Nicholas Howell

We describe the creation of the Artie Bias Corpus, an English dataset of expert­-validated <audio, transcript> pairs with demographic tags for {age, gender, accent}. We also release open software which may be used with the Artie Bias Corpus to detect demographic bias in Automatic Speech Recognition systems, and can be extended to other speech technologies. The Artie Bias Corpus is a curated subset of the Mozilla Common Voice corpus, which we release under a Creative Commons CC­0 license – the most open and permissive license for data. This article contains information on the criteria used to select and annotate the Artie Bias Corpus in addition to experiments in which we detect and attempt to mitigate bias in end­-to­-end speech recognition models. We we observe a significant accent bias in our baseline DeepSpeech model, with more accurate transcriptions of US English compared to Indian English. We do not, however, find evidence for a significant gender bias. We then show significant improvements on individual demographic groups from fine­-tuning.


Evaluation of Off-the-shelf Speech Recognizers Across Diverse Dialogue Domains

Kallirroi Georgila, Anton Leuski, Volodymyr Yanov and David Traum

We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems across diverse dialogue domains (in US-English). Our evaluation is aimed at non-experts with limited experience in speech recognition. Our goal is not only to compare a variety of ASR systems on several diverse data sets but also to measure how much ASR technology has advanced since our previous large-scale evaluations on the same data sets. Our results show that the performance of each speech recognizer can vary significantly depending on the domain. Furthermore, despite major recent progress in ASR technology, current state-of-the-art speech recognizers perform poorly in domains that require special vocabulary and language models, and under noisy conditions. We expect that our evaluation will prove useful to ASR consumers and dialogue system designers.


CEASR: A Corpus for Evaluating Automatic Speech Recognition

Malgorzata Anna Ulasik, Manuela Hürlimann, Fabian Germann, Esin Gedik, Fernando Benites and Mark Cieliebak

In this paper, we present CEASR, a Corpus for Evaluating the quality of Automatic Speech Recognition (ASR). It is a data set based on public speech corpora, containing metadata along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata. We use CEASR to evaluate the quality of ASR systems by calculating an average Word Error Rate (WER) per corpus, per system and per corpus-system pair. Our experiments show a substantial difference in accuracy between commercial versus open-source ASR tools as well as differences up to a factor ten for single systems on different corpora. Using CEASR allowed us to very efficiently and easily obtain these results. Our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, i.e. without the need to collect, process and transcribe the speech data themselves.



Speech Resource/Database

Speech Resource/Database

Back to Top

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

Marcely Zanon Boito, William Havard, Mahault Garnerin, Éric Le Ferrand and Laurent Besacier

The CMU Wilderness Multilingual Speech Dataset (Black, 2019) is a newly published multilingual speech dataset based on recorded readings of the New Testament.  It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages is not exploited to date.Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances).  The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow researches on speech-to-speech alignment as well as on translation for typologically different language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.


Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems

Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alexander Gutkin, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin and Knot Pipatsrisawat

We present free high quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty two official languages of India spoken by 374 million native speakers. The datasets are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or being used for speaker or language adaptation. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring data for other languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in the progress of speech applications for the languages described and aid corpora development for other, smaller, languages of India and beyond.


Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech

Adriana Guevara-Rukoz, Isin Demirsahin, Fei He, Shan-Hui Cathy Chu, Supheakmungkol Sarin, Knot Pipatsrisawat, Alexander Gutkin, Alena Butryna and Oddur Kjartansson

In this paper we present a multidialectal corpus approach for building a text-to-speech voice for a new dialect in a language with existing resources, focusing on various South American dialects of Spanish. We first present public speech datasets for Argentinian, Chilean, Colombian, Peruvian, Puerto Rican and Venezuelan Spanish specifically constructed with text-to-speech applications in mind using crowd-sourcing. We then compare the monodialectal voices built with minimal data to a multidialectal model built by pooling all the resources from all dialects. Our results show that the multidialectal model outperforms the monodialectal baseline models. We also experiment with a ``zero-resource'' dialect scenario where we build a multidialectal voice for a dialect while holding out target dialect recordings from the training data.


A Manually Annotated Resource for the Investigation of Nasal Grunts

Aurélie Chlébowski and Nicolas Ballier

This paper presents an annotation framework for nasal grunts of the whole French CID corpus (Bertrand et al., 2008). The acoustic components under scrutiny are justified and the annotation guidelines are described. We carefully characterise the acoustic cues and visual cues followed by the annotator, especially for non-modal phonation types. The conventions followed for the annotation of interactional and positional properties of grunts are explained. The resulting datasets after data extraction with Praat scripts (Boersma and Weenink, 2019) are analysed with R (R Core Team, 2017), focusing on duration. We analyse the effect of non-modal phonation (especially ingressive phonation) on duration and discuss a specialisation of grunts observed in the CID for grunts with ingressive phonation. The more general aim of this research is to establish putative core and additive properties of grunts and a tentative typology of grunts in spoken interactions.


The Objective and Subjective Sleepiness Voice Corpora

Vincent P. Martin, Jean-Luc Rouas, Jean-Arthur Micoulaud Franchi and Pierre Philip

Following patients with chronic sleep disorders involves multiple appointments between doctors and patients which often results in episodic follow-ups with unevenly spaced interviews. Speech technologies and virtual doctors can help improve this follow-up. However, there are still some challenges to overcome: sleepiness measurements are diverse and are not always correlated, and most past research focused on detecting  nstantaneous sleepiness levels of healthy sleep-deprived subjects. This article presents a large database to assess the sleepiness level of highly phenotyped patients that complain from excessive daytime sleepiness. Based on the Multiple Sleep Latency Test, it differs from existing databases by multiple aspects. First, it is omposed of recordings from patients suffering from excessive daytime sleepiness instead of sleep deprived healthy subjects. Second, it incites the subjects to sleep contrary to existing stressing sleepiness deprivation experimental paradigms. Third, the sleepiness level of the patients is evaluated with different temporal granularities - long term sleepiness and short term sleepiness - and both objective and subjective sleepiness measures are collected. Finally, it relies on the recordings of 94 highly phenotyped patients, allowing to unravel the influences of different physical factors (age, sex, weight, ... ) on voice.


Open-source Multi-speaker Corpora of the English Accents in the British Isles

Isin Demirsahin, Oddur Kjartansson, Alexander Gutkin and Clara Rivera

This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation, which include the same or similar lines with other existing resources such as the CSTR VCTK corpus and the Speech Accent Archive to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.


TV-AfD: An Imperative-Annotated Corpus from The Big Bang Theory and Wikipedia’s Articles for Deletion Discussions

Yimin Xiao, Zong-Ying Slaton and Lu Xiao

In this study, we created an imperative corpus with speech conversations from dialogues in The Big Bang Theory and with the written comments in Wikipedia’s Articles for Deletion discussions. For the TV show data, 59 episodes containing 25,076 statements are used. We manually annotated imperatives based on the annotation guideline adapted from Condoravdi and Lauer’s study (2012) and used the retrieved data to assess the performance of syntax-based classification rules. For the Wikipedia AfD comments data, we first developed and leveraged a syntax-based classifier to extract 10,624 statements that may be imperative, and we manually examined the statements and then identified true positives. With this corpus, we also examined the performance of the rule-based imperative detection tool. Our result shows different outcomes for speech (dialogue) and written data. The rule-based classification performs better in the written data in precision (0.80) compared to the speech data (0.44). Also, the rule-based classification has a low-performance overall for speech data with the precision of 0.44, recall of 0.41, and f-1 measure of 0.42. This finding implies the syntax-based model may need to be adjusted for a speech dataset because imperatives in oral communication have greater syntactic varieties and are highly context-dependent.


A Large Scale Speech Sentiment Corpus

Eric Chen, Zhiyun Lu, Hao Xu, Liangliang Cao, Yu Zhang and James Fan

We present a multimodal corpus for sentiment analysis based on the existing Switchboard-1 Telephone Speech Corpus released by the Linguistic Data Consortium. This corpus extends the Switchboard-1 Telephone Speech Corpus by adding sentiment labels from 3 different human annotators for every transcript segment. Each sentiment label can be one of three options: positive, negative, and neutral. Annotators are recruited using Google Cloud's data labeling service and the labeling task was conducted over the internet. The corpus contains a total of 49500 labeled speech segments covering 140 hours of audio. To the best of our knowledge, this is the largest multimodal Corpus for sentiment analysis that includes both speech and text features.


SibLing Corpus of Russian Dialogue Speech Designed for Research on Speech Entrainment

Tatiana Kachkovskaia, Tatiana Chukaeva, Vera Evdokimova, Pavel Kholiavin, Natalia Kriakina, Daniil Kocharov, Anna Mamushina, Alla Menshikova and Svetlana Zimina

The paper presents a new corpus of dialogue speech designed specifically for research in the field of speech entrainment. Given that the degree of accommodation may depend on a number of social factors, the corpus is designed to encompass 5 types of relations between the interlocutors: those between siblings, close friends, strangers of the same gender, strangers of the other gender, strangers of which one has a higher job position and greater age. Another critical decision taken in this corpus is that in all these social settings one speaker is kept the same. This allows us to trace the changes in his/her speech depending on the interlocutor. The basic set of speakers consists of 10 pairs of same-gender siblings (including 4 pairs of identical twins) aged 23-40, and each of them was recorded in the 5 settings mentioned above. In total we obtained 90 dialogues of 25-60 minutes each. The speakers played a card game and a map game; they were recorded in a soundproof studio without being able to see each other due to a non-transparent screen between them. The corpus contains orthographic, phonetic and prosodic annotation and is segmented into turns and inter-pausal units.


PhonBank and Data Sharing: Recent Developments in European Portuguese

Ana Margarida Ramalho, Maria João Freitas and Yvan Rose

This paper presents the recently published RAMALHO-EP and PHONODIS  corpora.  Both  include  European  Portuguese  production  data from Portuguese children with typical (RAMALHO-EP) and protracted (PHONODIS) phonological development. The data in the two  corpora  were  collected  using  the  phonological  assessment  tool  CLCP-EP,  developed  in  the  context  of  the  Crosslinguistic  Child  Phonology  Project,  coordinated  by  Barbara  Bernhardt  and  Joe  Stemberger  (University  of  British  Columbia  (UBC),  Canada).    Both  corpora  are  part  of  the  PhonBank  Project  (Brian  MacWhinney  (Carnegie  Mellon,  USA)  and  Yvan  Rose  (Memorial  University  of  Newfoundland,  Canada),  which  is  the  child  phonology  component  of  TalkBank,  coordinated  by  Brian  MacWhinney.  The  data  at  PhonBank  is  edited  in  Phon  format,  a  language  tool  designed  and  built  by  Yvan  Rose  and  Greg  Hedlund  (Memorial  University  of  Newfoundland) and widely used by researchers working in the field of phonological acquisition. RAMALHO-EP contains production data  from  87  typically  developing  children,  aged  2;11  to  6;04,  all  monolinguals.  PHONODIS  includes  production data  from  22  children  diagnosed  with  different  types  of  speech  and  language  disorders,  all  EP  monolinguals,  aged  3;2  to  11,05.  Both  corpora  are  open  access  language  resources  and  contribute  to  enlarge  the  amount  of  production  data  on  the  acquisition  of  European  Portuguese  available in PhonBank.


SMASH Corpus: A Spontaneous Speech Corpus Recording Third-person Audio Commentaries on Gameplay

Yuki Saito, Shinnosuke Takamichi and Hiroshi Saruwatari

Developing a spontaneous speech corpus would be beneficial for spoken language processing and understanding. We present a speech corpus named the SMASH corpus, which includes spontaneous speech of two Japanese male commentators that made third-person audio commentaries during the gameplay of a fighting game. Each commentator ad-libbed while watching the gameplay with various topics covering not only explanations of each moment to convey the information on the fight but also comments to entertain listeners. We made transcriptions and topic tags as annotations on the recorded commentaries with our two-step method. We first made automatic and manual transcriptions of the commentaries and then manually annotated the topic tags. This paper describes how we constructed the SMASH corpus and reports some results of the annotations.


Improving Speech Recognition for the Elderly: A New Corpus of Elderly Japanese Speech and Investigation of Acoustic Modeling for Speech Recognition

Meiko Fukuda, Hiromitsu Nishizaki, Yurie Iribe, Ryota Nishimura and Norihide Kitaoka

In an aging society like Japan, a highly accurate speech recognition system is needed for use in electronic devices for the elderly, but this level of accuracy cannot be obtained using conventional speech recognition systems due to the unique features of the speech of elderly people. S-JNAS, a corpus of elderly Japanese speech, is widely used for acoustic modeling in Japan, but the average age of its speakers is 67.6 years old. Since average life expectancy in Japan is now 84.2 years, we are constructing a new speech corpus, which currently consists of the utterances of 221 speakers with an average age of 79.2, collected from four regions of Japan. In addition, we expand on our previous study (Fukuda, 2019) by further investigating the construction of acoustic models suitable for elderly speech. We create new acoustic models and train them using a combination of existing Japanese speech corpora (JNAS, S-JNAS, CSJ), with and without our ‘super-elderly’ speech data, and conduct speech recognition experiments. Our new acoustic models achieve word error rates (WER) as low as 13.38%, exceeding the results of our previous study in which we used the CSJ acoustic model adapted for elderly speech (17.4% WER).


Preparation of Bangla Speech Corpus from Publicly Available Audio & Text

Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan and Mohammad Zuberul Islam

Automatic speech recognition systems require large annotated speech corpus. The manual annotation of a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus for Bangladeshi Bangla. We have used publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a huge amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We have leveraged speaker diarization, gender detection, etc. to prepare the annotated corpus. We also have prepared a synthetic speech corpus for handling out-of-vocabulary word problems in Bangla language. Our corpus is suitable for training with Kaldi. Experimental results show that the use of our corpus in addition to the Google Speech corpus (229 hours) significantly improves the performance of the ASR system.


On Construction of the ASR-oriented Indian English Pronunciation Dictionary

Xian Huang, Xin Jin, Qike Li and Keliang Zhang

As a World English, a New English and a regional variety of English, Indian English (IE) has developed its own distinctive characteristics, especially phonologically, from other varieties of English. An Automatic Speech Recognition (ASR) system simply trained on British English (BE) /American English (AE) speech data and using the BE/AE pronunciation dictionary performs much worse when applied to IE. An applicable IEASR system needs spontaneous IE speech as training materials and a comprehensive, linguistically-guided IE pronunciation dictionary (IEPD) so as to achieve the effective mapping between the acoustic model and language model. This research builds a small IE spontaneous speech corpus, analyzes and summarizes the phonological variation features of IE, comes up with an IE phoneme set and complies the IEPD (including a common-English-word list, an Indian-word list, an acronym list and an affix list). Finally, two ASR systems are trained with 120 hours IE spontaneous speech data, using the IEPD we construct in this study and CMUdict separately. The two systems are tested with 50 audio clips of IE spontaneous speech. The result shows the system trained with IEPD performs better than the one trained with CMUdict with WER being 15.63% lower on the test data.


Gender Representation in Open Source Speech Resources

Mahault Garnerin, Solange Rossato and Laurent Besacier

With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics, transparency and fairness of AI systems has become a central concern within the research community. We address transparency and fairness in spoken language systems by proposing a study about  gender representation in speech resources available through the Open Speech and Language Resource platform. We show that finding gender information in open source corpora is not straightforward and that gender balance depends on other corpus characteristics (elicited/non elicited speech, low/high resource language, speech task targeted). The paper ends with recommendations about metadata and gender information for researchers in order to assure better transparency of the speech systems built using such corpora.


RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition

Alexandru-Lucian Georgescu, Horia Cucu, Andi Buzo and Corneliu Burileanu

Although many efforts have been made in the last decade to enhance the speech and language resources for Romanian, this language is still considered under-resourced. While for many other languages there are large speech corpora available for research and commercial applications, for Romanian language the largest publicly available corpus to date comprises less than 50 hours of speech. In this context, Speech and Dialogue research group releases Read Speech Corpus (RSC) – a Romanian speech corpus developed in-house, comprising 100 hours of speech recordings from 164 different speakers. The paper describes the development of the corpus and presents baseline automatic speech recognition (ASR) results using state-of-the-art ASR technology: Kaldi speech recognition toolkit.


FAB: The French Absolute Beginner Corpus for Pronunciation Training

Sean Robertson, Cosmin Munteanu and Gerald Penn

We introduce the French Absolute Beginner (FAB) speech corpus. The corpus is intended for the development and study of Computer-Assisted Pronunciation Training (CAPT) tools for absolute beginner learners. Data were recorded during two experiments focusing on using a CAPT system in paired role-play tasks. The setting grants FAB three distinguishing features from other non-native corpora: the experimental setting is ecologically valid, closing the gap between training and deployment; it features a label set based on teacher feedback, allowing for context-sensitive CAPT; and data have been primarily collected from absolute beginners, a group often ignored. Participants did not read prompts, but instead recalled and modified dialogues that were modelled in videos. Unable to distinguish modelled words solely from viewing videos, speakers often uttered unintelligible or out-of-L2 words. The corpus is split into three partitions: one from an experiment with minimal feedback; another with explicit, word-level feedback; and a third with supplementary read-and-record data. A subset of words in the first partition has been labelled as more or less native, with inter-annotator agreement reported. In the explicit feedback partition, labels are derived from the experiment's online feedback. The FAB corpus is scheduled to be made freely available by the end of 2020.


Call My Net 2: A New Resource for Speaker Recognition

Karen Jones, Stephanie Strassel, Kevin Walker and Jonathan Wright

We introduce the Call My Net 2 (CMN2) Corpus, a new resource for speaker recognition featuring Tunisian Arabic conversations between friends and family, incorporating both traditional telephony and VoIP data. The corpus contains data from over 400 Tunisian Arabic speakers collected via a custom-built platform deployed in Tunis, with each speaker making 10 or more calls each lasting up to 10 minutes. Calls include speech in various realistic and natural acoustic settings, both noisy and non-noisy. Speakers used a variety of handsets, including landline and mobile devices, and made VoIP calls from tablets or computers. All calls were subject to a series of manual and automatic quality checks, including speech duration, audio quality, language identity and speaker identity. The CMN2 corpus has been used in two NIST Speaker Recognition Evaluations (SRE18 and SRE19), and the SRE test sets as well as the full CMN2 corpus will be published in the Linguistic Data Consortium Catalog. We describe CMN2 corpus requirements, the telephone collection platform, and procedures for call collection. We review properties of the CMN2 dataset and discuss features of the corpus that distinguish it from prior SRE collection efforts, including some of the technical challenges encountered with collecting VoIP data.


DaCToR: A Data Collection Tool for the RELATER Project

Juan Hussain, Oussama Zenkri, Sebastian Stüker and Alex Waibel

Collecting domain-specific data for under-resourced languages, e.g., dialects of languages, can be very expensive, potentially financially prohibitive and taking long time. Moreover, in the case of rarely written languages, the normalization of non-canonical transcription might be another time consuming but necessary task. In order to collect domain-specific data in such circumstances in a time and cost-efficient way, collecting read data of pre-prepared texts is often a viable option. In order to collect data in the domain of psychiatric diagnosis in Arabic dialects for the project RELATER, we have prepared the data collection tool DaCToR for collecting read texts by speakers in the respective countries and districts in which the dialects are spoken. In this paper we describe our tool, its purpose within the project RELATER and the dialects which we have started to collect with the tool.


Development and Evaluation of Speech Synthesis Corpora for Latvian

Roberts Darģis, Peteris Paikens, Normunds Gruzitis, Ilze Auzina and Agate Akmane

Text to speech (TTS) systems are necessary for all languages to ensure accessibility and availability of digital language services. Recent advances in neural speech synthesis have eText to speech (TTS) systems are necessary for any language to ensure accessibility and availability of digital language services. Recent advances in neural speech synthesis have enabled the development of such systems with a data-driven approach that does not require significant development of language-specific tools. However, smaller languages often lack speech corpora that would be sufficient for training current neural TTS models, which require at least 30 hours of good quality audio recordings from a single speaker in a noiseless environment with matching transcriptions. Making such a corpus manually can be cost prohibitive. This paper presents an unsupervised approach to obtain a suitable corpus from unannotated recordings using automated speech recognition for transcription, as well as automated speaker segmentation and identification. The proposed method and software tools are applied and evaluated on a case study for developing a corpus suitable for Latvian speech synthesis based on Latvian public radio archive data.nabled the development of such systems with a data-driven approach that does not require much language-specific tool development. However, smaller languages often lack speech corpora that would be sufficient for training current neural TTS models, which require approximately 30 hours of good quality audio recordings from a single speaker in a noiseless environment with matching transcriptions. Making such a corpus manually can be cost prohibitive. This paper presents an unsupervised approach to obtain a suitable corpus from unannotated recordings using automated speech recognition for transcription, as well as automated speaker segmentation and identification. The proposed methods and software tools are applied and evaluated on a case study for developing a corpus suitable for Latvian speech synthesis based on Latvian public radio archive data.

Back to Top