Dataset Information

A comprehensive study on bilingual and multilingual speech emotion recognition using a two-pass classification scheme.

ABSTRACT: Emotion recognition plays an important role in human-computer interaction. Many past and ongoing studies have focused on speech emotion recognition using a variety of classifiers and feature extraction methods. The majority of these studies, however, consider emotions solely from the perspective of a single language. In contrast, the current study extends monolingual speech emotion recognition to cover emotions expressed in several languages and recognized simultaneously by a single, complete system. To address this issue, a method for bilingual speech emotion recognition is proposed and evaluated. The proposed method is based on a two-pass classification scheme consisting of spoken language identification and speech emotion recognition. In the first pass, the language spoken is identified; in the second pass, emotion recognition is conducted using the emotion models of the identified language. Based on deep learning and the i-vector paradigm, bilingual emotion recognition experiments were conducted using the English IEMOCAP (four emotions) and German FAU Aibo (five emotions) corpora. Two classifiers, both using i-vector features, were compared: fully connected deep neural networks (DNN) and convolutional neural networks (CNN). With DNN, unweighted average recalls (UARs) of 64.0% and 61.14% were obtained on the IEMOCAP and FAU Aibo corpora, respectively. With CNN, UARs of 62.0% and 59.8% were achieved on the IEMOCAP and FAU Aibo corpora, respectively. These results are very promising and superior to those obtained in similar studies on multilingual or even monolingual speech emotion recognition.
Furthermore, an additional baseline approach for bilingual speech emotion recognition was implemented and evaluated. In the baseline approach, six common emotions were considered, and bilingual emotion models were trained on data from the two languages. In this case, UARs of 51.2% and 51.5% for six emotions were obtained using DNN and CNN, respectively. The baseline results were reasonable and promising, showing the effectiveness of i-vectors and deep learning in bilingual speech emotion recognition. The proposed two-pass method based on language identification, however, showed significantly superior performance. The study was also extended to multilingual speech emotion recognition using corpora collected under similar conditions. Specifically, the English IEMOCAP, the German Emo-DB, and a Japanese corpus were used to recognize four emotions with the proposed two-pass method. The results obtained were very promising, and the differences in UAR compared to the monolingual classifiers were not statistically significant.
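The two-pass scheme and the UAR metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the language identifier and the per-language emotion models below are hypothetical toy functions standing in for the i-vector based DNN/CNN classifiers, and the feature vectors and emotion labels are invented for illustration. UAR is computed as the mean of the per-class recalls, so every class counts equally regardless of how many samples it has.

```python
from collections import defaultdict


def two_pass_predict(features, identify_language, emotion_models):
    """Pass 1: identify the spoken language.
    Pass 2: classify the emotion with that language's emotion model."""
    language = identify_language(features)
    emotion = emotion_models[language](features)
    return language, emotion


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class has equal weight."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical stand-ins for trained models (assumptions, not from the paper).
def toy_language_id(features):
    return "en" if features[0] >= 0 else "de"


toy_emotion_models = {
    "en": lambda f: "angry" if f[1] > 0.5 else "neutral",
    "de": lambda f: "anger" if f[1] > 0.5 else "neutral",
}

# Invented two-dimensional feature vectors, one per utterance.
samples = [[1.0, 0.9], [1.0, 0.1], [-1.0, 0.8], [-1.0, 0.2]]
predictions = [
    two_pass_predict(s, toy_language_id, toy_emotion_models) for s in samples
]

# UAR on an imbalanced toy label set: recall is 2/3 for "a" and 1/1 for "b".
uar = unweighted_average_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

In the toy label set, plain accuracy is 3/4 while UAR is 5/6, which shows why UAR is the usual choice when emotion classes are unevenly represented. In this sketch an error in the first pass routes the utterance to the wrong language's emotion models, so the second pass depends on the language identification result.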

SUBMITTER: Heracleous P

PROVIDER: S-EPMC6695118 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature


Publications

A comprehensive study on bilingual and multilingual speech emotion recognition using a two-pass classification scheme.

Heracleous Panikos, Yoneyama Akio

PLoS One, 2019-08-15, issue 8


PMID: 31415592

Similar Datasets

Project description: Speech Emotion Recognition (SER) identifies and categorizes emotional states by analyzing speech signals. SER is an emerging research area that uses machine learning and deep learning techniques, driven by its socio-cultural and business importance. An appropriate dataset is an important resource for SER studies in a particular language. SER datasets in Bangla are scarce, although it is one of the most widely spoken languages in the world. The few existing Bangla SER datasets consist of only a few dialogs spoken by a small number of actors, making them unsuitable for real-world applications. Moreover, the existing datasets do not consider the intensity level of emotions. The intensity of a specific emotional expression, such as anger or sadness, plays a crucial role in social behavior. Therefore, a realistic Bangla speech dataset, called the KUET Bangla Emotional Speech (KBES) dataset, is developed in this study. The dataset consists of 900 audio signals (i.e., speech dialogs) from 35 actors (20 females and 15 males) with diverse age ranges. The speech dialogs are taken from Bangla telefilms, dramas, TV series, and web series. There are five emotional categories: Neutral, Happy, Sad, Angry, and Disgust. Except for Neutral, samples of each emotion are divided into two intensity levels: Low and High. The distinguishing feature of the dataset is that the speech dialogs are almost all unique and come from a relatively large number of actors, whereas existing datasets (such as SUBESCO and BanglaSER) contain a few pre-defined dialogs spoken repeatedly by a few actors or research volunteers in a laboratory environment. Finally, the KBES dataset is posed as a nine-class problem that classifies emotions into nine categories: Neutral, Happy (Low), Happy (High), Sad (Low), Sad (High), Angry (Low), Angry (High), Disgust (Low) and Disgust (High). The dataset is kept balanced, with 100 samples for each of the nine classes; each set of 100 samples is also gender balanced, with 50 samples from male actors and 50 from female actors. The developed dataset appears more realistic than the existing SER datasets.
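The class layout described above can be checked with a short Python sketch. It only restates the figures given in the description (five emotions, two intensity levels for every emotion except Neutral, 100 samples per class, 50 per gender); the variable names are illustrative and do not come from the KBES release.

```python
from itertools import product

# Emotions that carry an intensity level; Neutral has none.
BASE_EMOTIONS = ["Happy", "Sad", "Angry", "Disgust"]
INTENSITIES = ["Low", "High"]

# Nine classes: Neutral plus every (emotion, intensity) combination.
CLASSES = ["Neutral"] + [
    f"{emotion} ({level})" for emotion, level in product(BASE_EMOTIONS, INTENSITIES)
]

SAMPLES_PER_CLASS = 100
SAMPLES_PER_GENDER_PER_CLASS = 50

total_samples = len(CLASSES) * SAMPLES_PER_CLASS
samples_per_gender = len(CLASSES) * SAMPLES_PER_GENDER_PER_CLASS
```

The nine classes of 100 samples each account for the 900 audio signals stated in the description, split evenly into 450 samples per gender.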

