Dataset Information

CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French.

ABSTRACT: Modeling multimodal language is a core research area in natural language processing. While languages such as English have relatively large multimodal language resources, other widely spoken languages across the globe have few or no large-scale datasets in this area. This disproportionately affects native speakers of languages other than English. As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is the largest of its kind with 40,000 total labelled sentences. It covers a diverse set of topics and speakers, and carries supervision of 20 labels including sentiment (and subjectivity), emotions, and attributes. Our evaluations on a state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further research for multilingual studies in multimodal language.

SUBMITTER: Zadeh A

PROVIDER: S-EPMC8106386 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature

Publications

CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French.

Zadeh Amir A, Cao Yan Sheng YS, Hessner Simon S, Liang Paul Pu PP, Poria Soujanya S, Morency Louis-Philippe LP

Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020 Nov 1

PMID: 33969362

Similar Datasets

PROTOCOL: Effectiveness of Educational Programmes to Prevent and Counter Online Violent Extremist Propaganda in English, French, Spanish, Portuguese, German and Scandinavian Language Studies: A Systematic Review.
| S-EPMC12004397 | biostudies-literature

Benford’s Law applies to word frequency rank in English, German, French, Spanish, and Italian
| S-EPMC10501622 | biostudies-literature

A 204-subject multimodal neuroimaging dataset to study language processing.
| S-EPMC6472396 | biostudies-literature

A synchronized multimodal neuroimaging dataset for studying brain language processing.
| S-EPMC9525723 | biostudies-literature

Free Association Database for a 62-Word Dataset Including Emotion and Colour Terms in English, Estonian, French, German, Italian, Lithuanian, and Spanish: Data from 14 Countries.
| S-EPMC12330808 | biostudies-literature

Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers.
| S-EPMC8486854 | biostudies-literature

Job satisfaction of Spanish and Portuguese optometrists.
| S-EPMC10796967 | biostudies-literature

VLibrasBD: A Brazilian Portuguese–Brazilian sign language (Libras) bilingual text dataset designed to support neural machine translation
| S-EPMC12341731 | biostudies-literature

Lexical simplification benchmarks for English, Portuguese, and Spanish.
| S-EPMC9536312 | biostudies-literature

Mammography reporting dataset with BI-RADS system for natural language processing applications: Addressing public data gaps in Spanish
| S-EPMC12221635 | biostudies-literature