
Dataset Information


Multi-label emotion classification of Urdu tweets.


ABSTRACT: Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers: machine learning algorithms (random forest (RF), decision tree (J48), sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep learning algorithms (a 1D convolutional neural network (1D-CNN), long short-term memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, the dataset characteristics, and insights into the different methodologies used for Urdu-based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
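
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `multilabel_metrics`, the binary indicator layout (one row per tweet, one column per emotion), and the toy data are assumptions made for illustration. Accuracy is left out because its multi-label definition varies between papers and the abstract does not say which one was used.

```python
from typing import Dict, List


def multilabel_metrics(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Compute common multi-label scores from 0/1 indicator rows.

    Hypothetical helper for illustration; each row is one tweet and
    each column is one emotion label.
    """
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    tp = [0] * n_labels  # true positives per label
    fp = [0] * n_labels  # false positives per label
    fn = [0] * n_labels  # false negatives per label
    mismatches = 0       # label slots where prediction and truth differ
    exact = 0            # samples whose full label set is predicted correctly

    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            exact += 1
        for j in range(n_labels):
            if truth[j] != pred[j]:
                mismatches += 1
            if pred[j] == 1 and truth[j] == 1:
                tp[j] += 1
            elif pred[j] == 1:
                fp[j] += 1
            elif truth[j] == 1:
                fn[j] += 1

    def f1(t: int, p: int, n: int) -> float:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    return {
        # Micro: pool the counts over all labels, then compute one F1.
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        # Macro: compute F1 per label, then take the unweighted mean.
        "macro_f1": sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels,
        # Hamming loss: fraction of individual label slots predicted wrongly.
        "hamming_loss": mismatches / (n_samples * n_labels),
        # Exact match: fraction of samples with every label correct.
        "exact_match": exact / n_samples,
    }


# Toy data with three labels instead of the dataset's six, to keep it short.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
scores = multilabel_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Micro-averaging weights every label decision equally, so frequent emotions dominate the score, while macro-averaging gives each emotion the same weight regardless of how often it occurs. Exact match is the strictest of the four, since a tweet counts as correct only when all of its labels are right.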

SUBMITTER: Ashraf N 

PROVIDER: S-EPMC9044368 | biostudies-literature | 2022

REPOSITORIES: biostudies-literature


Publications

Multi-label emotion classification of Urdu tweets.

Noman Ashraf, Lal Khan, Sabur Butt, Hsien-Tsung Chang, Grigori Sidorov, Alexander Gelbukh

PeerJ Computer Science, 2022-04-22



Similar Datasets

| S-EPMC11411592 | biostudies-literature
| S-EPMC9138108 | biostudies-literature
| S-EPMC7924502 | biostudies-literature
| S-EPMC7924696 | biostudies-literature
| S-EPMC8337005 | biostudies-literature
| S-EPMC11441674 | biostudies-literature
| S-EPMC8627225 | biostudies-literature
| S-EPMC3864256 | biostudies-literature
| S-EPMC9381526 | biostudies-literature
| S-EPMC7529316 | biostudies-literature