Dataset Information

ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment.

ABSTRACT: Toxic language on social media is a newly emerging virtual disorder of human society. Detecting toxic language is an NLP task that requires a dataset of utterances [1]. For the Bangla language, very few datasets have been developed on toxicity or similar concepts [2]. This dataset was developed from user-generated content on Facebook and is intended to cover the demographic and thematic distribution of Bangla toxic language generated on the web. In total, 2207590 comments were collected and annotated, from which about 1959 unique bigrams were extracted as utterances and treated as the base entries of a toxic language dataset. The core derivatives of the dataset are bigram-based wordlists, which were annotated inductively and divided into 8 thematic classes that give some idea of the variation in toxicity found in the Bengali community. These thematic classes predominantly cover political hate speech [3] and misogynistic bullying. The thematic labels can serve as class labels for text classification through machine learning. In addition to the thematic classification labels, the dataset includes further features such as approximate meanings in English, IPA transliteration, real occurrences in the source pages, spelling standards, and degree of toxicity. Because it is a dataset of utterances, its entries are de-identified and anonymous, and it poses no difficulty for public disclosure. We therefore regard this dataset as a toxic lexicon (ToxLex): an exhaustive wordlist that is essentially a curated, value-added, and analyzed dataset that can be used as classifier material to detect toxicity in social media.
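The abstract describes deriving unique bigrams from a collection of comments. The following is a minimal sketch of that idea only, not the authors' actual pipeline: it assumes whitespace tokenization, and the function names `extract_bigrams` and `unique_bigram_counts` as well as the placeholder comments are hypothetical.

```python
from collections import Counter


def extract_bigrams(comment):
    """Return adjacent token pairs from a whitespace-tokenized comment.

    Whitespace tokenization is an assumption made for illustration;
    real Bangla text would likely need a dedicated tokenizer.
    """
    tokens = comment.split()
    return list(zip(tokens, tokens[1:]))


def unique_bigram_counts(comments):
    """Count how often each distinct bigram occurs across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(extract_bigrams(comment))
    return counts


# Placeholder comments (neutral tokens, not taken from the dataset).
comments = ["a b c", "b c d"]
counts = unique_bigram_counts(comments)

# The keys of `counts` are the unique bigrams; the values are occurrence counts.
unique_bigrams = sorted(counts)
```

In a sketch like this, the keys of the counter would correspond to candidate base entries, and the counts to the "real occurrences" feature mentioned in the abstract; annotation into thematic classes would remain a separate manual step.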

SUBMITTER: Rashid MMO

PROVIDER: S-EPMC9256543 | biostudies-literature | 2022 Aug

REPOSITORIES: biostudies-literature

Publications

ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment.

Rashid Mohammad Mamun Or MMO

Data in Brief, 2022-06-24

PMID: 35811647

Similar Datasets


BTSD: A curated transformation of sentence dataset for text classification in Bangla language.
| S-EPMC10415831 | biostudies-literature

BDSL 49: A comprehensive dataset of Bangla sign language.
| S-EPMC10331282 | biostudies-literature

BanglaSER: A speech emotion recognition dataset for the Bangla language.
| S-EPMC8980634 | biostudies-literature

BanglaWriting: A multi-purpose offline Bangla handwriting dataset.
| S-EPMC7744928 | biostudies-literature

BaitBuster-Bangla: A comprehensive dataset for clickbait detection in Bangla with multi-feature and multi-modal analysis
| S-EPMC10912596 | biostudies-literature

Curated dataset of asphaltene structures.
| S-EPMC10716757 | biostudies-literature

BAAD: A multipurpose dataset for automatic Bangla offensive speech recognition.
| S-EPMC10070523 | biostudies-literature

BdSL47: A complete depth-based Bangla sign alphabet and digit dataset
| S-EPMC10700367 | biostudies-literature

A pooled treatment-curated breast cancer gene-expression dataset
2024-03-09 | GSE205568 | GEO

BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters.
| S-EPMC5382023 | biostudies-literature