Dataset Information

Directions in abusive language training data, a systematic review: Garbage in, garbage out.

ABSTRACT: Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of a dedicated website for cataloguing abusive language data, hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits, it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.

SUBMITTER: Vidgen B

PROVIDER: S-EPMC7769249 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

Publications

Directions in abusive language training data, a systematic review: Garbage in, garbage out.

Bertie Vidgen, Leon Derczynski

PLoS ONE, 28 December 2020, issue 12

PMID: 33370298

Similar Datasets

Project description: Introduction: The identification of chemical compounds that interfere with SARS-CoV-2 replication continues to be a priority in several academic and pharmaceutical laboratories. Computational tools and approaches have the power to integrate, process and analyze multiple data in a short time. However, these initiatives may yield unrealistic results if the applied models are not inferred from reliable data and the resulting predictions are not confirmed by experimental evidence. Methods: We undertook a drug discovery campaign against the essential major protease (MPro) from SARS-CoV-2, which relied on an in silico search strategy (performed in a large and diverse chemolibrary) complemented by experimental validation. The computational method comprises a recently reported ligand-based approach developed upon refinement/learning cycles, and structure-based approximations. Search models were applied to both retrospective (in silico) and prospective (experimentally confirmed) screening. Results: The first generation of ligand-based models was fed by data which, to a great extent, had not been published in peer-reviewed articles. The first screening campaign performed with 188 compounds (46 in silico hits and 100 analogues, and 40 unrelated compounds: flavonols and pyrazoles) yielded three hits against MPro (IC50 ≤ 25 μM): two analogues of in silico hits (one glycoside and one benzo-thiazol) and one flavonol. A second generation of ligand-based models was developed based on this negative information and newly published peer-reviewed data for MPro inhibitors. This led to 43 new hit candidates belonging to different chemical families. From 45 compounds (28 in silico hits and 17 related analogues) tested in the second screening campaign, eight inhibited MPro with IC50 = 0.12-20 μM and five of them also impaired the proliferation of SARS-CoV-2 in Vero cells (EC50 7-45 μM).
Discussion: Our study provides an example of a virtuous loop between computational and experimental approaches applied to target-focused drug discovery against a major and global pathogen, reaffirming the well-known "garbage in, garbage out" machine learning principle.

Project description: Background: Social media have transformed the communications landscape. People increasingly obtain news and health information online and via social media. Social media platforms also serve as novel sources of rich observational data for health research (including infodemiology, infoveillance, and digital disease detection). While the number of studies using social data is growing rapidly, very few of these studies transparently outline their methods for collecting, filtering, and reporting those data. Keywords and search filters applied to social data form the lens through which researchers may observe what and how people communicate about a given topic. Without a properly focused lens, research conclusions may be biased or misleading. Standards of reporting data sources and quality are needed so that data scientists and consumers of social media research can evaluate and compare methods and findings across studies. Objective: We aimed to develop and apply a framework of social media data collection and quality assessment and to propose a reporting standard, which researchers and reviewers may use to evaluate and compare the quality of social data across studies. Methods: We propose a conceptual framework consisting of three major steps in collecting social media data: develop, apply, and validate search filters. This framework is based on two criteria: retrieval precision (how much of retrieved data is relevant) and retrieval recall (how much of the relevant data is retrieved). We then discuss two conditions that the estimation of retrieval precision and recall relies on (accurate human coding and full data collection) and how to calculate these statistics in cases that deviate from the two ideal conditions.
We then apply the framework on a real-world example using approximately 4 million tobacco-related tweets collected from the Twitter firehose. Results: We developed and applied a search filter to retrieve e-cigarette-related tweets from the archive based on three keyword categories: devices, brands, and behavior. The search filter retrieved 82,205 e-cigarette-related tweets from the archive and was validated. Retrieval precision was calculated above 95% in all cases. Retrieval recall was 86% assuming ideal conditions (no human coding errors and full data collection), 75% when unretrieved messages could not be archived, 86% assuming no false negative errors by coders, and 93% allowing both false negative and false positive errors by human coders. Conclusions: This paper sets forth a conceptual framework for the filtering and quality evaluation of social data that addresses several common challenges and moves toward establishing a standard of reporting social data. Researchers should clearly delineate data sources, how data were accessed and collected, and the search filter building process and how retrieval precision and recall were calculated. The proposed framework can be adapted to other public social media platforms.
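
The two criteria named in the description above, retrieval precision and retrieval recall, can be illustrated with a minimal sketch. It assumes the ideal conditions the abstract mentions (no human coding errors and full data collection); the function names and the counts are hypothetical and are not taken from the study.

```python
def retrieval_precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Share of the retrieved items that are relevant."""
    if total_retrieved <= 0:
        raise ValueError("total_retrieved must be positive")
    return relevant_retrieved / total_retrieved


def retrieval_recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Share of all relevant items that were retrieved."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return relevant_retrieved / total_relevant


# Hypothetical counts for illustration only (not figures from the study):
# a filter retrieves 100 items, 90 of them relevant, while the full
# archive contains 120 relevant items.
precision = retrieval_precision(relevant_retrieved=90, total_retrieved=100)
recall = retrieval_recall(relevant_retrieved=90, total_relevant=120)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

When coders make errors or the archive is incomplete, these simple ratios no longer apply directly; the adjustments the authors use for those cases are not reproduced here.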

Project description: Background: In recent years, health data collected during the clinical care process have been often repurposed for secondary use through clinical data warehouses (CDWs), which interconnect disparate data from different sources. A large amount of information of high clinical value is stored in unstructured text format. Natural language processing (NLP), which implements algorithms that can operate on massive unstructured textual data, has the potential to structure the data and make clinical information more accessible. Objective: The aim of this review was to provide an overview of studies applying NLP to textual data from CDWs. It focuses on identifying the (1) NLP tasks applied to data from CDWs and (2) NLP methods used to tackle these tasks. Methods: This review was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We searched for relevant articles in 3 bibliographic databases: PubMed, Google Scholar, and ACL Anthology. We reviewed the titles and abstracts and included articles according to the following inclusion criteria: (1) focus on NLP applied to textual data from CDWs, (2) articles published between 1995 and 2021, and (3) written in English. Results: We identified 1353 articles, of which 194 (14.34%) met the inclusion criteria. Among all identified NLP tasks in the included papers, information extraction from clinical text (112/194, 57.7%) and the identification of patients (51/194, 26.3%) were the most frequent tasks. To address the various tasks, symbolic methods were the most common NLP methods (124/232, 53.4%), showing that some tasks can be partially achieved with classical NLP techniques, such as regular expressions or pattern matching that exploit specialized lexica, such as drug lists and terminologies. Machine learning (70/232, 30.2%) and deep learning (38/232, 16.4%) have been increasingly used in recent years, including the most recent approaches based on transformers.
NLP methods were mostly applied to English language data (153/194, 78.9%). Conclusions: CDWs are central to the secondary use of clinical texts for research purposes. Although the use of NLP on data from CDWs is growing, there remain challenges in this field, especially with regard to languages other than English. Clinical NLP is an effective strategy for accessing, extracting, and transforming data from CDWs. Information retrieved with NLP can assist in clinical research and have an impact on clinical practice.

Project description: Importance: Training parents to implement strategies to support child language development is crucial to support long-term outcomes, given that as many as 2 of 5 children younger than 5 years have difficulty learning language. Objective: To examine the association between parent training and language and communication outcomes in young children. Data sources: Searches of ERIC, Academic Search Complete, PsycINFO, and PsycARTICLES were conducted on August 11, 2014; August 18, 2016; January 23, 2018; and October 30, 2018. Study selection: Studies included in this review and meta-analysis were randomized or nonrandomized clinical trials that evaluated a language intervention that included parent training with children with a mean age of less than 6 years. Studies were excluded if the parent was not the primary implementer of the intervention, the study included fewer than 10 participants, or the study did not report outcomes related to language or communication. Data extraction and synthesis: Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines were applied to a total of 31 778 articles identified for screening, with the full text of 723 articles reviewed and 76 total studies ultimately included. Main outcomes and measures: Main outcomes included language and communication skills in children with primary or secondary language impairment and children at risk for language impairment. Results: This meta-analysis included 59 randomized clinical trials and 17 nonrandomized clinical trials including 5848 total participants (36.4 female [20.8%]; mean [SD] age, 3.5 [3.9] years). The intervention approach in 63 studies was a naturalistic teaching approach, and 16 studies used a primarily dialogic reading approach. There was a significant moderate association between parent training and child communication, engagement, and language outcomes (mean [SE] Hedges g, 0.33 [0.06]; P < .001).
The association between parent training and parent use of language support strategies was large (mean [SE] Hedges g, 0.55 [0.11], P < .001). Children with developmental language disorder had the largest social communication outcomes (mean [SE] Hedges g, 0.37 [0.17]); large and significant associations were observed for receptive (mean [SE] Hedges g, 0.92 [0.30]) and expressive language (mean [SE] Hedges g, 0.83 [0.20]). Children at risk for language impairments had moderate effect sizes across receptive language (mean [SE] Hedges g, 0.28 [0.15]) and engagement outcomes (mean [SE] Hedges g, 0.36 [0.17]). Conclusions and relevance: The findings suggest that training parents to implement language and communication intervention techniques is associated with improved outcomes for children and increased parent use of support strategies. These findings may have direct implications on intervention and prevention.
