Dataset Information

Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter.

ABSTRACT:

BACKGROUND: Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited.

OBJECTIVE: The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis.

METHODS: To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user's child has a birth defect, and (ii) accessibility to the user's tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter.

RESULTS: We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95.

CONCLUSIONS: Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
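The retrieval step described in METHODS (a lexicon, lexical variants generated from its entries, and regular expressions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three lexicon entries, the variant rules (spaced, hyphenated, or joined words plus an optional plural "s"), and the function names are all assumptions chosen for illustration, and the study's post-processing and manual analysis steps are not modeled.

```python
import re

# Hypothetical mini-lexicon; the study's actual lexicon is not reproduced here.
LEXICON = ["cleft palate", "spina bifida", "congenital heart defect"]


def variant_pattern(term):
    """Turn one lexicon entry into a regex fragment covering simple variants.

    Assumed variant rules: the words of the entry may be separated by
    whitespace, hyphens, or nothing, and the entry may end in a plural "s".
    """
    words = [re.escape(w) for w in term.split()]
    return r"[\s-]*".join(words) + r"s?"


def build_pattern(lexicon):
    """Compile one case-insensitive pattern matching any lexicon variant."""
    alternation = "|".join(variant_pattern(t) for t in lexicon)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


PATTERN = build_pattern(LEXICON)


def retrieve(tweets):
    """Return the tweets that mention at least one lexicon variant."""
    return [t for t in tweets if PATTERN.search(t)]


if __name__ == "__main__":
    sample = [
        "My son was born with a cleft-palate",
        "spinabifida awareness month",
        "great day at the park",
        "Congenital Heart Defects run in my family",
    ]
    for tweet in retrieve(sample):
        print(tweet)
```

A pattern like this only retrieves candidate mentions. As the RESULTS paragraph notes, deciding whether a retrieved tweet indicates that the user's own child has a birth defect required manual annotation.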

SUBMITTER: Klein AZ

PROVIDER: S-EPMC6295660 | biostudies-literature | 2018 Nov

REPOSITORIES: biostudies-literature

Publications

Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter.

Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G

Journal of Biomedical Informatics, 2018 Oct 4

PMID: 30292855

Similar Datasets

Towards scaling Twitter for digital epidemiology of birth defects.
| S-EPMC6773753 | biostudies-literature

Rising tides or rising stars?: Dynamics of shared attention on Twitter during media events.
| S-EPMC4031071 | biostudies-other

How loneliness is talked about in social media during COVID-19 pandemic: Text mining of 4,492 Twitter feeds.
| S-EPMC8754394 | biostudies-literature

Bayesian rule learning for biomedical data mining.
| S-EPMC2852212 | biostudies-literature

GatewayNet: a form of sequential rule mining.
| S-EPMC6480909 | biostudies-literature

Pregex: Rule-Based Detection and Extraction of Twitter Data in Pregnancy.
| S-EPMC9951068 | biostudies-literature

Applying negative rule mining to improve genome annotation.
| S-EPMC1940032 | biostudies-literature

Twitter sentiment around the Earnings Announcement events.
| S-EPMC5325598 | biostudies-literature

Fast rule-based bioactivity prediction using associative classification mining.
| S-EPMC3515428 | biostudies-literature

Microbial genotype-phenotype mapping by class association rule mining.
| S-EPMC2718668 | biostudies-literature