Dataset Information

Towards scaling Twitter for digital epidemiology of birth defects.

ABSTRACT: Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes (the leading cause of infant mortality) could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms (feature-engineered and deep learning-based classifiers) that automatically distinguish tweets referring to the user's pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the "defect" class and 0.51 for the "possible defect" class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
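
The class imbalance and per-class F1-scores described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the "non-defect" label for tweets that merely mention birth defects, and the 1:1 sampling ratio are assumptions made for illustration, and only the Python standard library is used.

```python
import random


def undersample(examples, majority_label, ratio=1.0, seed=0):
    """Randomly under-sample the majority class.

    `examples` is a list of (text, label) pairs. At most
    ratio * (number of non-majority examples) majority examples are kept.
    The label names and the default ratio are illustrative assumptions.
    """
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex[1] == majority_label]
    minority = [ex for ex in examples if ex[1] != majority_label]
    keep = min(len(majority), int(ratio * len(minority)))
    result = minority + rng.sample(majority, keep)
    rng.shuffle(result)
    return result


def f1_per_class(gold, pred):
    """Return a dict mapping each label to its F1-score."""
    scores = {}
    for label in set(gold) | set(pred):
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores
```

Under-sampling would be applied to the training split only, so that the F1-scores on the evaluation split still reflect the original class distribution.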

SUBMITTER: Klein AZ

PROVIDER: S-EPMC6773753 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

Towards scaling Twitter for digital epidemiology of birth defects.

Klein AZ, Sarker A, Weissenbacher D, Gonzalez-Hernandez G

NPJ Digital Medicine, 2019-10-01

PMID: 31583284

Similar Datasets

Project description: Cleft lip with or without cleft palate (CL/P) and cleft palate only (CPO) are common congenital malformations. Numerous epidemiologic studies have shown an increased risk for orofacial clefts among children whose mothers smoked during early pregnancy; however, there is concern that the results of these studies may have been biased because of exposure misclassification. The purpose of this study is to use previous research on the reliability of self-reported cigarette smoking to produce corrected point estimates (and associated credible intervals) of the effect of maternal smoking on children's risk of clefts. We accounted for misclassification using 4 Bayesian models that made different assumptions about the sensitivity and specificity of self-reported maternal smoking data. We used results from previous studies to specify the prior distributions for sensitivity and specificity of reporting and used Markov chain Monte Carlo algorithms to calculate the posterior distribution of the effect of maternal smoking on children's risk for CL/P and CPO. After correcting for potential sources of misclassification in data from the National Birth Defects Prevention Study, we found an increased risk of CL/P among children born to mothers who smoked during early pregnancy (posterior odds ratio [OR] = 1.6, 95% credible interval = 1.1-2.2). The posterior effect of smoking on CPO provided less evidence of effect (posterior OR = 1.1, 95% credible interval = 0.7-1.7). Our results lend some credibility to the hypothesis that periconceptional maternal smoking increases the risk of a child being born with CL/P. The results concerning CPO provide no overall evidence of effect, although the estimates were relatively imprecise. We suggest that future research should emphasize validity studies, especially those of differential reporting, rather than replicating existing analyses of the relationship between maternal smoking and clefts. We discuss how our approach is also applicable to evaluating misclassification in a wide range of exposure-outcome scenarios.
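
To make the role of sensitivity and specificity concrete, here is a minimal sketch of a deterministic correction for exposure misclassification in a case-control table. It is a simpler stand-in for the Bayesian models described above, not that study's method: it treats sensitivity and specificity as fixed values rather than prior distributions, and all function names are hypothetical.

```python
def corrected_exposed(observed_exposed, total, sensitivity, specificity):
    """Back-calculate the true number of exposed subjects in one group.

    Uses the standard identity
        observed = Se * true + (1 - Sp) * (total - true)
    solved for `true`. Se and Sp are fixed inputs here (an assumption);
    the study described above placed prior distributions on them instead.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed_exposed - (1.0 - specificity) * total) / denom


def corrected_odds_ratio(case_exposed, case_total, control_exposed,
                         control_total, se_cases, sp_cases,
                         se_controls, sp_controls):
    """Odds ratio after correcting exposure counts in cases and controls."""
    a = corrected_exposed(case_exposed, case_total, se_cases, sp_cases)
    c = corrected_exposed(control_exposed, control_total,
                          se_controls, sp_controls)
    b = case_total - a        # corrected unexposed cases
    d = control_total - c     # corrected unexposed controls
    return (a * d) / (b * c)
```

Passing different sensitivity or specificity values for cases and controls models differential reporting, which is the scenario the description singles out for future validity studies.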

Project description: BACKGROUND: Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. OBJECTIVE: The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. METHODS: To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user's child has a birth defect, and (ii) accessibility to the user's tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. RESULTS: We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95. CONCLUSIONS: Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
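
The inter-annotator agreement statistic reported above, Cohen's kappa, can be computed as in the following minimal sketch. The function name and the toy labels in the usage note are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with toy annotations `[1, 1, 0, 0]` and `[1, 0, 0, 0]`, the observed agreement is 0.75 and the chance agreement is 0.5, so the function returns 0.5.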
