Dataset Information

A corpus for mining drug-related knowledge from Twitter chatter: Language models and their utilities.

ABSTRACT: In this data article, we present to the data science, natural language processing and public health communities an unlabeled corpus and a set of language models. We collected the data from Twitter using drug names as keywords, including their common misspelled forms. Using this data, which is rich in drug-related chatter, we developed language models to aid the development of data mining tools and methods in this domain. We generated several models that capture (i) distributed word representations and (ii) probabilities of n-gram sequences. The data set we are releasing consists of 267,215 Twitter posts made during the four-month period from November 2014 to February 2015. The posts mention over 250 drug-related keywords. The language models encapsulate semantic and sequential properties of the texts.
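To illustrate what "probabilities of n-gram sequences" means in practice, the following is a minimal sketch of a bigram model with maximum-likelihood estimates. It is not the authors' implementation: the abstract does not say which toolkit, n-gram order, or smoothing was used, and the function names and the three toy posts are made up for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count context tokens and bigrams over lowercased, whitespace-split text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Sentence boundary markers let the model score first and last words.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(unigrams, bigrams, prev, word):
    """Unsmoothed estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Hypothetical toy posts, not taken from the released corpus.
toy_posts = [
    "took my adderall this morning",
    "took my xanax last night",
    "forgot my adderall again",
]
unigrams, bigrams = train_bigram_model(toy_posts)
p = bigram_probability(unigrams, bigrams, "my", "adderall")
print(f"P(adderall | my) = {p:.3f}")
```

A model trained on real data would normally add smoothing so that unseen word pairs do not receive a probability of zero.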

SUBMITTER: Sarker A

PROVIDER: S-EPMC5144647 | biostudies-literature | 2017 Feb

REPOSITORIES: biostudies-literature

Publications

A corpus for mining drug-related knowledge from Twitter chatter: Language models and their utilities.

Sarker A, Gonzalez G

Data in Brief, 2016-11-23

PMID: 27981203

Similar Datasets

Project description: When the Zika virus outbreak became a global health emergency in early 2016, the scientific community responded with an increased output of Zika-related research. This upsurge in research naturally made its way into academic journals along with editorials, news, and reports. However, it is not yet known how or whether these scholarly communications were distributed to the populations most affected by Zika. To understand how scientific outputs about Zika reached global and local audiences, we collected Tweets and Facebook posts that linked to Zika-related research in the first six months of 2016. Using a language detection algorithm, we found that up to 90% of Twitter and 76% of Facebook posts are in English. However, when none of the authors of the scholarly article are from English-speaking countries, posts on both social media are less likely to be in English. The effect is most pronounced on Facebook, where the likelihood of posting in English is between 11 and 16% lower when none of the authors are from English-speaking countries, as compared to when some or all are. Similarly, posts about papers written with a Brazilian author are 13% more likely to be in Portuguese on Facebook than when made on Twitter. Our main conclusion is that scholarly communication on Twitter and Facebook of Zika-related research is dominated by English, despite Brazil being the epicenter of the Zika epidemic. This result suggests that scholarly findings about the Zika virus are unlikely to be distributed directly to relevant populations through these popular online mediums. Nevertheless, there are differences between platforms. Compared to Twitter, scholarly communication on Facebook is more likely to be in the language of an author's country. The Zika outbreak provides a useful case study for understanding how scientific outputs are communicated to relevant populations. Our results suggest that Facebook is a more effective channel than Twitter, if communication is desired to be in the native language of the affected country. Further research should explore how local media (such as governmental websites, newspapers and magazines, as well as television and radio) disseminate scholarly publication.

Project description:
Importance: Automatic curation of consumer-generated, opioid-related social media big data may enable real-time monitoring of the opioid epidemic in the United States.
Objective: To develop and validate an automatic text-processing pipeline for geospatial and temporal analysis of opioid-mentioning social media chatter.
Design, setting, and participants: This cross-sectional, population-based study was conducted from December 1, 2017, to August 31, 2019, and used more than 3 years of publicly available social media posts on Twitter, dated from January 1, 2012, to October 31, 2015, that were geolocated in Pennsylvania. Opioid-mentioning tweets were extracted using prescription and illicit opioid names, including street names and misspellings. Social media posts (tweets) (n = 9006) were manually categorized into 4 classes, and training and evaluation of several machine learning algorithms were performed. Temporal and geospatial patterns were analyzed with the best-performing classifier on unlabeled data.
Main outcomes and measures: Pearson and Spearman correlations of county- and substate-level abuse-indicating tweet rates with opioid overdose death rates from the Centers for Disease Control and Prevention WONDER database and with 4 metrics from the National Survey on Drug Use and Health (NSDUH) for 3 years were calculated. Classifier performances were measured through microaveraged F1 scores (harmonic mean of precision and recall) or accuracies and 95% CIs.
Results: A total of 9006 social media posts were annotated, of which 1748 (19.4%) were related to abuse, 2001 (22.2%) were related to information, 4830 (53.6%) were unrelated, and 427 (4.7%) were not in the English language. Yearly rates of abuse-indicating social media posts showed statistically significant correlation with county-level opioid-related overdose death rates (n = 75) for 3 years (Pearson r = 0.451, P < .001; Spearman r = 0.331, P = .004). Abuse-indicating tweet rates showed consistent correlations with 4 NSDUH metrics (n = 13) associated with nonmedical prescription opioid use (Pearson r = 0.683, P = .01; Spearman r = 0.346, P = .25), illicit drug use (Pearson r = 0.850, P < .001; Spearman r = 0.341, P = .25), illicit drug dependence (Pearson r = 0.937, P < .001; Spearman r = 0.495, P = .09), and illicit drug dependence or abuse (Pearson r = 0.935, P < .001; Spearman r = 0.401, P = .17) over the same 3-year period, although the tests lacked power to demonstrate statistical significance. A classification approach involving an ensemble of classifiers produced the best performance in accuracy or microaveraged F1 score (0.726; 95% CI, 0.708-0.743).
Conclusions and relevance: The correlations obtained in this study suggest that a social media-based approach reliant on supervised machine learning may be suitable for geolocation-centric monitoring of the US opioid epidemic in near real time.
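The microaveraged F1 score mentioned above pools true positives, false positives, and false negatives across all classes before taking the harmonic mean of precision and recall. The sketch below shows that computation on a handful of invented labels; the class names loosely mirror the 4 annotation classes, but the data and the `micro_f1` helper are assumptions for illustration, not the study's evaluation code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label, multi-class predictions."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correctly predicted this class
            elif pred == label:
                fp += 1  # predicted this class, but truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for five posts.
y_true = ["abuse", "information", "unrelated", "unrelated", "non_english"]
y_pred = ["abuse", "unrelated", "unrelated", "unrelated", "non_english"]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro-F1 = {score:.3f}, accuracy = {accuracy:.3f}")
```

When every post receives exactly one label, each error counts once as a false positive and once as a false negative, so micro-F1 equals accuracy. That is consistent with the abstract reporting "accuracy or microaveraged F1 score" as a single figure.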

Project description:
Background: The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers' perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter.
Objective: This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example.
Methods: We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website's search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods.
Results: We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively.
Conclusions: The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.
