Dataset Information

Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study.

ABSTRACT:

BACKGROUND: The coronavirus disease (COVID-19) pandemic is a global health emergency with over 6 million cases worldwide as of the beginning of June 2020. The pandemic is historic in scope and precedent given its emergence in an increasingly digital era. Importantly, there have been concerns about the accuracy of COVID-19 case counts due to issues such as lack of access to testing and difficulty in measuring recoveries.

OBJECTIVE: The aims of this study were to detect and characterize user-generated conversations that could be associated with COVID-19-related symptoms, experiences with access to testing, and mentions of disease recovery using an unsupervised machine learning approach.

METHODS: Tweets were collected from the Twitter public streaming application programming interface from March 3-20, 2020, filtered for general COVID-19-related keywords and then further filtered for terms that could be related to COVID-19 symptoms as self-reported by users. Tweets were analyzed using an unsupervised machine learning approach called the biterm topic model (BTM), where groups of tweets containing the same word-related themes were separated into topic clusters that included conversations about symptoms, testing, and recovery. Tweets in these clusters were then extracted and manually annotated for content analysis and assessed for their statistical and geographic characteristics.

RESULTS: A total of 4,492,954 tweets were collected that contained terms that could be related to COVID-19 symptoms. After using BTM to identify relevant topic clusters and removing duplicate tweets, we identified a total of 3465 (<1%) tweets that included user-generated conversations about experiences that users associated with possible COVID-19 symptoms and other disease experiences. These tweets were grouped into five main categories including first- and secondhand reports of symptoms, symptom reporting concurrent with lack of testing, discussion of recovery, confirmation of negative COVID-19 diagnosis after receiving testing, and users recalling symptoms and questioning whether they might have been previously infected with COVID-19. The co-occurrence of tweets for these themes was statistically significant for users reporting symptoms with a lack of testing and with a discussion of recovery. A total of 63% (n=1112) of the geotagged tweets were located in the United States.

CONCLUSIONS: This study used unsupervised machine learning for the purposes of characterizing self-reporting of symptoms, experiences with testing, and mentions of recovery related to COVID-19. Many users reported symptoms they thought were related to COVID-19, but they were not able to get tested to confirm their concerns. In the absence of testing availability and confirmation, accurate case estimations for this period of the outbreak may never be known. Future studies should continue to explore the utility of infoveillance approaches to estimate COVID-19 disease severity.
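The two-stage keyword filter and the biterm representation described in the methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the keyword sets `GENERAL_TERMS` and `SYMPTOM_TERMS` are hypothetical placeholders (the study's actual filter terms are not listed in this record), the tokenizer is deliberately crude, and the code only builds the word-pair ("biterm") counts that a BTM takes as input. Topic inference itself is not shown.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets; the study's real filter terms are not given here.
GENERAL_TERMS = {"covid", "coronavirus"}
SYMPTOM_TERMS = {"fever", "cough", "symptoms"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (so 'COVID-19' -> 'covid')."""
    return re.findall(r"[a-z]+", text.lower())


def passes_filters(tokens):
    """Stage 1: a general COVID-19 keyword; stage 2: a symptom-related keyword."""
    words = set(tokens)
    return bool(words & GENERAL_TERMS) and bool(words & SYMPTOM_TERMS)


def biterms(tokens):
    """Return unordered pairs of distinct words co-occurring in one short text.

    Simplification: repeated words are collapsed before pairing.
    """
    return list(combinations(sorted(set(tokens)), 2))


def count_biterms(tweets):
    """Filter tweets, then count biterms across the tweets that were kept."""
    counts = Counter()
    kept = 0
    for tweet in tweets:
        tokens = tokenize(tweet)
        if passes_filters(tokens):
            kept += 1
            counts.update(biterms(tokens))
    return kept, counts


if __name__ == "__main__":
    sample = [
        "I have a fever and cough, could it be covid?",
        "Covid news update",
        "My cough is bad",
    ]
    kept, counts = count_biterms(sample)
    print(kept, counts[("cough", "fever")])
```

Only the first sample tweet matches both keyword stages, so it alone contributes biterms. A real analysis would also remove stop words and duplicate tweets before fitting the topic model.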

SUBMITTER: Mackey T

PROVIDER: S-EPMC7282475 | biostudies-literature | 2020 Jun

REPOSITORIES: biostudies-literature


Publications

Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study.

Mackey T, Purushothaman V, Li J, Shah N, Nali M, Bardier C, Liang B, Cai M, Cuomo R

JMIR Public Health and Surveillance, 2020-06-08, issue 2


PMID: 32490846

Similar Datasets

Project description: Background: The coronavirus disease (COVID-19) pandemic is perhaps the greatest global health challenge of the last century. Accompanying this pandemic is a parallel "infodemic," including the online marketing and sale of unapproved, illegal, and counterfeit COVID-19 health products including testing kits, treatments, and other questionable "cures." Enabling the proliferation of this content is the growing ubiquity of internet-based technologies, including popular social media platforms that now have billions of global users. Objective: This study aims to collect, analyze, identify, and enable reporting of suspected fake, counterfeit, and unapproved COVID-19-related health care products from Twitter and Instagram. Methods: This study is conducted in two phases beginning with the collection of COVID-19-related Twitter and Instagram posts using a combination of web scraping on Instagram and filtering the public streaming Twitter application programming interface for keywords associated with suspect marketing and sale of COVID-19 products. The second phase involved data analysis using natural language processing (NLP) and deep learning to identify potential sellers that were then manually annotated for characteristics of interest. We also visualized illegal selling posts on a customized data dashboard to enable public health intelligence. Results: We collected a total of 6,029,323 tweets and 204,597 Instagram posts filtered for terms associated with suspect marketing and sale of COVID-19 health products from March to April for Twitter and February to May for Instagram. After applying our NLP and deep learning approaches, we identified 1271 tweets and 596 Instagram posts associated with questionable sales of COVID-19-related products. Generally, product introduction came in two waves, with the first consisting of questionable immunity-boosting treatments and a second involving suspect testing kits. We also detected a low volume of pharmaceuticals that have not been approved for COVID-19 treatment. Other major themes detected included products offered in different languages, various claims of product credibility, completely unsubstantiated products, unapproved testing modalities, and different payment and seller contact methods. Conclusions: Results from this study provide initial insight into one front of the "infodemic" fight against COVID-19 by characterizing what types of health products, selling claims, and types of sellers were active on two popular social media platforms at earlier stages of the pandemic. This cybercrime challenge is likely to continue as the pandemic progresses and more people seek access to COVID-19 testing and treatment. This data intelligence can help public health agencies, regulatory authorities, legitimate manufacturers, and technology platforms better remove and prevent this content from harming the public.

Project description: Background: Black women in the United States disproportionately suffer adverse pregnancy and birth outcomes compared to White women. Economic adversity and implicit bias during clinical encounters may lead to physiological responses that place Black women at higher risk for adverse birth outcomes. The novel coronavirus disease of 2019 (COVID-19) further exacerbated this risk, as safety protocols increased social isolation in clinical settings, thereby limiting opportunities to advocate for unbiased care. Twitter, one of the most popular social networking sites, has been used to study a variety of issues of public interest, including health care. This study considers whether posts on Twitter accurately reflect public discourse during the COVID-19 pandemic and are being used in infodemiology studies by public health experts. Objective: This study aims to assess the feasibility of Twitter for identifying public discourse related to social determinants of health and advocacy that influence maternal health among Black women across the United States and to examine trends in sentiment between 2019 and 2020 in the context of the COVID-19 pandemic. Methods: Tweets were collected from March 1 to July 13, 2020, from 21 organizations and influencers and from 4 hashtags that focused on Black maternal health. Additionally, tweets from the same organizations and hashtags were collected from the year prior, from March 1 to July 13, 2019. Twint, a Python programming library, was used for data collection and analysis. We gathered the text of approximately 17,000 tweets, as well as all publicly available metadata. Topic modeling and k-means clustering were used to analyze the tweets. Results: A variety of trends were observed when comparing the 2020 data set to the 2019 data set from the same period. The percentages listed for each topic are probabilities of that topic occurring in our corpus. In our topic models, tweets on reproductive justice, maternal mortality crises, and patient care increased by 67.46% in 2020 versus 2019. Topics on community, advocacy, and health equity increased by over 30% in 2020 versus 2019. In contrast, tweet topics that decreased in 2020 versus 2019 were as follows: tweets on Medicaid and medical coverage decreased by 27.73%, and discussions about creating space for Black women decreased by just under 30%. Conclusions: The results indicate that the COVID-19 pandemic may have spurred an increased focus on advocating for improved reproductive health and maternal health outcomes among Black women in the United States. Further analyses are needed to capture a longer time frame that encompasses more of the pandemic, as well as more diverse voices to confirm the robustness of the findings. We also concluded that Twitter is an effective source for providing a snapshot of relevant topics to guide Black maternal health advocacy efforts.

Project description: BACKGROUND: On December 6 and 7, 2017, the US Department of Health and Human Services (HHS) hosted its first Code-a-Thon event aimed at leveraging technology and data-driven solutions to help combat the opioid epidemic. The authors, an interdisciplinary team from academia, the private sector, and the US Centers for Disease Control and Prevention, participated in the Code-a-Thon as part of the prevention track. OBJECTIVE: The aim of this study was to develop and deploy a methodology using machine learning to accurately detect the marketing and sale of opioids by illicit online sellers via Twitter as part of participation at the HHS Opioid Code-a-Thon event. METHODS: Tweets were collected from the Twitter public application programming interface stream filtered for common prescription opioid keywords in conjunction with participation in the Code-a-Thon from November 15, 2017 to December 5, 2017. An unsupervised machine learning-based approach was developed and used during the Code-a-Thon competition (24 hours) to obtain a summary of the content of the tweets to isolate those clusters associated with illegal online marketing and sale using a biterm topic model (BTM). After isolating relevant tweets, hyperlinks associated with these tweets were reviewed to assess the characteristics of illegal online sellers. RESULTS: We collected and analyzed 213,041 tweets over the course of the Code-a-Thon containing keywords codeine, percocet, vicodin, oxycontin, oxycodone, fentanyl, and hydrocodone. Using BTM, 0.32% (692/213,041) of tweets were identified as being associated with illegal online marketing and sale of prescription opioids. After removing duplicates and dead links, we identified 34 unique “live” tweets, with 44% (15/34) directing consumers to illicit online pharmacies, 32% (11/34) linked to individual drug sellers, and 21% (7/34) used by marketing affiliates. In addition to offering the “no prescription” sale of opioids, many of these vendors also sold other controlled substances and illicit drugs. CONCLUSIONS: The results of this study are in line with prior studies that have identified social media platforms, including Twitter, as a potential conduit for supply and sale of illicit opioids. To translate these results into action, authors also developed a prototype wireframe for the purposes of detecting, classifying, and reporting illicit online pharmacy tweets selling controlled substances illegally to the US Food and Drug Administration and the US Drug Enforcement Agency. Further development of solutions based on these methods has the potential to proactively alert regulators and law enforcement agencies of illegal opioid sales, while also making the online environment safer for the public.
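The proportions reported in the opioid study above can be rechecked with a few lines of arithmetic. The sketch below uses only the counts quoted in that abstract; the helper name `pct` is introduced here for illustration and is not from the study.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


TOTAL_TWEETS = 213_041   # tweets collected during the Code-a-Thon
FLAGGED = 692            # tweets in BTM clusters linked to illegal sales
LIVE = 34                # unique "live" tweets after removing duplicates and dead links

# Share of collected tweets flagged by the topic model, to two decimals.
flagged_share = f"{100 * FLAGGED / TOTAL_TWEETS:.2f}%"

# Breakdown of the live tweets by seller type, using the quoted counts.
breakdown = {
    "illicit online pharmacies": pct(15, LIVE),
    "individual drug sellers": pct(11, LIVE),
    "marketing affiliates": pct(7, LIVE),
}

if __name__ == "__main__":
    print(flagged_share)
    for label, value in breakdown.items():
        print(f"{label}: {value}%")
```

The recomputed values agree with the reported 0.32%, 44%, 32%, and 21%. The three quoted categories account for 33 of the 34 live tweets, so one tweet appears to fall outside the categories named in the abstract.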

Project description: Background: Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets. Objective: This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments. Methods: We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance. Results: LSTM-CNN performed the best with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98) for relevance, all deep learning classifiers including LSTM-CNN performed better than the traditional classifiers with an AUC of 0.99 (95% CI 0.98-0.99) for distinguishing commercial from noncommercial tweets, and BiLSTM performed the best with an AUC of 0.83 (95% CI 0.78-0.89) for provape sentiment. Overall, LSTM-CNN performed the best across all 3 classification tasks. Conclusions: We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.

Project description: Background: The COVID-19 pandemic necessitated rapid real-time surveillance of epidemiological data to advise governments and the public, but the accuracy of these data depends on myriad auxiliary assumptions, not least accurate reporting of cases by the public. Wastewater monitoring has emerged internationally as an accurate and objective means for assessing disease prevalence with reduced latency and less dependence on public vigilance, reliability, and engagement. How public interest aligns with COVID-19 personal testing data and wastewater monitoring is, however, very poorly characterized. Objective: This study aims to assess the associations between internet search volume data relevant to COVID-19, public health care statistics, and national-scale wastewater monitoring of SARS-CoV-2 across South Wales, United Kingdom, over time to investigate how interest in the pandemic may reflect the prevalence of SARS-CoV-2, as detected by national testing and wastewater monitoring, and how these data could be used to predict case numbers. Methods: Relative search volume data from Google Trends for search terms linked to the COVID-19 pandemic were extracted and compared against government-reported COVID-19 statistics and quantitative reverse transcription polymerase chain reaction (RT-qPCR) SARS-CoV-2 data generated from wastewater in South Wales, United Kingdom, using multivariate linear models, correlation analysis, and predictions from linear models. Results: Wastewater monitoring, most infoveillance terms, and nationally reported cases significantly correlated, but these relationships changed over time. Wastewater surveillance data and some infoveillance search terms generated predictions of case numbers that correlated with reported case numbers, but the accuracy of these predictions was inconsistent and many of the relationships changed over time. Conclusions: Wastewater monitoring presents a valuable means for assessing population-level prevalence of SARS-CoV-2 and could be integrated with other data types such as infoveillance for increasingly accurate inference of virus prevalence. The importance of such monitoring is increasingly clear as a means of objectively assessing the prevalence of SARS-CoV-2 to circumvent the dynamic interest and participation of the public. Increased accessibility of wastewater monitoring data to the public, as is the case for other national data, may enhance public engagement with these forms of monitoring.

