Dataset Information

Infoveillance of the Croatian Online Media During the COVID-19 Pandemic: One-Year Longitudinal Study Using Natural Language Processing.

ABSTRACT:

Background

Online media play an important role in public health emergencies and serve as essential communication platforms. Infoveillance of online media during the COVID-19 pandemic is an important step toward gaining a better understanding of crisis communication.

Objective

The goal of this study was to perform a longitudinal analysis of the COVID-19-related content on online media based on natural language processing.

Methods

We collected a data set of news articles published by Croatian online media during the first 13 months of the pandemic. First, we tested the correlations between the number of articles and the number of new daily COVID-19 cases. Second, we analyzed the content by extracting the most frequent terms and applied the Jaccard similarity coefficient. Third, we compared the occurrence of the pandemic-related terms during the two waves of the pandemic. Finally, we applied named entity recognition to extract the most frequent entities and tracked the dynamics of changes during the observation period.
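The term-overlap step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the helper names `top_terms` and `jaccard`, and the sample sentences are all assumptions made for illustration. The Jaccard similarity coefficient of two term sets A and B is |A ∩ B| / |A ∪ B|.

```python
from collections import Counter


def top_terms(texts, k):
    """Return the k most frequent whitespace-separated tokens as a set.

    Hypothetical helper: a real pipeline would also lemmatize the
    tokens and remove stop words before counting.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return {term for term, _ in counts.most_common(k)}


def jaccard(a, b):
    """Jaccard similarity coefficient |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


# Invented example sentences standing in for articles from two periods.
first_wave = ["virus virus virus mask mask lockdown"]
second_wave = ["virus virus virus mask mask vaccine"]

overlap = jaccard(top_terms(first_wave, 3), top_terms(second_wave, 3))
print(f"Jaccard overlap of top terms: {overlap:.2f}")
```

A value close to 1 means the two periods share most of their frequent terms; a value close to 0 means the vocabulary has shifted.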

Results

The results showed no significant correlation between the number of articles and the number of new daily COVID-19 cases. Furthermore, there were high overlaps in the terminology used in all articles published during the pandemic with a slight shift in the pandemic-related terms between the first and the second waves. Finally, the findings indicate that the most influential entities have lower overlaps for the identified people and higher overlaps for locations and institutions.

Conclusions

Our study shows that online media responded promptly to the pandemic with a large number of COVID-19-related articles. There was a high overlap in the frequently used terms across the first 13 months, which may indicate a narrow focus of reporting in certain periods. However, the pandemic-related terminology is well covered.

SUBMITTER: Beliga S

PROVIDER: S-EPMC8715984 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description:

Background: The coronavirus disease (COVID-19) pandemic is perhaps the greatest global health challenge of the last century. Accompanying this pandemic is a parallel "infodemic," including the online marketing and sale of unapproved, illegal, and counterfeit COVID-19 health products including testing kits, treatments, and other questionable "cures." Enabling the proliferation of this content is the growing ubiquity of internet-based technologies, including popular social media platforms that now have billions of global users.

Objective: This study aims to collect, analyze, identify, and enable reporting of suspected fake, counterfeit, and unapproved COVID-19-related health care products from Twitter and Instagram.

Methods: This study is conducted in two phases beginning with the collection of COVID-19-related Twitter and Instagram posts using a combination of web scraping on Instagram and filtering the public streaming Twitter application programming interface for keywords associated with suspect marketing and sale of COVID-19 products. The second phase involved data analysis using natural language processing (NLP) and deep learning to identify potential sellers that were then manually annotated for characteristics of interest. We also visualized illegal selling posts on a customized data dashboard to enable public health intelligence.

Results: We collected a total of 6,029,323 tweets and 204,597 Instagram posts filtered for terms associated with suspect marketing and sale of COVID-19 health products from March to April for Twitter and February to May for Instagram. After applying our NLP and deep learning approaches, we identified 1271 tweets and 596 Instagram posts associated with questionable sales of COVID-19-related products. Generally, product introduction came in two waves, with the first consisting of questionable immunity-boosting treatments and a second involving suspect testing kits. We also detected a low volume of pharmaceuticals that have not been approved for COVID-19 treatment. Other major themes detected included products offered in different languages, various claims of product credibility, completely unsubstantiated products, unapproved testing modalities, and different payment and seller contact methods.

Conclusions: Results from this study provide initial insight into one front of the "infodemic" fight against COVID-19 by characterizing what types of health products, selling claims, and types of sellers were active on two popular social media platforms at earlier stages of the pandemic. This cybercrime challenge is likely to continue as the pandemic progresses and more people seek access to COVID-19 testing and treatment. This data intelligence can help public health agencies, regulatory authorities, legitimate manufacturers, and technology platforms better remove and prevent this content from harming the public.

Project description:

Importance: Automatic curation of consumer-generated, opioid-related social media big data may enable real-time monitoring of the opioid epidemic in the United States.

Objective: To develop and validate an automatic text-processing pipeline for geospatial and temporal analysis of opioid-mentioning social media chatter.

Design, setting, and participants: This cross-sectional, population-based study was conducted from December 1, 2017, to August 31, 2019, and used more than 3 years of publicly available social media posts on Twitter, dated from January 1, 2012, to October 31, 2015, that were geolocated in Pennsylvania. Opioid-mentioning tweets were extracted using prescription and illicit opioid names, including street names and misspellings. Social media posts (tweets) (n = 9006) were manually categorized into 4 classes, and training and evaluation of several machine learning algorithms were performed. Temporal and geospatial patterns were analyzed with the best-performing classifier on unlabeled data.

Main outcomes and measures: Pearson and Spearman correlations of county- and substate-level abuse-indicating tweet rates with opioid overdose death rates from the Centers for Disease Control and Prevention WONDER database and with 4 metrics from the National Survey on Drug Use and Health (NSDUH) for 3 years were calculated. Classifier performances were measured through microaveraged F1 scores (harmonic mean of precision and recall) or accuracies and 95% CIs.

Results: A total of 9006 social media posts were annotated, of which 1748 (19.4%) were related to abuse, 2001 (22.2%) were related to information, 4830 (53.6%) were unrelated, and 427 (4.7%) were not in the English language. Yearly rates of abuse-indicating social media posts showed statistically significant correlation with county-level opioid-related overdose death rates (n = 75) for 3 years (Pearson r = 0.451, P < .001; Spearman r = 0.331, P = .004). Abuse-indicating tweet rates showed consistent correlations with 4 NSDUH metrics (n = 13) associated with nonmedical prescription opioid use (Pearson r = 0.683, P = .01; Spearman r = 0.346, P = .25), illicit drug use (Pearson r = 0.850, P < .001; Spearman r = 0.341, P = .25), illicit drug dependence (Pearson r = 0.937, P < .001; Spearman r = 0.495, P = .09), and illicit drug dependence or abuse (Pearson r = 0.935, P < .001; Spearman r = 0.401, P = .17) over the same 3-year period, although the tests lacked power to demonstrate statistical significance. A classification approach involving an ensemble of classifiers produced the best performance in accuracy or microaveraged F1 score (0.726; 95% CI, 0.708-0.743).

Conclusions and relevance: The correlations obtained in this study suggest that a social media-based approach reliant on supervised machine learning may be suitable for geolocation-centric monitoring of the US opioid epidemic in near real time.

Project description:

Background: In recent years, Korean society has increasingly recognized the importance of nurses in the context of population aging and infectious disease control. However, nurses still face difficulties with regard to policy activities that are aimed at improving the nursing workforce structure and working environment. Media coverage plays an important role in public awareness of a particular issue and can be an important strategy in policy activities.

Objective: This study analyzed data from 18 years of news coverage on nursing-related issues. The focus of this study was to examine the drivers of the social, local, economic, and political agendas that were emphasized in the media by the analysis of main sources and their quotes. This analysis revealed which nursing media agendas were emphasized (eg, social aspects), neglected (eg, policy aspects), and negotiated.

Methods: Descriptive analysis, natural language processing, and semantic network analysis were applied to analyze data collected from 2005 to 2022. BigKinds was used for the collection of data, automatic multi-categorization of news, named entity recognition of news sources, and extraction and topic modeling of quotes. The main news sources were identified by conducting a 1-mode network analysis with SNAnalyzer. The main agendas of nursing-related news coverage were examined through the qualitative analysis of major sources' quotes by section. The common and individual interests of the top-ranked sources were analyzed through a 2-mode network analysis using UCINET.

Results: In total, 128,339 articles from 54 media outlets on nursing-related issues were analyzed. Descriptive analysis showed that nursing-related news was mainly covered in social (99,868/128,339, 77.82%) and local (48,056/128,339, 48.56%) sections, whereas it was rarely covered in economic (9439/128,339, 7.35%) and political (7301/128,339, 5.69%) sections. Furthermore, 445 sources that had made the top 20 list at least once by year and section were analyzed. Other than "nurse," the main sources for each section were "labor union," "local resident," "government," and "Moon Jae-in." "Nursing Bill" emerged as a common interest among nurses and doctors, although the topic did not garner considerable attention from the Ministry of Health and Welfare. Analyzing quotes showed that nurses were portrayed as heroes, laborers, survivors of abuse, and perpetrators. The economic section focused on employment of youth and women in nursing. In the political section, conflicts between nurses and doctors, which may have caused policy confusion, were highlighted. Policy formulation processes were not adequately reported. Media coverage of the enactment of nursing laws tended to relate to confrontations between political parties.

Conclusions: The media plays a crucial role in highlighting various aspects of nursing practice. However, policy formulation processes to solve nursing issues were not adequately reported in South Korea. This study suggests that nurses should secure policy compliance by persuading the public to understand their professional perspectives.

Project description:

Background: While scientific knowledge of post-COVID-19 condition (PCC) is growing, there remains significant uncertainty in the definition of the disease, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians.

Objective: In this study, we aimed to determine the validity and effectiveness of advanced natural language processing approaches built to derive insight into PCC-related patient-reported health outcomes from social media platforms Twitter and Reddit. We extracted PCC-related terms, including symptoms and conditions, and measured their occurrence frequency. We compared the outputs with human annotations and clinical outcomes and tracked symptom and condition term occurrences over time and locations to explore the pipeline's potential as a surveillance tool.

Methods: We used bidirectional encoder representations from transformers (BERT) models to extract and normalize PCC symptom and condition terms from English posts on Twitter and Reddit. We compared 2 named entity recognition models and implemented a 2-step normalization task to map extracted terms to unique concepts in standardized terminology. The normalization steps were done using a semantic search approach with BERT biencoders. We evaluated the effectiveness of BERT models in extracting the terms using a human-annotated corpus and a proximity-based score. We also compared the validity and reliability of the extracted and normalized terms to a web-based survey with more than 3000 participants from several countries.

Results: UmlsBERT-Clinical had the highest accuracy in predicting entities closest to those extracted by human annotators. Based on our findings, the top 3 most commonly occurring groups of PCC symptom and condition terms were systemic (such as fatigue), neuropsychiatric (such as anxiety and brain fog), and respiratory (such as shortness of breath). In addition, we also found novel symptom and condition terms that had not been categorized in previous studies, such as infection and pain. Regarding the co-occurring symptoms, the pair of fatigue and headaches was among the most co-occurring term pairs across both platforms. Based on the temporal analysis, the neuropsychiatric terms were the most prevalent, followed by the systemic category, on both social media platforms. Our spatial analysis concluded that 42% (10,938/26,247) of the analyzed terms included location information, with the majority coming from the United States, United Kingdom, and Canada.

Conclusions: The outcome of our social media-derived pipeline is comparable with the results of peer-reviewed articles relevant to PCC symptoms. Overall, this study provides unique insights into patient-reported health outcomes of PCC and valuable information about the patient's journey that can help health care providers anticipate future needs.

International Registered Report Identifier (IRRID): RR2-10.1101/2022.12.14.22283419.