Dataset Information

Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study.

ABSTRACT: BACKGROUND: Mental illness affects a significant portion of the worldwide population. Online mental health forums can provide a supportive environment for those afflicted and also generate a large amount of data that can be mined to predict mental health states using machine learning methods. OBJECTIVE: This study aimed to benchmark multiple methods of text feature representation for social media posts and compare their downstream use with automated machine learning (AutoML) tools. We tested on datasets that contain posts labeled for perceived suicide risk or moderator attention in the context of self-harm. Specifically, we assessed the ability of the methods to prioritize posts that a moderator would identify for immediate response. METHODS: We used 1588 labeled posts from the Computational Linguistics and Clinical Psychology (CLPsych) 2017 shared task collected from the Reachout.com forum. Posts were represented using lexicon-based tools, including Valence Aware Dictionary and sEntiment Reasoner, Empath, and Linguistic Inquiry and Word Count, and also using pretrained artificial neural network models, including DeepMoji, Universal Sentence Encoder, and Generative Pretrained Transformer-1 (GPT-1). We used Tree-based Optimization Tool and Auto-Sklearn as AutoML tools to generate classifiers to triage the posts. RESULTS: The top-performing system used features derived from the GPT-1 model, which was fine-tuned on over 150,000 unlabeled posts from Reachout.com. Our top system had a macroaveraged F1 score of 0.572, providing a new state-of-the-art result on the CLPsych 2017 task. This was achieved without additional information from metadata or preceding posts. Error analyses revealed that this top system often misses expressions of hopelessness. In addition, we have presented visualizations that aid in the understanding of the learned classifiers. CONCLUSIONS: In this study, we found that transfer learning is an effective strategy for predicting risk with relatively little labeled data and noted that fine-tuning of pretrained language models provides further gains when large amounts of unlabeled text are available.
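
The headline result above is a macroaveraged F1 score. As a rough illustration of how such a score is computed (not the study's own evaluation code), the sketch below calculates a per-class F1 and then takes the unweighted mean, so rare classes count as much as common ones. The function name, the optional labels argument, and the toy labels "low", "medium", and "high" are all assumptions made for this example; they are not the label set of the CLPsych 2017 task.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores.

    `labels` optionally restricts which classes are averaged
    (hypothetical convenience parameter for this sketch).
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up toy labels, purely illustrative.
    truth = ["low", "low", "medium", "high"]
    preds = ["low", "medium", "medium", "high"]
    print(round(macro_f1(truth, preds), 3))  # 0.778
```

Passing a subset through labels averages over only those classes, which is one way an evaluation could focus on the posts that need attention.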

SUBMITTER: Howard D

PROVIDER: S-EPMC7254287 | biostudies-literature | 2020 May

REPOSITORIES: biostudies-literature

Publications

Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study.

Derek Howard, Marta M Maslej, Justin Lee, Jacob Ritchie, Geoffrey Woollard, Leon French

Journal of Medical Internet Research, 2020-05-13, issue 5

PMID: 32401222

Similar Datasets

Project description: Background: Although COVID-19 vaccines have recently become available, efforts in global mass vaccination can be hampered by the widespread issue of vaccine hesitancy. Objective: The aim of this study was to use social media data to capture close-to-real-time public perspectives and sentiments regarding COVID-19 vaccines, with the intention to understand the key issues that have captured public attention, as well as the barriers and facilitators to successful COVID-19 vaccination. Methods: Twitter was searched for tweets related to "COVID-19" and "vaccine" over an 11-week period after November 18, 2020, following a press release regarding the first effective vaccine. An unsupervised machine learning approach (ie, structural topic modeling) was used to identify topics from tweets, with each topic further grouped into themes using manually conducted thematic analysis as well as guided by the theoretical framework of the COM-B (capability, opportunity, and motivation components of behavior) model. Sentiment analysis of the tweets was also performed using the rule-based machine learning model VADER (Valence Aware Dictionary and Sentiment Reasoner). Results: Tweets related to COVID-19 vaccines were posted by individuals around the world (N=672,133). Six overarching themes were identified: (1) emotional reactions related to COVID-19 vaccines (19.3%), (2) public concerns related to COVID-19 vaccines (19.6%), (3) discussions about news items related to COVID-19 vaccines (13.3%), (4) public health communications about COVID-19 vaccines (10.3%), (5) discussions about approaches to COVID-19 vaccination drives (17.1%), and (6) discussions about the distribution of COVID-19 vaccines (20.3%). Tweets with negative sentiments largely fell within the themes of emotional reactions and public concerns related to COVID-19 vaccines. Tweets related to facilitators of vaccination showed temporal variations over time, while tweets related to barriers remained largely constant throughout the study period. Conclusions: The findings from this study may facilitate the formulation of comprehensive strategies to improve COVID-19 vaccine uptake; they highlight the key processes that require attention in the planning of COVID-19 vaccination and provide feedback on evolving barriers and facilitators in ongoing vaccination drives to allow for further policy tweaks. The findings also illustrate three key roles of social media in COVID-19 vaccination, as follows: surveillance and monitoring, a communication platform, and evaluation of government responses.

Project description: Background: Dementia is a global public health priority due to rapid growth of the aging population. As China has the world's largest population with dementia, this debilitating disease has created tremendous challenges for older adults, family caregivers, and health care systems on the mainland nationwide. However, public awareness and knowledge of the disease remain limited in Chinese society. Objective: This study examines online public discourse and sentiment toward dementia among the Chinese public on a leading Chinese social media platform Weibo. Specifically, this study aims to (1) assess and examine public discourse and sentiment toward dementia among the Chinese public, (2) determine the extent to which dementia-related discourse and sentiment vary among different user groups (ie, government, journalists/news media, scientists/experts, and the general public), and (3) characterize temporal trends in public discourse and sentiment toward dementia among different user groups in China over the past decade. Methods: In total, 983,039 original dementia-related posts published by 347,599 unique users between 2010 and 2021, together with their user information, were analyzed. Machine learning analytical techniques, including topic modeling, sentiment analysis, and semantic network analyses, were used to identify salient themes/topics and their variations across different user groups (ie, government, journalists/news media, scientists/experts, and the general public). Results: Topic modeling results revealed that symptoms, prevention, and social support are the most prevalent dementia-related themes on Weibo. Posts about dementia policy/advocacy have been increasing in volume since 2018. Raising awareness is the least discussed topic over time. Sentiment analysis indicated that Weibo users generally attach negative attitudes/emotions to dementia, with the general public holding a more negative attitude than other user groups. Conclusions: Overall, dementia has received greater public attention on social media since 2018. In particular, discussions related to dementia advocacy and policy are gaining momentum in China. However, disparaging language is still used to describe dementia in China; therefore, a nationwide initiative is needed to alter the public discourse on dementia. The results contribute to previous research by providing a macrolevel understanding of the Chinese public's discourse and attitudes toward dementia, which is essential for building national education and policy initiatives to create a dementia-friendly society. Our findings indicate that dementia is associated with negative sentiments, and symptoms and prevention dominate public discourse. The development of strategies to address unfavorable perceptions of dementia requires policy and public health attention. The results further reveal that an urgent need exists to increase public knowledge about dementia. Social media platforms potentially could be leveraged for future dementia education interventions to increase dementia awareness and promote positive attitudes.

Project description: Background: Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients' experiences. We motivate the need for this type of classification as a pre-processing step for further analysis of social media data in the context of related work in this area. In this paper we present experiments for a three-way document classification by patient voice, professional voice or other. We present results for a convolutional neural network classifier trained on English data from two different data sources (Reddit and Twitter) and two domains (cardiovascular and skin diseases). Results: We found that document classification by patient voice, professional voice or other can be done consistently manually (0.92 accuracy). Annotators agreed roughly equally for each domain (cardiovascular and skin) but they agreed more when annotating Reddit posts compared to Twitter posts. Best classification performance was obtained when training two separate classifiers for each data source, one for Reddit and one for Twitter posts, when evaluating on in-source test data for both test sets combined with an overall accuracy of 0.95 (and macro-average F1 of 0.92) and an F1-score of 0.95 for patient voice only. Conclusion: The main conclusion resulting from this work is that combining social media data from platforms with different characteristics for training a patient and professional voice classifier does not result in best possible performance. We showed that it is best to train separate models per data source (Reddit and Twitter) instead of a model using the combined training data from both sources. We also found that it is preferable to train separate models per domain (cardiovascular and skin) while showing that the difference to the combined model is only minor (0.01 accuracy). Our highest overall F1-score (0.95) obtained for classifying posts as patient voice is a very good starting point for further analysis of social media data reflecting the experience of patients.

Project description: Background: Effective suicide risk assessments and interventions are vital for suicide prevention. Although assessing such risks is best done by health care professionals, people experiencing suicidal ideation may not seek help. Hence, machine learning (ML) and computational linguistics can provide analytical tools for understanding and analyzing risks. This, therefore, facilitates suicide intervention and prevention. Objective: This study aims to explore, using statistical analyses and ML, whether computerized language analysis could be applied to assess and better understand a person's suicide risk on social media. Methods: We used the University of Maryland Suicidality Dataset comprising text posts written by users (N=866) of mental health-related forums on Reddit. Each user was classified with a suicide risk rating (no, low, moderate, or severe) by either medical experts or crowdsourced annotators, denoting their estimated likelihood of dying by suicide. In language analysis, the Linguistic Inquiry and Word Count lexicon assessed sentiment, thinking styles, and part of speech, whereas readability was explored using the TextStat library. The Mann-Whitney U test identified differences between at-risk (low, moderate, and severe risk) and no-risk users. Meanwhile, the Kruskal-Wallis test and Spearman correlation coefficient were used for granular analysis between risk levels and to identify redundancy, respectively. In the ML experiments, gradient boost, random forest, and support vector machine models were trained using 10-fold cross validation. The area under the receiver operator curve and F1-score were the primary measures. Finally, permutation importance uncovered the features that contributed the most to each model's decision-making. Results: Statistically significant differences (P<.05) were identified between the at-risk (671/866, 77.5%) and no-risk groups (195/866, 22.5%). This was true for both the crowd- and expert-annotated samples. Overall, at-risk users had higher median values for most variables (authenticity, first-person pronouns, and negation), with a notable exception of clout, which indicated that at-risk users were less likely to engage in social posturing. A high positive correlation (ρ>0.84) was present between the part of speech variables, which implied redundancy and demonstrated the utility of aggregate features. All ML models performed similarly in their area under the curve (0.66-0.68); however, the random forest and gradient boost models were noticeably better in their F1-score (0.65 and 0.62) than the support vector machine (0.52). The features that contributed the most to the ML models were authenticity, clout, and negative emotions. Conclusions: In summary, our statistical analyses found linguistic features associated with suicide risk, such as social posturing (eg, authenticity and clout), first-person singular pronouns, and negation. This increased our understanding of the behavioral and thought patterns of social media users and provided insights into the mechanisms behind ML models. We also demonstrated the applicative potential of ML in assisting health care professionals to assess and manage individuals experiencing suicide risk.
