Project description:Although network simulators (NS) are increasingly used to predict the behavior of computer networks, users often face a variety of challenges and share them on Stack Overflow (SO). However, these challenges have not been studied. This paper presents an NS discussion dataset extracted from SOTorrent, which consists of 2,322 NS-related question posts spanning 17 features. Data collection proceeded in five steps: filtering the initial post dataset using simulator tags, discovering NS-related tags, collecting the tagged posts, extracting the post titles and preprocessing them for LDA (Latent Dirichlet Allocation), and finally applying LDA topic modeling to cluster the NS posts into eight named topics. We believe that this dataset will help the research community highlight issues faced by NS users.
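The first three collection steps described above (filtering by simulator tags, discovering related tags, collecting the tagged posts) can be sketched in a few lines. This is only an illustration under stated assumptions: the toy posts, the seed tags, and the co-occurrence rule with its `min_ratio` and `min_count` thresholds are hypothetical and are not taken from the paper, which does not specify how related tags were discovered.

```python
from collections import Counter

# Hypothetical toy posts; the real dataset is extracted from SOTorrent.
POSTS = [
    {"id": 1, "title": "ns-3 wifi module build error", "tags": {"ns-3", "c++"}},
    {"id": 2, "title": "OMNeT++ INET routing example", "tags": {"omnet++", "inet"}},
    {"id": 3, "title": "INET TCP throughput is low", "tags": {"inet", "tcp"}},
    {"id": 4, "title": "Sorting a list in Python", "tags": {"python", "list"}},
    {"id": 5, "title": "ns-3 with INET-like mobility", "tags": {"ns-3", "inet"}},
]

# Assumed seed simulator tags used for the initial filter.
SEED_TAGS = {"ns-3", "omnet++"}


def discover_related_tags(posts, seed_tags, min_ratio=0.5, min_count=2):
    """Return non-seed tags that mostly appear alongside a seed tag.

    A tag qualifies if it is used at least min_count times and at least
    min_ratio of those uses are on posts that also carry a seed tag.
    """
    total = Counter()
    with_seed = Counter()
    for post in posts:
        has_seed = bool(post["tags"] & seed_tags)
        for tag in post["tags"] - seed_tags:
            total[tag] += 1
            if has_seed:
                with_seed[tag] += 1
    return {
        tag
        for tag, count in total.items()
        if count >= min_count and with_seed[tag] / count >= min_ratio
    }


def collect_posts(posts, tags):
    """Keep every post that carries at least one of the given tags."""
    return [post for post in posts if post["tags"] & tags]


related = discover_related_tags(POSTS, SEED_TAGS)
ns_posts = collect_posts(POSTS, SEED_TAGS | related)
titles = [post["title"] for post in ns_posts]  # input for later LDA preprocessing
print(sorted(related), [post["id"] for post in ns_posts])
```

The `min_count` threshold keeps a general-purpose tag that happens to appear once on a simulator post (here `c++`) from being treated as NS-related.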
Project description:Stack Overflow is currently the largest programming-related question-and-answer community, covering multiple programming areas. Changes in users' interests are a micro-level reflection of how macro-level knowledge areas intersect, and they have been widely studied in scientific fields, for example with literature data sets. However, there is still very little research on the general public, for example in question-and-answer communities. Therefore, in this work we analyze the interest changes of 2,307,720 Stack Overflow users. Specifically, we classify the tag network in the community and vectorize question topics to quantify users' interest change patterns. Results show that user interest change follows a power-law distribution, unlike the exponential distribution of scientists' interest change, but both are affected by three features: heterogeneity, recency, and proximity. Furthermore, users' reputations are negatively correlated with interest changes, suggesting the importance of concentration, i.e., those who focus on specific areas are more likely to gain a higher reputation. In general, our work supplements research on interest change in science with evidence from the general public, and it can also help community managers better design recommendation algorithms and promote the healthy development of communities.
Project description:Topic modeling utilizes unsupervised machine learning to detect underlying themes within texts and has been deployed routinely to analyze social media for insights into healthcare issues. However, the inherent messiness of social media hinders the full realization of this technique's potential. As such, we hypothesized that restricting medical concepts in social media texts to specific related semantic types and applying topic modeling to these concepts could be a feasible approach to overcome the challenge of traditional topic modeling for social media texts. Therefore, we developed a semantic-type-based topic modeling pipeline to discover self-reported health-related topics. This pipeline integrated semantic type information and Systematized Nomenclature of Medicine (SNOMED) precoordinated expressions into a traditional topic modeling approach to enhance effectiveness in clustering meaningful, distinct topics. Using social media texts regarding statins for illustration, we evaluated the efficacy of this new approach and validated a newly identified topic using real-world clinical data. Based on expert evaluations, this approach resulted in more novel, distinguishable, and meaningful health-related topics compared to traditional topic modeling. In addition, our electronic health record validation for a newly identified topic in two real-world clinical databases indicated that statin users had a higher prevalence of depression or anxiety compared to matched non-users. Our results indicate that this new topic modeling pipeline can improve the extraction of themes from noisy online discussions, thereby contributing to deeper insights for healthcare research.
Project description:Purpose: The objective of this study was to analyze proposed Korean nursing legislation as depicted in newspaper articles, to highlight issues related to the legislative process for this potential law, and to better understand social awareness regarding this matter. Methods: The study focused on articles from 11 leading newspapers in Korea, published between February 2020 and August 2023, that pertained to nursing legislation. The articles were retrieved from the BigKinds database. Following text preprocessing, analytical methods including term frequency-inverse document frequency were employed, along with latent Dirichlet allocation (LDA), for word and topic modeling analysis. Additionally, LDA was applied across time periods to examine temporal changes in topics. Results: Following preprocessing, a total of 7,967 words were extracted from the 991 articles selected for analysis. The primary themes identified in newspaper articles concerning the nursing legislation were organized into three main topics: 1) the necessity and impact of enactment of the nursing law, 2) the political context surrounding enactment of the law, and 3) the conflicts between and actions of healthcare organizations related to enactment of the law. Conclusions: The findings confirmed that media coverage regarding the proposed nursing legislation primarily concentrated on the political and social conflicts associated with the law's passage, rather than its necessity and substance. More compelling evidence must be presented concerning the influence of the nursing workforce and the work environment of nurses on patient safety and health outcomes. Additionally, strategies should be devised to improve public comprehension of the nursing law's provisions.
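The term frequency-inverse document frequency weighting mentioned in the methods can be illustrated with a minimal sketch. It assumes one common variant (relative term frequency multiplied by the natural log of N over document frequency) and uses hypothetical, already tokenized toy documents; the abstract does not state which variant or tooling the study used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF per document for a list of non-empty token lists.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Hypothetical tokenized articles, for illustration only.
DOCS = [
    ["nursing", "law", "enactment", "necessity"],
    ["nursing", "law", "politics", "vote"],
    ["nursing", "law", "doctors", "strike"],
]

scores = tf_idf(DOCS)
for doc_scores in scores:
    print(max(doc_scores, key=doc_scores.get))
```

Terms that occur in every document, such as "nursing" and "law" here, receive a weight of zero under this variant, so the highest-weighted terms are the ones that distinguish one article from the rest.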
Project description:Background: Maintaining a healthy weight can reduce the risk of developing many diseases, including type 2 diabetes, hypertension, and certain types of cancers. Online social media platforms are popular among people seeking social support regarding weight loss and sharing their weight loss experiences, which provides opportunities for learning about weight loss behaviors. Objective: This study aimed to investigate the extent to which the content posted by users in the r/loseit subreddit, an online community for discussing weight loss, and online interactions were associated with their weight loss in terms of the number of replies and votes that these users received. Methods: All posts that were published before January 2018 in r/loseit were collected. We focused on users who revealed their start weight, current weight, and goal weight and were active in this online community for at least 30 days. A topic modeling technique and a hierarchical clustering algorithm were used to obtain both global topics and local word semantic clusters. Finally, we used a regression model to learn the association between weight loss and topics, word semantic clusters, and online interactions. Results: Our data comprised 477,904 posts that were published by 7660 users within a span of 7 years. We identified 25 topics, including food and drinks, calories, exercises, family members and friends, and communication. Our results showed that the start weight (β=.823; P<.001), active days (β=.017; P=.009), and median number of votes (β=.263; P=.02), mentions of exercises (β=.145; P<.001), and nutrition (β=.120; P<.001) were associated with higher weight loss. Users who lost more weight might be motivated by the negative emotions (β=-.098; P<.001) that they experienced before starting the journey of weight loss. In contrast, users who mentioned vacations (β=-.108; P=.005) and payments (β=-.112; P=.001) tended to experience relatively less weight loss. Mentions of family members (β=-.031; P=.03) and employment status (β=-.041; P=.03) were associated with less weight loss as well. Conclusions: Our study showed that both online interactions and offline activities were associated with weight loss, suggesting that future interventions based on existing online platforms should focus on both aspects. Our findings suggest that online personal health data can be used to learn about health-related behaviors effectively.
Project description:Impoverished capacity for social inference is one of several symptoms that are common to both agenesis of the corpus callosum (AgCC) and Autism Spectrum Disorder (ASD). This research compared the ability of 14 adults with AgCC, 13 high-functioning adults with ASD and 14 neurotypical controls to accurately attribute social meaning to the interactions of animated triangles. Descriptions of the animations were analyzed in three ways: subjective ratings, Linguistic Inquiry and Word Count, and topic modeling (Latent Dirichlet Allocation). Although subjective ratings indicated that all groups made similar inferences from the animations, the index of perplexity (atypicality of topic) generated from topic modeling revealed that inferences from individuals with AgCC or ASD displayed significantly less social imagination than those of controls.
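To give an intuition for perplexity as a measure of atypicality, the sketch below scores two short descriptions against a reference text. It deliberately swaps in a much simpler model than the study used: an add-one-smoothed unigram model rather than Latent Dirichlet Allocation. The sentences are hypothetical, and every scored token is assumed to occur in the reference vocabulary.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Estimate add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def perplexity(model, tokens):
    """Perplexity = exp of the negative mean log-probability of the tokens."""
    log_prob = sum(math.log(model[word]) for word in tokens)
    return math.exp(-log_prob / len(tokens))


# Hypothetical descriptions, for illustration only.
reference = "the triangle chases the other triangle and they play".split()
typical = "the triangle chases the triangle".split()
atypical = "they play and other and they".split()

model = train_unigram(reference, set(reference))
print(perplexity(model, typical), perplexity(model, atypical))
```

The description built from words that are frequent in the reference text gets the lower perplexity. In the study the analogous quantity comes from a fitted topic model, so values from this sketch are not comparable to the reported index.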
Project description:Writing a high-quality, multiple-choice test item is a complex process. Creating plausible but incorrect options for each item poses significant challenges for the content specialist because this task is often undertaken without implementing a systematic method. In the current study, we describe and demonstrate a systematic method for creating plausible but incorrect options, also called distractors, based on students' misconceptions. These misconceptions are extracted from labeled written responses. One thousand five hundred and fifteen written responses from Grade 10 students to an existing constructed-response item in Biology were used to demonstrate the method. Using latent Dirichlet allocation, a topic modeling procedure commonly used in machine learning and natural language processing, 22 plausible misconceptions were identified from the students' written responses and used to produce a list of plausible distractors. These distractors, in turn, were used as part of new multiple-choice items. Implications for item development are discussed.
Project description:Background: Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists frequently turn to various forums, particularly domain-specific Q&A websites, to solve difficulties. These websites evolve into data science knowledge repositories over time. Analysis of such repositories can provide valuable insights into the applications, topics, trends, and challenges of data science. Methods: In this article, we investigated what data scientists are asking by analyzing all posts to date on DSSE, a data science-focused Q&A website. To discover main topics embedded in data science discussions, we used latent Dirichlet allocation (LDA), a probabilistic approach for topic modeling. Results: As a result of this analysis, 18 main topics were identified that demonstrate the current interests and issues in data science. We then examined the topics' popularity and difficulty. In addition, we identified the most commonly used tasks, techniques, and tools in data science. As a result, "Model Training", "Machine Learning", and "Neural Networks" emerged as the most prominent topics. Also, "Data Manipulation", "Coding Errors", and "Tools" were identified as the most viewed (most popular) topics. On the other hand, the most difficult topics were identified as "Time Series", "Computer Vision", and "Recommendation Systems". Our findings have significant implications for many data science stakeholders who are striving to advance data-driven architectures, concepts, tools, and techniques.
Project description:What aspects of word meaning are important in early word learning and lexico-semantic network development? Adult lexico-semantic systems flexibly encode multiple types of semantic features, including functional, perceptual, taxonomic, and encyclopedic. However, various theoretical accounts of lexical development differ on whether and how these semantic properties of word meanings are initially encoded into young children's emerging lexico-semantic networks. Whereas some accounts highlight the importance of early perceptual versus conceptual properties, others posit that thematic or functional aspects of word meaning are primary relative to taxonomic knowledge. We seek to shed light on these debates with 2 modeling studies that explore patterns in early word learning using a large database of early vocabulary in 5,450 children, and a newly developed set of semantic features of early acquired nouns. In Study 1, we ask whether semantic properties of early acquired words relate to order in which these words are typically learned; Study 2 models normative lexico-semantic noun-feature network development compared to random network growth. Both studies provide converging evidence that perceptual properties of word meanings play a key role in early word learning and lexico-semantic network development. The findings lend support to theoretical accounts of language learning that highlight the importance of the child's perceptual experience.
Project description:Topic Detection and Tracking (TDT) is a very active research topic within the area of text mining, generally applied to news feeds and Twitter datasets, where topics and events are detected. The notion of "event" is broad, but typically it applies to occurrences that can be detected from a single post or message. Little attention has been paid to what we call "micro-events", which, due to their nature, cannot be detected from a single piece of textual information. The study investigates the feasibility of micro-event detection on textual data using a sample of messages from the Stack Overflow Q&A platform and Free/Libre Open Source Software (FLOSS) version releases from the Libraries.io dataset. We build pipelines for detection of micro-events using three different estimators whose parameters are optimized using a grid search approach. We consider two feature spaces: LDA topic modeling with sentiment analysis, and hSBM topics with sentiment analysis. The feature spaces are optimized using the recursive feature elimination with cross validation (RFECV) strategy. In our experiments we investigate whether there is a characteristic change in the topic distribution or sentiment features before or after micro-events take place, and we thoroughly evaluate the capacity of each variant of our analysis pipeline to detect micro-events. Additionally, we perform a detailed statistical analysis of the models, including influential cases, variance inflation factors, validation of the linearity assumption, pseudo-R² measures, and no-information rate. Finally, in order to study the limits of micro-event detection, we design a method for generating synthetic micro-event datasets with properties similar to the real-world data, and use them to identify the micro-event detectability threshold for each of the evaluated classifiers.