Project description: Background: New drug treatments are regularly approved, and it is challenging to remain up to date in this rapidly changing environment. Fast and accurate visualization is important for a global understanding of the drug market. Automating this information extraction provides a helpful starting point for the subject matter expert, helps to mitigate human error, and saves time. Objective: We aimed to semiautomate disease population extraction from the free text of oncology drug approval descriptions from the BioMedTracker database for 6 selected drug targets. More specifically, we intended to extract (1) line of therapy, (2) stage of cancer of the patient population described in the approval, and (3) the clinical trials that provide evidence for the approval. We aimed to use these results in downstream applications, aiding the searchability of relevant content against related drug project sources. Methods: We fine-tuned a state-of-the-art deep learning model, Bidirectional Encoder Representations from Transformers (BERT), for each of the 3 desired outputs. We independently applied rule-based text mining approaches. We compared the performance of the deep learning and rule-based approaches and selected the best method, which was then applied to new entries. The results were manually curated by a subject matter expert and then used to train new models. Results: The training data set is currently small (433 entries) and will grow over time as new approval descriptions become available or if another drug target is taken into account. The deep learning models achieved 61% and 56% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively, which were treated as classification tasks. Trial identification is treated as a named entity recognition task, and the 5-fold cross-validated F1-score is currently 87%. Although the scores of the classification tasks may seem low, each task has 5 classes, for which uniform random guessing would yield an expected accuracy of 20%, so these scores are a marked improvement over random classification. Moreover, we expect performance to improve as the input data set grows, since deep learning models need a sufficiently large amount of training data to learn a task. The rule-based approach achieved 60% and 74% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively. No attempt was made to define a rule-based approach for trial identification. Conclusions: We developed a natural language processing algorithm that is currently assisting subject matter experts in disease population extraction, which supports health authority approvals. This algorithm achieves semiautomation, enabling subject matter experts to leverage the results for deeper analysis and to accelerate information retrieval in a crowded clinical environment such as oncology.
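
The rule-based side of the comparison described above can be illustrated with a minimal Python sketch. The abstract does not list the rules or name the 5 classes, so the labels, the keyword patterns, and the function name `classify_line_of_therapy` are all hypothetical and serve only to show the general shape of such an approach.

```python
import re

# Hypothetical keyword rules: the source does not publish its actual rules
# or class names, so these patterns and labels are illustrative assumptions.
LINE_OF_THERAPY_RULES = [
    ("first line", re.compile(
        r"\b(first[- ]line|1st[- ]line|previously untreated|treatment[- ]naive)\b",
        re.IGNORECASE)),
    ("second line or later", re.compile(
        r"\b(second[- ]line|2nd[- ]line|previously treated|relapsed|refractory)\b",
        re.IGNORECASE)),
    ("adjuvant", re.compile(r"\b(neo)?adjuvant\b", re.IGNORECASE)),
]


def classify_line_of_therapy(text):
    """Return the label of the first matching rule, or 'unknown' if none match."""
    for label, pattern in LINE_OF_THERAPY_RULES:
        if pattern.search(text):
            return label
    return "unknown"


def accuracy(predictions, gold):
    """Fraction of predictions that equal the curated gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Rules are checked in order, so earlier patterns take precedence when a description matches several. Scoring such predictions against expert-curated labels with `accuracy` gives the kind of figure that can be compared with the fine-tuned model on the same folds.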

Project description: Introduction: Active tuberculosis (ATB), caused by Mycobacterium tuberculosis (M.tb), is a leading cause of morbidity and mortality among infectious diseases. A large proportion of M.tb infections remain asymptomatic and are termed latent tuberculosis infections (LTBI). The difficulty of its diagnosis significantly hampers efforts to control and eventually eradicate it. Methodology: We obtained two microarray datasets, GSE39940 and GSE37250, from the Gene Expression Omnibus (GEO). Weighted gene co-expression network analysis (WGCNA) was used to identify co-expression modules among the differentially expressed genes from the first dataset, GSE39940. A pyroptosis-related module was thereby obtained, from which a pyroptosis-related signature (PRS) diagnostic model was built using a neural network algorithm. Using single-sample gene set enrichment analysis (ssGSEA), we further examined the immune cells involved in pyroptosis in the context of ATB. Finally, dataset GSE37250 served as a validation cohort to evaluate the diagnostic performance of the model. Results: WGCNA identified nine distinct co-expression modules. Module 1 was strongly correlated with pyroptosis. A diagnostic model comprising three pyroptosis-related signature genes, AIM2, CASP8, and NAIP, was built accordingly. The PRS model performed well in both cohorts, with areas under the curve (AUC) of 0.946 and 0.787, respectively. Conclusion: This study identified a pyroptosis-related signature in the pathogenesis of ATB. Furthermore, we developed a diagnostic model with considerable potential for efficient and accurate diagnosis.
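
For readers unfamiliar with the AUC values reported above, the following is a minimal Python sketch of how an AUC can be computed from model scores. It uses the pairwise (Mann-Whitney) formulation and is not the authors' code; the function name `roc_auc` and the example values are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample.
    Tied scores count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (made-up scores, not study data): 3 of the 4
# positive/negative pairs are ranked correctly.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

An AUC of 1.0 means every positive sample is ranked above every negative one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of samples, which is fine for illustration but slower than rank-based implementations on large cohorts.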

Project description: Purpose: The number of patients with alcohol-related problems is steadily increasing. A large-scale survey of alcohol-related problems has been conducted. However, studies that predict hazardous drinkers and identify which factors contribute to the prediction are limited. Thus, the purpose of this study was to predict hazardous drinkers and the severity of patients' alcohol-related problems using a deep learning algorithm based on large-scale survey data. Materials and Methods: Datasets from the National Health and Nutrition Examination Survey of South Korea (K-NHANES), a nationally representative survey of the entire South Korean population, were used to train deep learning and conventional machine learning algorithms. Datasets from 69,187 and 45,672 participants were used to predict hazardous drinkers and the severity of alcohol-related problems, respectively. The degree of contribution of each variable to the deep learning model was used to determine which variables contributed significantly to the prediction of hazardous drinkers. Results: Deep learning showed higher performance than conventional machine learning algorithms. It predicted hazardous drinkers with an AUC (area under the receiver operating characteristic curve) of 0.870 (logistic regression: 0.858, linear SVM: 0.849, random forest classifier: 0.810, K-nearest neighbors: 0.740). Among 325 variables for predicting hazardous drinkers, energy intake showed the greatest contribution to the prediction, followed by carbohydrate intake. Participants were classified into Zone I, Zone II, Zone III, and Zone IV based on the degree of alcohol-related problems, with AUCs of 0.881, 0.774, 0.853, and 0.879, respectively. Conclusion: Hazardous drinking groups could be effectively predicted, and individuals could be classified according to the degree of alcohol-related problems, using a deep learning algorithm. This algorithm could be used to screen people who need treatment for alcohol-related problems among the general population or hospital visitors.
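
The abstract does not say how the contribution of each variable was measured. One common, model-agnostic option is permutation importance, sketched below in Python as a swapped-in technique rather than the authors' method. The helper names, the toy model, and the synthetic rows are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.
    A larger drop suggests the model relies more on that feature."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic toy data (not survey data): feature 0 drives the label,
# feature 1 is ignored by the toy model.
rows = [[2500, 1], [2700, 0], [2900, 1], [3100, 0],
        [1500, 1], [1700, 0], [1900, 1], [2100, 0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Hypothetical stand-in for a trained classifier."""
    return 1 if row[0] > 2300 else 0


importances = permutation_importance(toy_model, rows, labels, n_features=2)
```

Because the toy model never reads the second feature, shuffling it cannot change any prediction and its importance is exactly zero, whereas shuffling the first feature lowers accuracy. Ranking a study's 325 variables would follow the same pattern, with the trained network in place of `toy_model`.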

Dataset Information

Development of the lyrics-based deep learning algorithm for identifying alcohol-related words (LYDIA)


OmicsDI is part of the ELIXIR infrastructure
