Project description:Diabetic nephropathy (DN) is a multifaceted disease with various contributing factors, which makes its underlying causes difficult to untangle. Uncovering biomarkers linked to the condition can shed light on its pathogenesis and support the development of new diagnostic and treatment methods. Gene expression data were obtained from publicly accessible databases, and weighted gene co-expression network analysis (WGCNA) was employed to pinpoint gene co-expression modules relevant to DN. Subsequently, several machine learning techniques, including random forest, the least absolute shrinkage and selection operator (LASSO), and support vector machine-recursive feature elimination (SVM-RFE), were used to distinguish DN cases from controls based on the identified gene modules. Functional enrichment analyses were also conducted to explore the biological roles of these genes. Our analysis revealed 131 genes with distinct expression patterns between DN and control groups. The integrated WGCNA identified 61 co-expressed genes spanning both categories. Enrichment analysis highlighted involvement in various immune responses and complex activities. Random forest, LASSO, and SVM-RFE were then applied to pinpoint key hub genes, leading to the identification of VWF and DNASE1L3, which showed significant consistency in both expression and function in the context of DN. Our research uncovered potential biomarkers for DN through the combined application of WGCNA and machine learning. The results indicate that these two hub genes could serve as novel diagnostic indicators and therapeutic targets for the disease, offering fresh perspectives on the development of DN and supporting the advancement of new diagnostic and treatment approaches.
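The hub-gene step described above, intersecting the top-ranked genes returned by several feature-selection methods, can be sketched as follows. This is a minimal illustration, not the study's actual code; the scores, the helper names `top_k`/`hub_genes`, and the choice of k are all assumptions for demonstration.

```python
# Hypothetical sketch: nominate hub genes by intersecting the top-k
# genes from three feature-selection methods (e.g. RF importance,
# LASSO coefficients, SVM-RFE ranks). Scores below are illustrative.

def top_k(scores, k):
    """Return the k genes with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def hub_genes(method_scores, k):
    """Intersect the top-k gene sets across all methods."""
    sets = [top_k(s, k) for s in method_scores.values()]
    hubs = sets[0]
    for s in sets[1:]:
        hubs &= s
    return hubs

rf    = {"VWF": 0.9, "DNASE1L3": 0.8, "GENE_A": 0.7, "GENE_B": 0.2}
lasso = {"VWF": 0.6, "DNASE1L3": 0.5, "GENE_B": 0.4, "GENE_A": 0.1}
svm   = {"DNASE1L3": 0.9, "VWF": 0.7, "GENE_C": 0.3, "GENE_A": 0.2}

print(sorted(hub_genes({"rf": rf, "lasso": lasso, "svm": svm}, 2)))
# → ['DNASE1L3', 'VWF']
```

Only genes ranked highly by every method survive the intersection, which is one common way such multi-model pipelines converge on a small hub set.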
Project description:Ransomware-related cyber-attacks have been on the rise over the last decade, causing considerable disruption to organizations. Developing new and better ways to detect this type of malware is necessary. This research applies dynamic analysis and machine learning to identify ever-evolving ransomware signatures using selected dynamic features. Since most of these attributes are shared by diverse ransomware-affected samples, our study can be used for detecting current and even new variants of the threat. This research has the following objectives: (1) execute experiments with encryptor and locker ransomware, combined with goodware, to generate JSON files with dynamic parameters using a sandbox; (2) analyze and select the most relevant and non-redundant dynamic features for distinguishing encryptor and locker ransomware from goodware; (3) generate and make public a dynamic-features dataset that includes these selected parameters for samples of different artifacts; (4) apply the dynamic-feature dataset to obtain models with machine learning algorithms. Five platforms, 20 ransomware artifacts, and 20 goodware artifacts were evaluated. The final feature dataset is composed of 2,000 records of 50 features each. This dataset supports machine learning detection with 10-fold cross-validation at an average accuracy above 0.99 for gradient boosted regression trees, random forest, and neural networks.
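One simple way to realize the "relevant and non-redundant" filtering in objective (2) is to drop any feature that is highly correlated with a feature already kept. This is a hedged sketch of that idea only; the feature names, values, and the greedy keep-first strategy are illustrative assumptions, not the study's implementation.

```python
# Illustrative redundancy filter: greedily keep a dynamic feature only
# if its absolute Pearson correlation with every already-kept feature
# stays below a threshold. Feature names/values are made up.

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def drop_redundant(features, threshold=0.95):
    """Return names of features that are mutually non-redundant."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, v)) < threshold for v in kept.values()):
            kept[name] = values
    return list(kept)

features = {
    "api_calls":     [1, 2, 3, 4, 5],
    "api_calls_x2":  [2, 4, 6, 8, 10],  # perfectly correlated -> dropped
    "files_written": [5, 1, 4, 2, 3],
}
print(drop_redundant(features))  # → ['api_calls', 'files_written']
```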
Project description:In this study, we developed machine learning (ML)-based prediction models for early childhood caries (ECC) and compared their performance with that of a traditional regression model. We analyzed data on 4195 children aged 1-5 years from the Korea National Health and Nutrition Examination Survey (2007-2018). We developed prediction models using the XGBoost (version 1.3.1), random forest, and LightGBM (version 3.1.1) algorithms in addition to logistic regression. Two different methods were applied for variable selection: regression-based backward elimination and a random forest-based permutation importance classifier. We compared the area under the receiver operating characteristic curve (AUROC) values and misclassification rates of the different models and observed that all four prediction models had AUROC values between 0.774 and 0.785, with no significant difference among them. Based on these results, we can confirm that both traditional logistic regression and ML-based models show favorable performance and can be used to predict early childhood caries, identify ECC high-risk groups, and implement active preventive treatments. However, further research is essential to improve the performance of the prediction models using recent methods such as deep learning.
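The AUROC comparison at the heart of this evaluation can be computed without any library via the Mann-Whitney interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and per-model scores below are synthetic stand-ins, not survey data.

```python
# Minimal AUROC via the Mann-Whitney U statistic (ties get half credit).
# Labels/scores are illustrative only.

def auroc(labels, scores):
    """AUROC = P(score_pos > score_neg), with ties counted as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0, 0]
model_scores = {
    "logistic": [0.9, 0.8, 0.4, 0.5, 0.3, 0.2],  # one pair misordered
    "xgboost":  [0.7, 0.6, 0.8, 0.4, 0.2, 0.1],  # perfectly separated
}
for name, s in model_scores.items():
    print(name, round(auroc(y, s), 3))
```

With many more samples, bootstrapping these AUROC values would give the kind of significance comparison the study reports.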
Project description:In this work, plasma samples from 5 metabolic syndrome patients and 5 healthy volunteers were collected. High-throughput RNA sequencing was then performed to profile the expression of plasma coding RNAs.
Project description:Colorectal cancer (CRC) affects the colon or rectum and is a common global health issue, with 1.1 million new cases occurring yearly. This study aimed to identify gene signatures for the early detection of CRC using machine learning (ML) algorithms on gene expression data. The TCGA-CRC and GSE50760 datasets were pre-processed and subjected to feature selection using the LASSO method in combination with five ML algorithms: AdaBoost, Random Forest (RF), Logistic Regression (LR), Gaussian Naive Bayes (GNB), and Support Vector Machine (SVM). The important features were further analyzed through gene expression, correlation, and survival analyses. Validation on the external dataset GSE142279 was also performed. The RF model had the best classification accuracy for both datasets. The feature selection process identified 12 candidate genes, which were subsequently reduced to 3 (CA2, CA7, and ITM2C) through gene expression and correlation analyses. These three genes achieved 100% accuracy on the external dataset, with AUC values of 99.24%, 100%, and 99.5%, respectively. Survival analysis showed a significant log-rank p-value of 0.044 for the final gene signatures. Analysis of tumor immunocyte infiltration showed a weak correlation with the expression of the gene signatures. CA2, CA7, and ITM2C can serve as gene signatures for the early detection of CRC and may provide valuable information for prognostic and therapeutic decision-making. Further research is needed to fully understand the potential of these genes in the context of CRC.
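To make the idea of classifying samples from a 3-gene signature concrete, here is a hedged toy sketch using a nearest-centroid rule on CA2/CA7/ITM2C expression. This is not the study's pipeline (which used LASSO plus the five classifiers above); the expression values and the centroid rule are illustrative assumptions only.

```python
# Toy nearest-centroid classifier on a 3-gene signature [CA2, CA7, ITM2C].
# All expression values are synthetic.

def centroid(samples):
    """Per-gene mean across a list of samples."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def classify(sample, centroids):
    """Assign the label of the closest centroid (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda lbl: dist(sample, centroids[lbl]))

tumour = [[1.0, 0.8, 1.2], [1.1, 0.9, 1.0]]   # low signature expression
normal = [[4.0, 3.8, 4.2], [4.1, 3.9, 4.0]]   # high signature expression
cents = {"tumour": centroid(tumour), "normal": centroid(normal)}
print(classify([1.2, 1.0, 1.1], cents))  # falls near the tumour centroid
```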
Project description:The use of offensive terms in user-generated content is a major concern for social media platforms. Offensive terms have a negative impact on individuals and can contribute to the degradation of societal and civilized manners. The immense amount of content generated at high speed makes it humanly impossible to categorise and detect offensive terms manually, and detecting such terminology automatically remains an open challenge for natural language processing (NLP). Substantial efforts have been made for high-resource languages such as English, but the task becomes more challenging for resource-poor languages such as Urdu because of the lack of standard datasets and pre-processing tools. This paper introduces a combinatorial pre-processing approach for developing a classification model for cross-platform (Twitter and YouTube) use. The approach uses datasets from the two platforms for training and testing models built with decision tree, random forest, and naive Bayes algorithms. The combinatorial pre-processing approach examines how machine learning models behave with different combinations of standard pre-processing techniques for a low-resource language in a cross-platform setting. The experimental results demonstrate the effectiveness of the machine learning models over different subsets of traditional pre-processing approaches in building a classification model for automatic offensive term detection in a low-resource language, i.e., Urdu, in the cross-platform scenario. In the experiments, when dataset D1 was used for training and D2 for testing, stopword removal produced the best results, with an accuracy of 83.27%.
Conversely, when dataset D2 was used for training and D1 for testing, the combination of stopword removal and punctuation removal performed better, with an accuracy of 74.54%. The combinatorial approach proposed in this paper outperformed the benchmark on the considered datasets using classical as well as ensemble machine learning, with accuracies of 82.9% and 97.2% for datasets D1 and D2, respectively.
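The combinatorial idea above, trying every subset of candidate pre-processing steps before training, can be sketched in a few lines. The step functions and the toy English stopword list are placeholders (a real Urdu pipeline would use language-specific resources); only the subset-enumeration pattern is the point.

```python
# Sketch of combinatorial pre-processing: enumerate every non-empty
# subset of candidate steps and compose them into a pipeline. The
# stopword list and example text are illustrative stand-ins.
from itertools import combinations
import string

STOPWORDS = {"the", "is", "a"}  # toy list; Urdu would need its own

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

STEPS = {"stopwords": remove_stopwords, "punctuation": remove_punctuation}

def all_pipelines(steps):
    """Yield (step_names, composed_function) for each non-empty subset."""
    for r in range(1, len(steps) + 1):
        for combo in combinations(steps, r):
            def pipeline(text, combo=combo):  # bind combo per closure
                for name in combo:
                    text = steps[name](text)
                return text
            yield combo, pipeline

for names, pipe in all_pipelines(STEPS):
    print(names, "->", pipe("the cat, is here!"))
```

Each generated pipeline would then feed the same classifier, and the accuracy per subset reveals which pre-processing combination suits each train/test platform pairing.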
Project description:An automatic electrocardiogram (ECG) myocardial infarction detection system must satisfy several requirements to be efficient in real-world practice. These requirements, such as reliability, low complexity, and high decision-making performance, remain very important in a realistic clinical environment. In this study, we investigated an automatic ECG myocardial infarction detection system and presented a new approach for evaluating its robustness and durability in classifying myocardial infarction (with no feature extraction) under different noise types. We employed three well-known supervised machine learning models: support vector machine (SVM), k-nearest neighbors (KNN), and random forest (RF). We tested the performance and robustness of these techniques in classifying normal (NOR) and myocardial infarction (MI) records using real ECGs from the PTB database, after normalization and segmentation of the data, with a suggested inter-patient paradigm separation, and with noise from the MIT-BIH noise stress test database (NSTDB). Finally, we measured four metrics: accuracy, precision, recall, and F1-score. The simulation revealed that all of the models performed well, achieving values above 0.50 on all investigated metrics even at low SNR levels across the different noise types. These results are encouraging and acceptable under extreme noise conditions, and the models can thus be considered sustainable and robust for the specific forms of noise studied. All of the tested methods could be used as ECG myocardial infarction detection tools in real-world practice under challenging circumstances.
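Robustness testing of this kind hinges on mixing a noise record into a clean ECG segment at a controlled SNR. A minimal sketch of that scaling step follows; the signals are synthetic sinusoids, not PTB or NSTDB records, and the function names are illustrative.

```python
# Mix `noise` into `signal` at a target SNR (dB): scale the noise so
# that power(signal) / power(scaled_noise) = 10**(snr_db/10).
# Signals below are synthetic stand-ins for ECG/NSTDB records.
import math

def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)

def add_noise(signal, noise, snr_db):
    """Return signal + noise scaled to the requested SNR in dB."""
    scale = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]

ecg   = [math.sin(2 * math.pi * t / 50) for t in range(200)]
noise = [math.sin(2 * math.pi * t / 7) for t in range(200)]
noisy = add_noise(ecg, noise, snr_db=6)

# Verify the achieved SNR of the injected component:
injected = [a - b for a, b in zip(noisy, ecg)]
ratio = 10 * math.log10(power(ecg) / power(injected))
print(round(ratio, 2))  # → 6.0
```

Sweeping `snr_db` downward reproduces the "lower SNR levels" stress test described above.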
Project description:Background: This study applied machine learning (ML) algorithms to construct a model for predicting enteral nutrition (EN) initiation in intensive care unit (ICU) patients and for identifying populations in need of EN at an early stage. Methods: Patient information was collected from the Medical Information Mart for Intensive Care IV database. All enrolled patients were split randomly into a training set and a validation set. Six ML models were established to evaluate the initiation of EN, and the best model was determined according to the area under the curve (AUC) and accuracy. The best model was interpreted using the Local Interpretable Model-Agnostic Explanations (LIME) algorithm and SHapley Additive exPlanations (SHAP) values. Results: A total of 53,150 patients participated in the study, divided into a training set (42,520; 80%) and a validation set (10,630; 20%). In the validation set, XGBoost had the optimal prediction performance, with an AUC of 0.895. The SHAP values revealed that sepsis, sequential organ failure assessment score, and acute kidney injury were the three most important factors affecting EN initiation. Individualized forecasts were displayed using the LIME algorithm. Conclusion: The XGBoost model was established and validated for early prediction of EN initiation in ICU patients.
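For intuition about the SHAP-based ranking reported here, note that for a plain linear model (with independent features) SHAP values have a closed form: phi_i = w_i * (x_i - mean(x_i)). The toy sketch below uses that closed form, not the study's XGBoost/TreeSHAP pipeline; the feature names, coefficients, and background data are illustrative assumptions.

```python
# Exact SHAP values for a linear model (independent-features case):
# phi_i = w_i * (x_i - E[x_i]). All weights/data below are toy values,
# not fitted to MIMIC-IV.

def linear_shap(weights, background, x):
    """Per-feature SHAP values for one sample under a linear model."""
    n = len(background)
    means = [sum(row[i] for row in background) / n
             for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]

features = ["sepsis", "sofa_score", "aki"]       # illustrative names
weights = [1.5, 0.3, 0.8]                        # toy coefficients
background = [[0, 4, 0], [1, 8, 1], [0, 6, 1]]   # toy reference cohort

phi = linear_shap(weights, background, x=[1, 10, 0])
ranking = sorted(zip(features, phi), key=lambda p: -abs(p[1]))
print(ranking)  # features ordered by |contribution| for this patient
```

Ranking features by mean |phi| across patients gives the kind of global importance list (sepsis, SOFA score, AKI) summarized above.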
Project description:Background: Epilepsy is the fourth-most common neurological disorder, affecting an estimated 50 million patients globally. Nearly 40% of patients have uncontrolled seizures yet incur 80% of the cost. Anti-epileptic drugs commonly result in resistance and reversion to uncontrolled drug-resistant epilepsy and are often associated with significant adverse effects. This has led to a trial-and-error system in which physicians spend months to years attempting to identify the optimal therapeutic approach. Objective: To investigate the potential clinical utility, in the context of optimal therapeutic prediction, of characterizing cellular electrophysiology. It is well established that genomic data alone can sometimes predict the effective therapeutic approach. Thus, to assess the predictive power of electrophysiological data, machine learning strategies are implemented to predict a subject's genetically defined class in an in silico model using brief electrophysiological recordings obtained from simulated neuronal networks. Methods: A dynamic network of isogenic neurons is modeled in silico for 1 s for 228 dynamically modeled patients falling into one of three categories: healthy, general sodium channel gain of function, or inhibitory sodium channel loss of function. Data from previous studies investigating the electrophysiological and cellular properties of neurons in vitro are used to define the parameters governing these models. Ninety-two electrophysiological features describing the nature and consistency of network connectivity, activity, waveform shape, and complexity are extracted for each patient network, and t-tests are used for feature selection for the following machine learning algorithms: Neural Network, Support Vector Machine, Gaussian Naïve Bayes Classifier, Decision Tree, and Gradient Boosting Decision Tree.
Finally, their performance in accurately predicting which genetic category the subjects fall under is assessed. Results: Several machine learning algorithms excel at using electrophysiological data from isogenic neurons to accurately predict genetic class: the Gaussian Naïve Bayes Classifier achieved the best accuracy, area under the curve, and F1 score for the healthy and gain-of-function classes and overall, while the Gradient Boosting Decision Tree performed best for loss-of-function models by the same metrics. Conclusions: It is possible for machine learning algorithms to use electrophysiological data to predict clinically valuable metrics such as optimal therapeutic approach, especially when combining several models.
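The t-test feature-selection step mentioned in the Methods can be sketched as follows: compute a Welch t-statistic per feature between two genetic classes and keep features exceeding a cut-off. The feature names, values, and the cut-off of 2.0 are illustrative assumptions, not simulation output.

```python
# Welch's t-statistic per feature between two classes; keep features
# whose |t| exceeds a cut-off. All values below are synthetic.
import math

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def select_features(class_a, class_b, t_cut=2.0):
    """Return feature names whose |t| between classes exceeds t_cut."""
    return [f for f in class_a
            if abs(welch_t(class_a[f], class_b[f])) > t_cut]

healthy = {"spike_rate": [5.0, 5.2, 4.9, 5.1],
           "burst_len":  [1.0, 1.4, 0.8, 1.2]}
gof     = {"spike_rate": [9.1, 8.8, 9.4, 9.0],
           "burst_len":  [1.1, 1.3, 0.9, 1.1]}
print(select_features(healthy, gof))  # → ['spike_rate']
```

A full pipeline would convert t to a p-value (or correct for 92 comparisons) before feeding the surviving features to the classifiers.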
Project description:The blood flow through the major vessels holds great diagnostic potential for the identification of cardiovascular complications and is therefore routinely assessed with current diagnostic modalities. Heart valves are subject to high hydrodynamic loads which render them prone to premature degradation. Failing native aortic valves are routinely replaced with bioprosthetic heart valves. This type of prosthesis is limited by a durability that is often less than the patient's life expectancy. Frequent assessment of valvular function can therefore help to ensure good long-term outcomes and to plan reinterventions. In this article, we describe how unsupervised novelty detection algorithms can be used to automate the interpretation of blood flow data to improve outcomes through early detection of adverse cardiovascular events without requiring repeated check-ups in a clinical environment. The proposed method was tested in an in-vitro flow loop which allowed simulating a failing aortic valve in a laboratory setting. Aortic regurgitation of increasing severity was deliberately introduced with tube-shaped inserts, preventing complete valve closure during diastole. Blood flow recordings from a flow meter at the location of the ascending aorta were analyzed with the algorithms introduced in this article and a diagnostic index was defined that reflects the severity of valvular degradation. The results indicate that the proposed methodology offers a high sensitivity towards pathological changes of valvular function and that it is capable of automatically identifying valvular degradation. Such methods may be a step towards computer-assisted diagnostics and telemedicine that provide the clinician with novel tools to improve patient care.
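One simple unsupervised novelty-detection scheme in the spirit described above is to learn a per-sample mean ± k·std envelope from baseline (healthy-valve) flow cycles and score new cycles by how often they leave it. This is a hedged sketch only: the envelope method, the waveforms, and the score definition are illustrative assumptions, not the article's algorithm.

```python
# Toy novelty detector for flow cycles: learn a mean +/- k*std envelope
# per sample index from baseline cycles, then score a new cycle by the
# fraction of samples outside the envelope. Waveforms are synthetic.

def envelope(cycles, k=3.0):
    """Per-index (lower, upper) bounds learned from baseline cycles."""
    n = len(cycles)
    bounds = []
    for i in range(len(cycles[0])):
        vals = [c[i] for c in cycles]
        m = sum(vals) / n
        s = (sum((v - m) ** 2 for v in vals) / n) ** 0.5
        bounds.append((m - k * s, m + k * s))
    return bounds

def novelty_score(cycle, env):
    """Fraction of samples falling outside the learned envelope."""
    out = sum(1 for v, (lo, hi) in zip(cycle, env) if not lo <= v <= hi)
    return out / len(cycle)

baseline = [[0, 5, 10, 5, 0], [0, 6, 11, 5, 0], [0, 5, 9, 4, 0]]
env = envelope(baseline)
print(novelty_score([0, 5, 10, 5, 0], env))   # healthy-like cycle → 0.0
print(novelty_score([0, 5, 10, 5, -6], env))  # diastolic backflow → 0.2
```

A rising score over successive check-ups would play the role of the diagnostic index of valvular degradation discussed in the article.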