Project description:Objectives: To explore heterogeneous disability trajectories and to construct explainable machine learning models that predict long-term disability trajectories, and clarify the mechanisms behind those predictions, among elderly Chinese at the community level. Methods: This study retrospectively collected data from the Chinese Longitudinal Healthy Longevity and Happy Family Study between 2002 and 2018. A total of 4149 subjects aged 65+ in 2002 with complete activities of daily living (ADL) information for at least three waves were included. A mixed growth model was used to identify disability trajectories, and five machine learning models were then built to predict disability trajectories from epidemiological variables. An explainable approach was deployed to understand the models' decisions. Results: Three distinct disability trajectories, namely a normal class (77.3%), a progressive class (15.5%), and a high-onset class (7.2%), were identified for three-class prediction. The latter two were merged into an abnormal class and set against the normal class for two-class prediction. Machine learning models, especially random forest and extreme gradient boosting, achieved good performance in both tasks. ADL, age, leisure activity, cognitive function, and blood pressure were key predictors. Conclusion: The findings suggest that machine learning performs well and may be of additional value in analyzing quality indicators for predicting disability trajectories, thereby providing a basis for personalized intervention measures.
Project description:Incipient Alzheimer's Disease (AD) is characterized by a slow onset of clinical symptoms, with pathological brain changes starting several years earlier. Consequently, it is necessary to first understand and differentiate age-related changes in brain regions in the absence of disease, and then to support early and accurate AD diagnosis. However, there is poor understanding of the initial stage of AD; seemingly healthy elderly brains lose matter in regions related to AD, but similar changes can also be found in non-demented subjects having mild cognitive impairment (MCI). By using a Linear Mixed Effects approach, we modelled the change of 166 Magnetic Resonance Imaging (MRI)-based biomarkers available at a 5-year follow-up on healthy elderly control (HC, n = 46) subjects. We hypothesized that, by identifying their significant variant (vr) and quasi-variant (qvr) brain regions over time, it would be possible to obtain an age-based null model, which would characterize their normal atrophy and growth patterns as well as the correlation between these two regions. By using the null model on those subjects who had been clinically diagnosed as HC (n = 161), MCI (n = 209) and AD (n = 331), normal age-related changes were estimated and deviation scores (residuals) from the observed MRI-based biomarkers were computed. Subject classification, as well as the early prediction of conversion to MCI and AD, was addressed through residual-based Support Vector Machines (SVM) modelling. We found reductions in most cortical volumes and thicknesses (with evident gender differences) as well as in sub-cortical regions, including greater atrophy in the hippocampus. The average accuracies (ACC) recorded for men and women were: AD-HC: 94.11%, MCI-HC: 83.77% and MCI converted to AD (cAD)-MCI non-converter (sMCI): 76.72%.
Likewise, compared with standard clinical diagnosis methods, the SVM classifiers predicted conversion to AD 1.9 years earlier for females (ACC: 72.5%) and 1.4 years earlier for males (ACC: 69.0%).
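The deviation-score idea in this abstract (fit a null model on healthy controls, then score other subjects by how far they depart from it) can be illustrated with a minimal sketch. It assumes a single biomarker and an ordinary least-squares line in age, which is a deliberate simplification of the Linear Mixed Effects model the abstract describes. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def deviation_scores(model, ages, observed):
    """Residuals: observed biomarker minus the value the null model expects at that age."""
    intercept, slope = model
    return [y - (intercept + slope * age) for age, y in zip(ages, observed)]


# Hypothetical healthy-control data (age in years, biomarker in arbitrary units).
hc_ages = [65.0, 70.0, 75.0, 80.0]
hc_volumes = [4.0, 3.9, 3.8, 3.7]
null_model = fit_line(hc_ages, hc_volumes)

# Hypothetical patients: negative residuals mean less volume than normal ageing predicts.
patient_residuals = deviation_scores(null_model, [72.0, 78.0], [3.5, 3.2])
print(null_model, patient_residuals)
```

In the study, residuals of this kind, computed for many biomarkers, are what the SVM classifiers take as input; the classifier itself is not shown here.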
Project description:MOTIVATION: Drug effects are mainly caused by the interactions between drug molecules and their target proteins, including primary targets and off-targets. Identification of the molecular mechanisms behind overall drug-target interactions is crucial in the drug design process. RESULTS: We develop a classifier-based approach to identify chemogenomic features (the underlying associations between drug chemical substructures and protein domains) that are involved in drug-target interaction networks. We propose a novel algorithm for extracting informative chemogenomic features by using L1-regularized classifiers over the tensor product space of possible drug-target pairs. It is shown that the proposed method can extract a very limited number of chemogenomic features without losing performance in predicting drug-target interactions, and that the extracted features are biologically meaningful. The extracted substructure-domain association network enables us to suggest ligand chemical fragments specific for each protein domain and ligand core substructures important for a wide range of protein families. AVAILABILITY: Software is available at the supplemental website. CONTACT: yamanishi@bioreg.kyushu-u.ac.jp SUPPLEMENTARY INFORMATION: Datasets and all results are available at http://cbio.ensmp.fr/~yyamanishi/l1binary/ .
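As a rough illustration of the "tensor product space of possible drug-target pairs", the sketch below builds the feature vector for one drug-target pair as the flattened outer product of a binary substructure fingerprint and a binary domain profile. It assumes binary inputs; the substructure and domain names are hypothetical placeholders, not taken from the paper's datasets. An L1-regularized linear classifier trained on such vectors would drive most weights to zero, and each surviving weight would name one substructure-domain association. The classifier is not shown.

```python
def pair_features(drug_bits, protein_bits):
    """Flattened outer product: entry (i, j) is 1 only if substructure i and domain j are both present."""
    return [d * p for d in drug_bits for p in protein_bits]


def active_pairs(drug_bits, protein_bits, substructures, domains):
    """Human-readable names of the substructure-domain pairs that are switched on."""
    return [
        (s, dom)
        for s, s_bit in zip(substructures, drug_bits) if s_bit
        for dom, d_bit in zip(domains, protein_bits) if d_bit
    ]


# Hypothetical vocabularies and profiles, for illustration only.
substructures = ["benzene_ring", "carboxyl", "amide"]
domains = ["kinase_domain", "SH2_domain"]
drug = [1, 0, 1]      # contains benzene_ring and amide
protein = [0, 1]      # contains SH2_domain

features = pair_features(drug, protein)
pairs = active_pairs(drug, protein, substructures, domains)
print(features)
print(pairs)
```

The feature vector has one entry per (substructure, domain) combination, so its length is the product of the two vocabulary sizes; this is why a sparsity-inducing penalty matters in this setting.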
Project description:Understanding the relationship between the genome of a cell and its phenotype is a central problem in precision medicine. Nonetheless, genotype-to-phenotype prediction comes with great challenges for machine learning algorithms that limit their use in this setting. The high dimensionality of the data tends to hinder generalization and challenges the scalability of most learning algorithms. Additionally, most algorithms produce models that are complex and difficult to interpret. We alleviate these limitations by proposing strong performance guarantees, based on sample compression theory, for rule-based learning algorithms that produce highly interpretable models. We show that these guarantees can be leveraged to accelerate learning and improve model interpretability. Our approach is validated through an application to the genomic prediction of antimicrobial resistance, an important public health concern. Highly accurate models were obtained for 12 species and 56 antibiotics, and their interpretation revealed known resistance mechanisms, as well as some potentially new ones. An open-source disk-based implementation that is both memory and computationally efficient is provided with this work. The implementation is turnkey, requires no prior knowledge of machine learning, and is complemented by comprehensive tutorials.
Project description:Computational models predicting symptomatic progression at the individual level can be highly beneficial for early intervention and treatment planning for Alzheimer's disease (AD). Individual prognosis is complicated by many factors, including the definition of the prediction objective itself. In this work, we present a computational framework comprising machine-learning techniques for 1) modeling symptom trajectories and 2) prediction of symptom trajectories using multimodal and longitudinal data. We perform primary analyses on three cohorts from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and a replication analysis using subjects from the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing (AIBL). We model the prototypical symptom trajectory classes using clinical assessment scores from the mini-mental state exam (MMSE) and the Alzheimer's Disease Assessment Scale (ADAS-13) at nine timepoints spanning six years, based on a hierarchical clustering approach. Subsequently, we predict these trajectory classes for a given subject using magnetic resonance (MR) imaging, genetic, and clinical variables from two timepoints (baseline + follow-up). For prediction, we present a longitudinal Siamese neural-network (LSN) with novel architectural modules for combining multimodal data from two timepoints. The trajectory modeling yields two (stable and decline) and three (stable, slow-decline, fast-decline) trajectory classes for MMSE and ADAS-13 assessments, respectively. For the predictive tasks, LSN offers highly accurate performance with 0.900 accuracy and 0.968 AUC for the binary MMSE task and 0.760 accuracy for the 3-way ADAS-13 task on ADNI datasets, as well as 0.724 accuracy and 0.883 AUC for the binary MMSE task on the replication AIBL dataset.
Project description:The SARS-CoV-2 pandemic highlighted the need for software tools that could facilitate patient triage regarding potential disease severity or even death. In this article, an ensemble of Machine Learning (ML) algorithms is evaluated in terms of predicting the severity of a patient's condition using plasma proteomics and clinical data as input. An overview of AI-based technical developments to support COVID-19 patient management is presented, outlining the landscape of relevant technical developments. Based on this review, an ensemble of ML algorithms that analyze clinical and biological data (i.e., plasma proteomics) of COVID-19 patients is designed and deployed to evaluate the potential use of AI for early COVID-19 patient triage. The proposed pipeline is evaluated using three publicly available datasets for training and testing. Three ML "tasks" are defined, and several algorithms are tested through a hyperparameter tuning method to identify the highest-performance models. As overfitting is one of the typical pitfalls for such approaches (mainly due to the size of the training/validation datasets), a variety of evaluation metrics are used to mitigate this risk. In the evaluation procedure, recall scores ranged from 0.6 to 0.74 and F1-scores from 0.62 to 0.75. The best performance is observed with Multi-Layer Perceptron (MLP) and Support Vector Machines (SVM) algorithms. Additionally, input data (proteomics and clinical data) were ranked based on corresponding Shapley additive explanation (SHAP) values and evaluated for their prognostic capacity and immuno-biological credence. This "interpretable" approach revealed that our ML models could discern critical COVID-19 cases predominantly based on patient age and plasma proteins related to B cell dysfunction, hyper-activation of inflammatory pathways like Toll-like receptors, and hypo-activation of developmental and immune pathways like SCF/c-Kit signaling.
Finally, the computational workflow presented here is corroborated in an independent dataset, which confirms both the superiority of MLP and the involvement of the abovementioned predictive biological pathways. Regarding limitations of the presented ML pipeline, the datasets used in this study contain fewer than 1000 observations and a significant number of input features, hence constituting a high-dimensional low-sample (HDLS) dataset, which could be sensitive to overfitting. An advantage of the proposed pipeline is that it combines biological data (plasma proteomics) with clinical-phenotypic data. Thus, in principle, the presented approach could enable patient triage in a timely fashion if used with already trained models. However, larger datasets and further systematic validation are needed to confirm the potential clinical value of this approach. The code is available on GitHub: https://github.com/inab-certh/Predicting-COVID-19-severity-through-interpretable-AI-analysis-of-plasma-proteomics.
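For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how recall and F1-score are computed for a binary severity label. The labels and predictions are hypothetical and are not drawn from the study's datasets; the linked repository, not this sketch, is the reference for the actual evaluation code.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1-score for one positive class, from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, 0.0
    return recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = severe case, 0 = non-severe case.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
recall, f1 = recall_and_f1(y_true, y_pred)
print(recall, f1)
```

Recall counts how many truly severe cases the model catches, which is the quantity of most interest for triage, while F1 also penalizes false alarms through precision.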
Project description:Therapeutic antibodies make up a rapidly growing segment of the biologics market. However, rational design of antibodies is hindered by reliance on experimental methods for determining antibody structures. Here, we present DeepAb, a deep learning method for predicting accurate antibody FV structures from sequence. We evaluate DeepAb on a set of structurally diverse, therapeutically relevant antibodies and find that our method consistently outperforms the leading alternatives. Previous deep learning methods have operated as "black boxes" and offered few insights into their predictions. By introducing a directly interpretable attention mechanism, we show our network attends to physically important residue pairs (e.g., proximal aromatics and key hydrogen bonding interactions). Finally, we present a novel mutant scoring metric derived from network confidence and show that for a particular antibody, all eight of the top-ranked mutations improve binding affinity. This model will be useful for a broad range of antibody prediction and design tasks.
Project description:Background: Lysine acetylation is a crucial type of protein post-translational modification that is involved in many important cellular processes and serious diseases. However, identifying acetylated sites through traditional experimental methods is time-consuming and laborious, which makes those methods unsuitable for identifying large numbers of acetylated sites quickly. Computational methods are therefore valuable for accelerating the discovery of lysine acetylated sites. Result: In this study, many biological characteristics of acetylated sites were investigated, such as the amino acid sequence around the acetylated sites, the physicochemical properties of the amino acids, and the transition probability of adjacent amino acids. A logistic regression method was then used to integrate this information into a novel lysine acetylation prediction system named LAceP. Compared with existing methods, LAceP outperforms most state-of-the-art methods. In particular, LAceP has a more balanced prediction capability for positive and negative datasets. Conclusion: LAceP can integrate different biological features to predict lysine acetylation with high accuracy. An online web server is freely available at http://www.scbit.org/iPTM/.
Project description:Background: Cardiometabolic multimorbidity (CM) has been found to be associated with higher mortality and functional limitations. However, few studies have investigated the longitudinal association between CM and disability in the Chinese population or whether this association varies by smoking status. Methods: The study included 16,754 participants from four waves (2011, 2013, 2015, and 2018) of the China Health and Retirement Longitudinal Study (CHARLS) (mean age: 59, female: 51%). CM was assessed at baseline and defined as having two or more of diabetes, stroke, or heart disease. Disability was repeatedly measured by summing the number of impaired activities of daily living (ADL) and instrumental activities of daily living (IADL) during the 7-year follow-up. A linear mixed-effects model was used to determine the association between CM and trajectories of disability and to assess whether smoking status modified this association. Results: Participants with CM at baseline had a faster progression of disability than those without CM (CM: β = 0.13, 95% CI: 0.05 to 0.21). Current smokers with CM developed disability faster than their counterparts (P for interaction with smoking = 0.011). In addition, there was a significant association between CM and the annual change in disability among current smokers (β = 0.34, 95% CI: 0.17 to 0.50), whereas no such association was observed among current non-smokers (β = 0.08, 95% CI: -0.02 to 0.17). Conclusion: CM was associated with a more rapid progression of disability. Notably, current smoking may amplify the adverse effects of CM on disability progression.
Project description:Early detection of lung cancer by screening has contributed to reducing lung cancer mortality. Identifying subjects at high risk of lung cancer is necessary to maximize the benefits and minimize the harms of lung cancer screening. In the present study, individual lung cancer risk in Korea was estimated using a risk prediction model. Participants who completed health examinations in 2009, based on the Korean National Health Insurance (KNHI) database (DB), were eligible for the present study. Risk scores were assigned based on the adjusted hazard ratio (HR), and the standardized points for each risk factor were calculated to be proportional to the β coefficients. Model discrimination was assessed using the concordance statistic (c-statistic), and calibration was assessed by plotting the mean predicted probability against the mean observed probability of lung cancer. Among candidate predictors, age, sex, smoking intensity, body mass index (BMI), presence of chronic obstructive pulmonary disease (COPD), pulmonary tuberculosis (TB), and type 2 diabetes mellitus (DM) were included in the final model. Our risk prediction model showed good discrimination (c-statistic, 0.810; 95% CI: 0.801-0.819). Model-predicted and actual lung cancer development agreed well in the calibration plot. Because it uses easily accessible and modifiable risk factors, this model can help individuals make decisions regarding lung cancer screening or lifestyle modification, including smoking cessation.
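The two quantitative steps named in this abstract, points proportional to the β coefficients and discrimination by c-statistic, can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient values and risk-factor names are hypothetical, scaling by a single reference coefficient is one common convention rather than the paper's confirmed procedure, and the c-statistic shown is the binary-outcome form that ignores follow-up time and censoring.

```python
def points_from_betas(betas, reference):
    """Integer points proportional to each coefficient, scaled by a reference coefficient."""
    return {name: round(beta / reference) for name, beta in betas.items()}


def c_statistic(risks, events):
    """Share of (event, non-event) pairs where the event subject has the higher risk; ties count 0.5."""
    concordant = 0.0
    pairs = 0
    for r_event, e_event in zip(risks, events):
        if e_event != 1:
            continue
        for r_other, e_other in zip(risks, events):
            if e_other != 0:
                continue
            pairs += 1
            if r_event > r_other:
                concordant += 1.0
            elif r_event == r_other:
                concordant += 0.5
    return concordant / pairs


# Hypothetical coefficients (log hazard ratios), for illustration only.
betas = {"age_per_10y": 1.0, "current_smoker": 0.5, "copd": 0.25}
points = points_from_betas(betas, reference=0.25)

# Hypothetical predicted risks and observed outcomes (1 = developed lung cancer).
risks = [0.9, 0.4, 0.35, 0.1]
events = [1, 0, 1, 0]
c_stat = c_statistic(risks, events)
print(points, c_stat)
```

A c-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which gives context for the 0.810 reported in the abstract.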