Project description:Objective: Diabetic complications place a tremendous burden on diabetic patients, but the problem of predicting them remains unresolved. Our aim is to explore the relationship between hemoglobin A1C (HbA1c), insulin (INS), and glucose (GLU) and diabetic complications in combination with individual factors, and to effectively predict multiple complications of diabetes. Methods: This was a real-world study. Data were collected from 40,913 participants with an average age of 48 years from the Department of Endocrinology of Ruijin Hospital in Shanghai. We proposed deep personal multitask prediction of diabetes complication with attentive interactions (DPMP-DC) to predict five complications of diabetes: diabetic retinopathy, diabetic nephropathy, diabetic peripheral neuropathy, diabetic foot disease, and diabetic cardiovascular disease. Results: Our model has an accuracy of 88.01% for diabetic retinopathy, 89.58% for diabetic nephropathy, 85.77% for diabetic neuropathy, 80.56% for diabetic foot disease, and 82.48% for diabetic cardiovascular disease. The multitasking accuracy across multiple complications is 84.67%, and the missed diagnosis rate is 9.07%. Conclusion: We put forward, for the first time in diabetic complications, a method of interactive integration with patients' individual factors, which reflects the differences between individuals. Our multitask model using the hard sharing mechanism provides better prediction than prior single prediction models.
Project description:Fusarium head blight (FHB) incited by Fusarium graminearum Schwabe is a devastating disease of barley and other cereal crops worldwide. Fusarium head blight is associated with trichothecene mycotoxins such as deoxynivalenol (DON), and contaminated grains are unfit for the malting or animal feed industries. While genetically resistant cultivars offer the best economic and environmentally responsible means to mitigate disease, parent lines with adequate resistance are limited in barley. Resistance breeding based upon quantitative genetic gains has been slow to date, due to the intensive labour requirements of disease nurseries. The development of high-throughput genome-wide molecular markers allows their application in genomic prediction models. A diverse genomic panel consisting of 400 two-row spring barley lines was assembled to focus on Canadian barley breeding programs. The panel was evaluated for FHB and DON content in three environments and over two years. Moreover, it was genotyped using an Illumina Infinium HTS iSelect custom beadchip array of single nucleotide polymorphic molecular markers (50K SNP), where over 23K molecular markers were polymorphic. Genomic prediction has been successfully demonstrated for reducing FHB and DON content in cereals using various statistically-based models with different underlying assumptions. Herein, we studied an alternative method based on machine learning and compared it with a statistical approach. Two encoding techniques were utilized (categorical or Hardy-Weinberg frequencies), followed by selecting essential genomic markers for phenotype prediction. Subsequently, we applied a transformer-based deep learning algorithm to predict FHB and DON. Apart from the transformer method, we also implemented a Residual Fully Connected Neural Network (RFCNN). Pearson correlation coefficients were calculated to compare true vs. predicted outputs. Under most model scenarios, the use of all markers vs. selected markers marginally improved prediction performance, except for the RFCNN method for FHB (27.6%). Hardy-Weinberg encoding generally improved correlation for FHB (6.9%) and DON (9.6%) for the transformer. This study suggests the potential of the transformer-based method for genomic prediction of complex traits such as FHB or DON, as it performed as well as or better than existing machine learning and statistical methods. Genomic prediction in barley for Fusarium head blight and deoxynivalenol content using a custom Illumina Infinium array (BarleySNP50-JHI) (www.illumina.com). Sample types included leaves from 400 barley genotypes mostly of Canadian origin. This series includes 400 genotypes assayed on an Illumina Infinium HTS platform 50K BeadChip.
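The abstract above scores its models by the Pearson correlation between true and predicted trait values. A minimal sketch of that metric follows, using only the Python standard library; the function name `pearson_r` and the sample values are illustrative assumptions, not the study's code or data.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances, all left unnormalised (the 1/n factors cancel).
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)

# Hypothetical phenotype scores, not data from the study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
r = pearson_r(observed, predicted)
```

Because the coefficient is invariant to shifting and positive scaling, it rewards a model that ranks lines correctly even if its predictions are offset from the observed values.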
Project description:Health care is one of the most exciting frontiers in data mining and machine learning. Successful adoption of electronic health records (EHRs) created an explosion in digital clinical data available for analysis, but progress in machine learning for healthcare research has been difficult to measure because of the absence of publicly available benchmark data sets. To address this problem, we propose four clinical prediction benchmarks using data derived from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database. These tasks cover a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification. We propose strong linear and neural baselines for all four tasks and evaluate the effect of deep supervision, multitask training and data-specific architectural modifications on the performance of neural models.
Project description:Machine learning approaches have had tremendous success in various disciplines. However, such success highly depends on the size and quality of datasets. Scientific datasets are often small and difficult to collect. Currently, improving machine learning performance for small scientific datasets remains a major challenge in many academic fields, such as bioinformatics or medical science. Gradient boosting decision tree (GBDT) is typically optimal for small datasets, while deep learning often performs better for large datasets. This work reports a boosting tree-assisted multitask deep learning (BTAMDL) architecture that integrates GBDT and multitask deep learning (MDL) to achieve near-optimal predictions for small datasets when there exists a large dataset that is well correlated to the small datasets. Two BTAMDL models are constructed, one utilizing purely MDL output as GBDT input while the other admitting additional features in GBDT input. The proposed BTAMDL models are validated on four categories of datasets, including toxicity, partition coefficient, solubility, and solvation. It is found that the proposed BTAMDL models outperform the current state-of-the-art methods in various applications involving small datasets.
Project description:Infectious disease occurs when a person is infected by a pathogen from another person or an animal. It is a problem that causes harm at both individual and macro scales. The Korea Center for Disease Control (KCDC) operates a surveillance system to minimize infectious disease contagions. However, in this system, it is difficult to act immediately against infectious disease because of missing and delayed reports. Moreover, infectious disease trends are not known, which makes prediction difficult. This study predicts infectious diseases by optimizing the parameters of deep learning algorithms while considering big data, including social media data. The performance of the deep neural network (DNN) and long short-term memory (LSTM) learning models was compared with that of the autoregressive integrated moving average (ARIMA) model when predicting three infectious diseases one week into the future. The results show that the DNN and LSTM models perform better than ARIMA. When predicting chickenpox, the top-10 DNN and LSTM models improved average performance by 24% and 19%, respectively. The DNN model performed stably, and the LSTM model was more accurate when infectious disease was spreading. We believe that this study's models can help eliminate reporting delays in existing surveillance systems and, therefore, minimize costs to society.
Project description:Different types of J-proteins perform distinct functions in chaperone processes and disease development. Accurate identification of J-protein types will provide significant clues to the mechanism of J-proteins and contribute to developing drugs for diseases. In this study, an ensemble predictor called JPPRED for J-protein prediction is proposed with hybrid features, including split amino acid composition (SAAC), pseudo amino acid composition (PseAAC), and position specific scoring matrix (PSSM). To deal with the imbalanced benchmark dataset, the synthetic minority oversampling technique (SMOTE) and an undersampling technique are applied. The average sensitivity of JPPRED based on the above-mentioned individual feature spaces lies in the range of 0.744-0.851, indicating the discriminative power of these features. In addition, JPPRED yields the highest average sensitivity of 0.875 using the hybrid feature spaces of SAAC, PseAAC, and PSSM. Compared to individual base classifiers, JPPRED obtains more balanced and better performance for each type of J-protein. To evaluate the prediction performance objectively, JPPRED is compared with a previous study. Encouragingly, JPPRED obtains balanced performance for each type of J-protein, which is significantly superior to that of the existing method. It is anticipated that JPPRED can be a potential candidate for J-protein prediction.
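The abstract above names SMOTE as one way to rebalance the benchmark dataset. The sketch below illustrates the core SMOTE idea in plain Python: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the parameter defaults, and the toy points are assumptions for illustration; this is not the JPPRED implementation.

```python
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(x, minority[j])))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy 2-D minority class, hypothetical values only.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_points, n_synthetic=6)
```

Since every synthetic point is a convex combination of two real minority samples, the generated data stays inside the region the minority class already occupies.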
Project description:Toxicity prediction using quantitative structure-activity relationship has achieved significant progress in recent years. However, most existing machine learning methods in toxicity prediction utilize only one type of feature representation and one type of neural network, which essentially restricts their performance. Moreover, methods that use more than one type of feature representation struggle with the aggregation of information captured within the features since they use predetermined aggregation formulas. In this paper, we propose a deep learning framework for quantitative toxicity prediction using five individual base deep learning models and their own base feature representations. We then propose to adopt a meta ensemble approach using another separate deep learning model to perform aggregation of the outputs of the individual base deep learning models. We train our deep learning models in a weighted multitask fashion combining four quantitative toxicity data sets of LD50, IGC50, LC50, and LC50-DM and minimizing the root-mean-square errors. Compared to the current state-of-the-art toxicity prediction method TopTox on LD50, IGC50, and LC50-DM, that is, three out of four data sets, our method, respectively, obtains 5.46, 16.67, and 6.34% better root-mean-square errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21, 7.36, and 2.54% better coefficients of determination. We named our method QuantitativeTox, and our implementation is available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox.
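The abstract above describes training in a weighted multitask fashion while minimizing root-mean-square errors across several toxicity data sets. One plausible reading, sketched here under stated assumptions, is a weighted sum of per-task RMSE values. The function names, the equal weights, and the numbers are hypothetical; the actual weighting scheme used by QuantitativeTox is not given in the abstract.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error for one task."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def weighted_multitask_rmse(task_batches, task_weights):
    """Weighted sum of per-task RMSE values (assumed form of the objective)."""
    total = 0.0
    for task, (y_true, y_pred) in task_batches.items():
        total += task_weights[task] * rmse(y_true, y_pred)
    return total

# Hypothetical targets and predictions for two of the four tasks.
batches = {
    "LD50": ([2.0, 3.0], [2.0, 5.0]),   # squared errors 0 and 4
    "IGC50": ([1.0, 1.0], [1.0, 1.0]),  # perfect predictions
}
weights = {"LD50": 0.5, "IGC50": 0.5}   # assumed equal weighting
loss = weighted_multitask_rmse(batches, weights)
```

Weighting per task lets a small data set contribute to the shared objective without being swamped by a larger one.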
Project description:Autism spectrum disorder and intellectual disability are comorbid neurodevelopmental disorders with complex genetic architectures. Despite large-scale sequencing studies, only a fraction of the risk genes was identified for both. We present a network-based gene risk prioritization algorithm, DeepND, that performs cross-disorder analysis to improve prediction by exploiting the comorbidity of autism spectrum disorder (ASD) and intellectual disability (ID) via multitask learning. Our model leverages information from human brain gene co-expression networks using graph convolutional networks, learning which spatiotemporal neurodevelopmental windows are important for disorder etiologies and improving the state-of-the-art prediction in single- and cross-disorder settings. DeepND identifies the prefrontal and motor-somatosensory cortex (PFC-MFC) brain region and periods from early- to mid-fetal and from early childhood to young adulthood as the highest neurodevelopmental risk windows for ASD and ID. We investigate ASD- and ID-associated copy-number variation (CNV) regions and report our findings for several susceptibility gene candidates. DeepND can be generalized to analyze any combinations of comorbid disorders.
Project description:Image-based plant phenotyping has been steadily growing and this has steeply increased the need for more efficient image analysis techniques capable of evaluating multiple plant traits. Deep learning has shown its potential in a multitude of visual tasks in plant phenotyping, such as segmentation and counting. Here, we show how different phenotyping traits can be extracted simultaneously from plant images, using multitask learning (MTL). MTL leverages information contained in the training images of related tasks to improve overall generalization and learns models with fewer labels. We present a multitask deep learning framework for plant phenotyping, able to infer three traits simultaneously: (i) leaf count, (ii) projected leaf area (PLA), and (iii) genotype classification. We adopted a modified pretrained ResNet50 as a feature extractor, trained end-to-end to predict multiple traits. We also leverage MTL to show that through learning from more easily obtainable annotations (such as PLA and genotype) we can predict a better leaf count (harder to obtain annotation). We evaluate our findings on several publicly available datasets of top-view images of Arabidopsis thaliana. Experimental results show that the proposed MTL method improves the leaf count mean squared error (MSE) by more than 40%, compared to a single task network on the same dataset. We also show that our MTL framework can be trained with up to 75% fewer leaf count annotations without significantly impacting performance, whereas a single task model shows a steady decline when fewer annotations are available. Code available at https://github.com/andobrescu/Multi_task_plant_phenotyping.
Project description:Background: We present a method utilizing the Healthcare Cost and Utilization Project (HCUP) dataset for predicting the disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication, and decision support systems in healthcare. Methods: We employed the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting, and RF in predicting the risk of eight chronic diseases. Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging, and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
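The abstract above relies on repeated random sub-sampling so that every sub-sample is fully balanced. A minimal sketch of one way to build such sub-samples is shown below: each keeps all minority-class indices and draws an equal number of majority-class indices at random. The function name `balanced_subsamples`, the binary 0/1 labels, and the toy label vector are assumptions; the study's exact sampling procedure is not specified in the abstract.

```python
import random

def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return index lists, each with equal counts of the two classes."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")
    subsamples = []
    for _ in range(n_subsamples):
        # Draw, without replacement, as many majority cases as minority cases.
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples

# Toy labels: 2 positive cases among 10 records (hypothetical).
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
subsets = balanced_subsamples(labels, n_subsamples=3)
```

In an ensemble setting, one classifier would typically be trained per index list and their predictions combined, for example by voting; that combination step is left out of this sketch.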