Dataset Information

Eye-color and Type-2 diabetes phenotype prediction from genotype data using deep learning methods.

ABSTRACT:

Background

Genotype-phenotype prediction is of great importance in genetics because it can help identify the genetic mutations that cause variation among humans. Approaches for finding such associations fall broadly into two classes: statistical techniques and machine learning. Statistical techniques are well suited to finding the actual SNPs that cause variation, whereas machine learning techniques are well suited to classifying people into categories. In this article, we examined the eye-color and Type-2 diabetes phenotypes. The proposed technique is a hybrid approach that combines elements of statistical techniques with machine learning.

Results

The main dataset for the eye-color phenotype consists of 806 people: 404 with blue-green eyes and 402 with brown eyes. After preprocessing, we generated 8 datasets containing different numbers of SNPs, using the mutation difference and thresholding at each individual SNP. We calculated three types of mutation at each SNP: no mutation, partial mutation, and full mutation. The data was then transformed for machine learning algorithms. We used nine classifiers (RandomForest, Extreme Gradient Boosting, ANN, LSTM, GRU, BiLSTM, 1DCNN, ensembles of ANN, and ensembles of LSTM), which gave best accuracies of 0.91, 0.9286, 0.945, 0.94, 0.94, 0.92, 0.95, and 0.96. Stacked ensembles of LSTM outperformed the other algorithms for 1560 SNPs, with an overall accuracy of 0.96, AUC = 0.98 for brown eyes, and AUC = 0.97 for blue-green eyes. The main dataset for Type-2 diabetes consists of 107 people, of whom 30 are classified as cases and 74 as controls. We used different linear thresholds to find the optimal number of SNPs for classification. The final model gave an accuracy of 0.97.
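The per-SNP mutation encoding and threshold-based SNP selection described above can be illustrated with a minimal sketch. The abstract does not define "mutation difference" precisely, so this sketch assumes it is the absolute difference in mean mutation count between the two classes. The function names, reference alleles, genotypes, and thresholds are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def encode_genotype(genotype: str, ref: str) -> int:
    """Count non-reference alleles: 0 = no mutation, 1 = partial, 2 = full."""
    return sum(1 for allele in genotype if allele != ref)


def mutation_difference(encoded: Sequence[Sequence[int]],
                        labels: Sequence[int]) -> List[float]:
    """Per-SNP absolute difference in mean mutation count between classes 0 and 1.

    Assumed definition (not from the paper); both classes must be non-empty.
    """
    diffs = []
    for j in range(len(encoded[0])):
        group0 = [row[j] for row, y in zip(encoded, labels) if y == 0]
        group1 = [row[j] for row, y in zip(encoded, labels) if y == 1]
        mean0 = sum(group0) / len(group0)
        mean1 = sum(group1) / len(group1)
        diffs.append(abs(mean1 - mean0))
    return diffs


def select_snps(diffs: Sequence[float], threshold: float) -> List[int]:
    """Keep the indices of SNPs whose class difference reaches the threshold."""
    return [j for j, d in enumerate(diffs) if d >= threshold]


# Toy data: 4 people, 3 SNPs (illustrative values only).
ref_alleles = ["A", "C", "G"]
genotypes = [
    ["AA", "CC", "GG"],  # label 0
    ["AG", "CC", "GT"],  # label 0
    ["GG", "CC", "GG"],  # label 1
    ["GG", "CT", "GT"],  # label 1
]
labels = [0, 0, 1, 1]

# Encode every genotype as 0, 1, or 2 mutated alleles.
encoded = [[encode_genotype(g, r) for g, r in zip(person, ref_alleles)]
           for person in genotypes]
diffs = mutation_difference(encoded, labels)

# Different thresholds yield datasets with different numbers of SNPs.
strict = select_snps(diffs, 1.0)
loose = select_snps(diffs, 0.5)
print(encoded, diffs, strict, loose)
```

Raising the threshold keeps fewer, more discriminative SNPs, which is one way to obtain several datasets of different sizes from the same genotype matrix. The selected columns of the encoded matrix would then be passed to a classifier.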

Conclusion

Genotype-phenotype predictions are very useful, especially in forensics. They can help identify associations between SNP variants and traits or diseases. Given more datasets, the prediction performance of machine learning models can be improved. Moreover, the non-linearity of the machine learning model and the combination of SNP mutations during training improve prediction. We considered binary classification problems, but the proposed approach can be extended to multi-class classification.

SUBMITTER: Muneeb M

PROVIDER: S-EPMC8056510 | biostudies-literature | 2021 Apr

REPOSITORIES: biostudies-literature

Publications

Eye-color and Type-2 diabetes phenotype prediction from genotype data using deep learning methods.

Muneeb M, Henschel A

BMC Bioinformatics, 2021 Apr 19, 1

PMID: 33874881

Similar Datasets

Project description: BACKGROUND: The ability to confidently predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. Yet, the goal of developing actionable, robust, and reproducible predictive signatures of phenotypes such as clinical outcome has not been attained in almost any disease area. Here, we report a comprehensive analysis spanning prediction tasks from ulcerative colitis, atopic dermatitis, and diabetes to many cancer subtypes, for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks. We systematically investigate the influence of gene subsets, normalization methods and prediction algorithms. Crucially, we also explore the novel use of deep representation learning methods on large transcriptomics compendia, such as GTEx and TCGA, to boost the performance of state-of-the-art methods. The resources and findings in this work should serve as both an up-to-date reference on attainable performance, and as a benchmarking resource for further research. RESULTS: Approaches that combine large numbers of genes outperformed single gene methods consistently and with a significant margin, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that using l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provides the best predictive analyses overall. CONCLUSIONS: Transcriptomics-based phenotype prediction benefits from proper normalization techniques and state-of-the-art regularized regression approaches. In our view, breakthrough performance is likely contingent on factors which are independent of normalization and general modeling techniques; these factors might include reduction of systematic errors in sequencing data, incorporation of other data types such as single-cell sequencing and proteomics, and improved use of prior knowledge.

Project description: Background: Genotypes are strongly associated with disease phenotypes, particularly in brain disorders. However, the molecular and cellular mechanisms behind this association remain elusive. With emerging multimodal data for these mechanisms, machine learning methods can be applied for phenotype prediction at different scales, but due to the black-box nature of machine learning, integrating these modalities and interpreting biological mechanisms can be challenging. Additionally, the partial availability of these multimodal data presents a challenge in developing these predictive models. Method: To address these challenges, we developed DeepGAMI, an interpretable neural network model to improve genotype-phenotype prediction from multimodal data. DeepGAMI leverages functional genomic information, such as eQTLs and gene regulation, to guide neural network connections. Additionally, it includes an auxiliary learning layer for cross-modal imputation allowing the imputation of latent features of missing modalities and thus predicting phenotypes from a single modality. Finally, DeepGAMI uses integrated gradient to prioritize multimodal features for various phenotypes. Results: We applied DeepGAMI to several multimodal datasets including genotype and bulk and cell-type gene expression data in brain diseases, and gene expression and electrophysiology data of mouse neuronal cells. Using cross-validation and independent validation, DeepGAMI outperformed existing methods for classifying disease types, and cellular and clinical phenotypes, even using single modalities (e.g., AUC score of 0.79 for Schizophrenia and 0.73 for cognitive impairment in Alzheimer's disease). Conclusion: We demonstrated that DeepGAMI improves phenotype prediction and prioritizes phenotypic features and networks in multiple multimodal datasets in complex brains and brain diseases. Also, it prioritized disease-associated variants, genes, and regulatory networks linked to different phenotypes, providing novel insights into the interpretation of gene regulatory mechanisms. DeepGAMI is open-source and available for general use.

Project description: Background: B-cell epitopes play important roles in vaccine design, clinical diagnosis, and antibody production. Although some models have been developed to predict linear or conformational B-cell epitopes, their performance is still unsatisfactory. Hundreds of thousands of linear B-cell epitope records have accumulated in the Immune Epitope Database (IEDB). These data can be explored using deep learning methods in order to create better predictive models for linear B-cell epitopes. Results: After data cleaning, we obtained 240,563 peptide samples with experimental evidence from the IEDB database, including 25,884 linear B-cell epitopes and 214,679 non-epitopes. Based on the peptide center, we adapted each peptide to the same length by trimming or extending. A random portion of the data, with the same number of epitopes and non-epitopes, was set aside as the test dataset. Then the same number of epitopes and non-epitopes were randomly selected from the remaining data to build a classifier with a feedforward deep neural network. We built eleven classifiers to form an ensemble prediction model. The model reports a peptide as an epitope if it was classified as an epitope by all eleven classifiers. Then we used the test dataset to evaluate the performance of the model, using the area under the receiver operating characteristic (ROC) curve (AUC) as an indicator. We established 40 models to predict linear B-cell epitopes of lengths from 11 to 50 separately, and found that the AUC value increased with the length and tended to be stable when the length was 38. Repeated results showed that the models constructed by this method were robust. Tested on our own and two public test datasets, our models outperformed the current major models available. Conclusions: We applied the feedforward deep neural network to the large amount of linear B-cell epitope data with experimental evidence in the IEDB database, and constructed ensemble prediction models with better performance than the current major models available. We named the models DLBEpitope and provided web services using the models at http://ccb1.bmi.ac.cn:81/dlbepitope/.