Dataset Information

In silico phenotyping via co-training for improved phenotype prediction from genotype.

ABSTRACT: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction. Here we present an approach for imputing missing disease phenotypes given the genotype of a patient. Our approach is based on co-training, which predicts the phenotype of unlabeled patients based on a second class of information, e.g. clinical health record information. Augmenting training datasets by this type of in silico phenotyping can lead to significant improvements in prediction accuracy. We demonstrate this on a dataset of patients with two diagnostic types of migraine, termed migraine with aura and migraine without aura, from the International Headache Genetics Consortium. Imputing missing disease phenotypes for patients via co-training leads to larger training datasets and improved prediction accuracy in phenotype prediction. The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/co-training.html
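The imputation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (their code is at the URL above). It is a one-pass simplification that assumes a nearest-centroid classifier as a stand-in for whatever models the paper uses, toy feature values, a hypothetical confidence threshold `min_margin`, and hypothetical labels "MA" and "MO" for migraine with and without aura. A classifier trained on the clinical view labels the unlabeled patients, and only confident predictions are added to the genotype training set.

```python
def euclid(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroids(X, y):
    """Mean feature vector per class label (nearest-centroid 'training')."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict(model, row):
    """Return (label, margin): nearest centroid and its gap to the runner-up."""
    ranked = sorted((euclid(row, c), label) for label, c in model.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def impute_phenotypes(clin_lab, y_lab, clin_unlab, min_margin=0.5):
    """Label unlabeled patients from the clinical view; keep confident ones."""
    model = centroids(clin_lab, y_lab)
    imputed = {}
    for idx, row in enumerate(clin_unlab):
        label, margin = predict(model, row)
        if margin >= min_margin:  # hypothetical confidence cut-off
            imputed[idx] = label
    return imputed


# Toy data (assumed, for illustration only): two views of the same patients.
clin_lab = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.0]]  # clinical view
geno_lab = [[0, 0, 1], [0, 1, 1], [2, 1, 0], [2, 2, 0]]      # genotype view
y_lab = ["MA", "MA", "MO", "MO"]

clin_unlab = [[0.1, 0.1], [0.95, 0.95], [0.5, 0.5]]  # phenotype unknown
geno_unlab = [[0, 0, 2], [2, 2, 1], [1, 1, 1]]

# Step 1: in silico phenotyping from the clinical view.
imputed = impute_phenotypes(clin_lab, y_lab, clin_unlab)

# Step 2: augment the genotype training set with the imputed patients.
kept = sorted(imputed)
X_aug = geno_lab + [geno_unlab[i] for i in kept]
y_aug = y_lab + [imputed[i] for i in kept]

# Step 3: train the genotype classifier on the larger set and predict.
geno_model = centroids(X_aug, y_aug)
new_label, _ = predict(geno_model, [0, 1, 2])
print(imputed, new_label)
```

In this toy run the third unlabeled patient sits between the two clinical centroids and is left unlabeled, which is the point of the threshold: adding uncertain imputed labels would add noise to the genotype training set rather than useful signal.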

SUBMITTER: Roqueiro D

PROVIDER: S-EPMC4765855 | biostudies-literature | 2015 Jun

REPOSITORIES: biostudies-literature


Publications

In silico phenotyping via co-training for improved phenotype prediction from genotype.

Roqueiro D, Witteveen MJ, Anttila V, Terwindt GM, van den Maagdenberg AM, Borgwardt K

Bioinformatics (Oxford, England), 2015-06-01, issue 12

Motivation: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction. Results: Here we present an approach for ...

PMID: 26072497

Similar Datasets

Project description: Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially. IMPORTANCE: Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.

Project description: The increasing integration of genomics into routine clinical diagnostics requires reliable computational tools to identify determinants of antimicrobial resistance (AMR) from whole-genome sequencing data. Here, we developed PorinPredict, a bioinformatic tool that predicts defects of the Pseudomonas aeruginosa outer membrane porin OprD, which are strongly associated with reduced carbapenem susceptibility. PorinPredict relies on a database of intact OprD variants and reports inactivating mutations in the coding or promoter region. PorinPredict was validated against 987 carbapenemase-negative P. aeruginosa genomes, of which OprD loss was predicted for 454 out of 522 (87.0%) meropenem-nonsusceptible and 46 out of 465 (9.9%) meropenem-susceptible isolates. OprD loss was also found to be common among carbapenemase-producing isolates, resulting in even further increased MICs. Chromosomal mutations in quinolone resistance-determining regions and OprD loss commonly co-occurred, likely reflecting the restricted use of carbapenems for multidrug-resistant infections as recommended in antimicrobial stewardship programs. In combination with available AMR gene detection tools, PorinPredict provides a robust and standardized approach to link P. aeruginosa phenotypes to genotypes. IMPORTANCE: Pseudomonas aeruginosa is a major cause of multidrug-resistant nosocomial infections. The emergence and spread of clones exhibiting resistance to carbapenems, a class of critical last-line antibiotics, is therefore closely monitored. Carbapenem resistance is frequently mediated by chromosomal mutations that lead to a defective outer membrane porin OprD. Here, we determined the genetic diversity of OprD variants across the P. aeruginosa population and developed PorinPredict, a bioinformatic tool that enables the prediction of OprD loss from whole-genome sequencing data. We show a high correlation between predicted OprD loss and meropenem nonsusceptibility irrespective of the presence of carbapenemases, which are a second widespread determinant of carbapenem resistance. Isolates with resistance determinants to other antibiotics were disproportionally affected by OprD loss, possibly due to an increased exposure to carbapenems. Integration of PorinPredict into genomic surveillance platforms will facilitate a better understanding of the clinical impact of OprD modifications and transmission dynamics of resistant clones.
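The validation counts quoted for PorinPredict can be checked with a few lines of arithmetic. The sketch below recomputes the two reported percentages from the stated counts. It additionally derives a specificity and a positive predictive value, which are not reported in the description and rest on the assumption that predicted OprD loss is read as a binary test for meropenem nonsusceptibility among the carbapenemase-negative genomes.

```python
# Counts as stated in the description (carbapenemase-negative genomes).
nonsusceptible_total = 522
nonsusceptible_oprd_loss = 454
susceptible_total = 465
susceptible_oprd_loss = 46


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# The two groups should account for all validated genomes.
total_genomes = nonsusceptible_total + susceptible_total

# Reported figures, recomputed from the counts.
sensitivity = pct(nonsusceptible_oprd_loss, nonsusceptible_total)
false_positive_rate = pct(susceptible_oprd_loss, susceptible_total)

# Derived figures (assumption: not stated in the source description).
specificity = pct(susceptible_total - susceptible_oprd_loss, susceptible_total)
ppv = pct(nonsusceptible_oprd_loss,
          nonsusceptible_oprd_loss + susceptible_oprd_loss)

print(total_genomes, sensitivity, false_positive_rate, specificity, ppv)
```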

