Dataset Information

Identifying genetic determinants of complex phenotypes from whole genome sequence data.

ABSTRACT:

BACKGROUND: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known.

RESULTS: To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases the sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than random forest, its specificity was never below 50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.

CONCLUSIONS: Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.
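
The abstract does not say how the chunking layer works. The sketch below is only one plausible reading, under stated assumptions: the alignment is split into fixed-size blocks of sites, each block is scored on its own, and the best sites of each block are pooled. The per-site score is a plain information-gain measure swapped in to keep the example short; it is not the authors' AB or RRF. All function names, the chunk size, and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction in `labels` after grouping by the residue at one site."""
    groups = {}
    for residue, label in zip(column, labels):
        groups.setdefault(residue, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def chunked_site_ranking(alignment, labels, chunk_size, top_per_chunk=1):
    """Score sites chunk by chunk and pool the best sites of every chunk.

    Assumption: a "chunk" is a contiguous block of alignment columns.
    Returns (site_index, score) pairs sorted by decreasing score.
    """
    n_sites = len(alignment[0])
    kept = []
    for start in range(0, n_sites, chunk_size):
        stop = min(start + chunk_size, n_sites)
        scored = [
            (site, information_gain([seq[site] for seq in alignment], labels))
            for site in range(start, stop)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        kept.extend(scored[:top_per_chunk])
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): 6 aligned sequences of 8 sites; site 5 tracks the phenotype.
alignment = [
    "MKTAYLAG",
    "MKTAYLAG",
    "MKSAYLAG",
    "MKTAYIAG",
    "MKTAYIAG",
    "MKSAYIAG",
]
labels = [0, 0, 0, 1, 1, 1]

ranking = chunked_site_ranking(alignment, labels, chunk_size=4)
best_site, best_score = ranking[0]
print(best_site, round(best_score, 3))
```

Because every chunk is scored independently, the chunks could be handed to separate workers, which is one way a chunking layer might shorten runtimes; whether the published implementation does this is not stated here.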

SUBMITTER: Long GS

PROVIDER: S-EPMC6558885 | biostudies-other | 2019 Jun

REPOSITORIES: biostudies-other

Publications

Identifying genetic determinants of complex phenotypes from whole genome sequence data.

Long GS, Hussen M, Dench J, Aris-Brosou S

BMC Genomics, 2019-06-10, issue 1

PMID: 31182025

Similar Datasets

Project description:

Background: Quantitative genetic studies suggest the existence of variation at the genome level that affects the ability of cattle to resist parasitic diseases. The objective of the current study was to identify regions of the bovine genome that are associated with resistance to endo-parasites.

Methods: Individual cattle records were available for Fasciola hepatica-damaged liver from 18 abattoirs. Deregressed estimated breeding values (EBV) for F. hepatica-damaged liver were generated for genotyped animals with a record for F. hepatica-damaged liver and for genotyped sires with at least one progeny record for F. hepatica-damaged liver; 3702 animals were available. In addition, individual cow records for antibody response to F. hepatica on 6388 genotyped dairy cows, antibody response to Ostertagia ostertagi on 8334 genotyped dairy cows and antibody response to Neospora caninum on 4597 genotyped dairy cows were adjusted for non-genetic effects. Genotypes were imputed to whole-sequence; after edits, 14,190,141 single nucleotide polymorphisms (SNPs) and 16,603,644 SNPs were available for cattle with deregressed EBV for F. hepatica-damaged liver and cows with an antibody response to a parasitic disease, respectively. Association analyses were undertaken using linear regression on one SNP at a time, in which a genomic relationship matrix accounted for the relationships between animals.

Results: Genomic regions for F. hepatica-damaged liver were located on Bos taurus autosomes (BTA) 1, 8, 11, 16, 17 and 18; each region included at least one SNP with a p value lower than 10^-6. Five SNPs were identified as significant (q value < 0.05) for antibody response to N. caninum and were located on BTA21 or 25. For antibody response to F. hepatica and O. ostertagi, six and nine quantitative trait loci (QTL) regions that included at least one SNP with a p value lower than 10^-6 were identified, respectively. Gene set enrichment analysis revealed a significant association between functional annotations related to the olfactory system and QTL that were suggestively associated with endo-parasite phenotypes.

Conclusions: A number of novel genomic regions were suggestively associated with endo-parasite phenotypes across the bovine genome, and two genomic regions on BTA21 and 25 were associated with antibody response to N. caninum.
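
To make "linear regression on one SNP at a time" concrete, here is a minimal sketch that regresses a phenotype on allele dosage (0, 1 or 2) for each SNP and reports the slope and its t statistic. It is a deliberate simplification: it uses ordinary least squares and leaves out the genomic relationship matrix that the study used to account for relatedness. The function names, SNP labels, and data are hypothetical.

```python
from math import sqrt


def single_snp_regression(genotypes, phenotypes):
    """Ordinary least squares of phenotype on allele dosage for one SNP.

    Returns (slope, t_statistic), or None when the test is undefined
    (monomorphic SNP, or a perfect fit with zero residual variance).
    Simplification: relatedness between animals is ignored.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return None  # every animal has the same genotype
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, phenotypes))
    if rss == 0:
        return None  # standard error would be zero
    standard_error = sqrt(rss / (n - 2) / sxx)
    return slope, slope / standard_error


def scan_snps(snp_table, phenotypes):
    """Run the single-SNP test for every SNP; skip those with no result."""
    results = {}
    for name, genotypes in snp_table.items():
        fit = single_snp_regression(genotypes, phenotypes)
        if fit is not None:
            results[name] = fit
    return results


# Toy data (made up): 6 animals, 3 SNPs coded as allele dosage.
phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
snp_table = {
    "snp_a": [0, 0, 1, 1, 2, 2],  # tracks the phenotype closely
    "snp_b": [1, 0, 2, 1, 0, 2],  # weakly related
    "snp_c": [1, 1, 1, 1, 1, 1],  # monomorphic, skipped
}

results = scan_snps(snp_table, phenotypes)
top_snp = max(results, key=lambda name: abs(results[name][1]))
print(top_snp, results[top_snp])
```

In a real scan the t statistic would be converted to a p value and corrected for multiple testing (the study reports q values), which this sketch omits.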

Project description: The determination of the relationship between a pair of individuals is a fundamental application of genetics. Previously, we and others have demonstrated that identity-by-descent (IBD) information generated from high-density single-nucleotide polymorphism (SNP) data can greatly improve the power and accuracy of genetic relationship detection. Whole-genome sequencing (WGS) marks the final step in increasing genetic marker density by assaying all single-nucleotide variants (SNVs), and thus has the potential to further improve relationship detection by enabling more accurate detection of IBD segments and more precise resolution of IBD segment boundaries. However, WGS introduces new complexities that must be addressed in order to achieve these improvements in relationship detection. To evaluate these complexities, we estimated genetic relationships from WGS data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. We identified several genomic regions with excess pairwise IBD in both the pedigree and control datasets using three established IBD methods: GERMLINE, fastIBD, and ISCA. These spurious IBD segments produced a 10-fold increase in the rate of detected false-positive relationships among controls compared to high-density microarray datasets. To address this issue, we developed a new method to identify and mask genomic regions with excess IBD. This method, implemented in ERSA 2.0, fully resolved the inflated cryptic relationship detection rates while improving relationship estimation accuracy. ERSA 2.0 detected all 1st through 6th degree relationships, and 55% of 9th through 11th degree relationships in the 30 families. We estimate that WGS data provides a 5% to 15% increase in relationship detection power relative to high-density microarray data for distant relationships. Our results identify regions of the genome that are highly problematic for IBD mapping and introduce new software to accurately detect 1st through 9th degree relationships from whole-genome sequence data.

OmicsDI is part of the ELIXIR infrastructure