Project description:Background: Single-step genomic predictions obtained from a breeding value model require calculating the inverse of the genomic relationship matrix [Formula: see text]. The Algorithm for Proven and Young (APY) creates a sparse representation of [Formula: see text] with a low computational cost. APY consists of selecting a group of core animals and expressing the breeding values of the remaining animals as a linear combination of those from the core animals plus an error term. The objectives of this study were to: (1) extend APY to marker effects models; (2) derive equations for marker effect estimates when APY is used for breeding value models; and (3) show the implication of selecting a specific group of core animals in terms of a marker effects model. Results: We derived a family of marker effects models called APY-SNP-BLUP. It differs from the classic marker effects model in that the row space of the genotype matrix is reduced and an error term is fitted for non-core animals. We derived formulas for marker effect estimates that take this error term into account. The prediction error variance (PEV) of the marker effect estimates depends on the PEV for core animals but not directly on the PEV of the non-core animals. We extended the APY-SNP-BLUP to include a residual polygenic effect and accommodate non-genotyped animals. We show that selecting a specific group of core animals is equivalent to selecting a subspace of the row space of the genotype matrix. As the number of core animals increases, subspaces corresponding to different sets of core animals tend to overlap, showing that random selection of core animals is algebraically justified. Conclusions: The APY-(ss)GBLUP models can be expressed in terms of marker effect models. When the number of core animals is equal to the rank of the genotype matrix, APY-SNP-BLUP is identical to the classic marker effects model.
If the number of core animals is less than the rank of the genotype matrix, genotypes for non-core animals are imputed as a linear combination of the genotypes of the core animals. For estimating SNP effects, only relationships and estimated breeding values for core animals are needed.
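For reference, the core/non-core decomposition described above is usually written as below. This is the standard APY formulation from the wider literature rather than an equation taken from this abstract; the subscripts c (core) and n (non-core) and the symbols P_nc and M_nn are notation assumed here.

```latex
\begin{aligned}
\mathbf{u}_n &= \mathbf{P}_{nc}\,\mathbf{u}_c + \boldsymbol{\varepsilon},
  \qquad \mathbf{P}_{nc} = \mathbf{G}_{nc}\mathbf{G}_{cc}^{-1}, \\
\mathbf{G}_{\mathrm{APY}}^{-1} &=
  \begin{bmatrix} \mathbf{G}_{cc}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}
  + \begin{bmatrix} -\mathbf{P}_{nc}^{\top} \\ \mathbf{I} \end{bmatrix}
    \mathbf{M}_{nn}^{-1}
    \begin{bmatrix} -\mathbf{P}_{nc} & \mathbf{I} \end{bmatrix}, \\
\mathbf{M}_{nn} &= \operatorname{diag}\!\left( g_{ii} - \mathbf{g}_{ic}\,\mathbf{G}_{cc}^{-1}\,\mathbf{g}_{ci} \right).
\end{aligned}
```

Only the core block G_cc is inverted densely; the non-core part contributes a diagonal matrix, which is where the sparsity and the low cost come from.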
Project description:The objectives of this study were to develop an efficient algorithm for calculating prediction error variances (PEVs) for genomic best linear unbiased prediction (GBLUP) models using the Algorithm for Proven and Young (APY), extend it to single-step GBLUP (ssGBLUP), and apply this algorithm for approximating the theoretical reliabilities for single- and multiple-trait models in ssGBLUP. The PEV with APY was calculated by block sparse inversion, efficiently exploiting the sparse structure of the inverse of the genomic relationship matrix with APY. Single-step GBLUP reliabilities were approximated by combining reliabilities with and without genomic information in terms of effective record contributions. Multi-trait reliabilities relied on single-trait results adjusted using the genetic and residual covariance matrices among traits. Tests involved two datasets provided by the American Angus Association. A small dataset (Data1) was used for comparing the approximated reliabilities with the reliabilities obtained by the inversion of the left-hand side of the mixed model equations. A large dataset (Data2) was used for evaluating the computational performance of the algorithm. Analyses with both datasets used single-trait and three-trait models. The number of animals in the pedigree ranged from 167,951 in Data1 to 10,213,401 in Data2, with 50,000 and 20,000 genotyped animals for single-trait and multiple-trait analysis, respectively, in Data1 and 335,325 in Data2. Correlations between estimated and exact reliabilities obtained by inversion ranged from 0.97 to 0.99, whereas the intercept and slope of the regression of the exact on the approximated reliabilities ranged from 0.00 to 0.04 and from 0.93 to 1.05, respectively. For the three-trait model with the largest dataset (Data2), the elapsed time for the reliability estimation was 11 min. 
The computational complexity of the proposed algorithm increased linearly with the number of genotyped animals and with the number of traits in the model. This algorithm can efficiently approximate the theoretical reliability of genomic estimated breeding values in ssGBLUP with APY for large numbers of genotyped animals at a low cost.
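The link between a PEV and a reliability can be illustrated with the textbook relation below. This is a minimal sketch, assuming reliability = 1 - PEV / ((1 + F) * sigma2_a); the function name and the `inbreeding` argument are hypothetical, and the abstract's block sparse inversion and effective-record approximations are not reproduced here.

```python
def reliability_from_pev(pev, sigma2_a, inbreeding=0.0):
    """Textbook reliability of an estimated breeding value.

    pev        : prediction error variance of the animal's breeding value
    sigma2_a   : additive genetic variance of the trait
    inbreeding : inbreeding coefficient F of the animal (assumed 0 by default)
    """
    if sigma2_a <= 0:
        raise ValueError("sigma2_a must be positive")
    # The animal's prior genetic variance is (1 + F) * sigma2_a.
    return 1.0 - pev / ((1.0 + inbreeding) * sigma2_a)


# Example: a PEV of 0.3 with unit additive variance gives a reliability near 0.7.
example_reliability = reliability_from_pev(0.3, 1.0)
```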
Project description:Background: Single-step genomic best linear unbiased prediction (ssGBLUP) models allow the combination of genomic, pedigree, and phenotypic data into a single model, which is computationally challenging for large genotyped populations. In practice, genotypes of animals without their own phenotype and progeny, so-called genotyped selection candidates, can become available after genomic breeding values have been estimated by ssGBLUP. In some breeding programmes, genomic estimated breeding values (GEBV) for these animals should be known shortly after obtaining genotype information but recomputing GEBV using the full ssGBLUP takes too much time. In this study, first we compare two equivalent formulations of ssGBLUP models, i.e. one that is based on the Woodbury matrix identity applied to the inverse of the genomic relationship matrix, and one that is based on marker equations. Second, we present computationally fast approaches to indirectly compute GEBV for genotyped selection candidates, without the need to do the full ssGBLUP evaluation. Results: The indirect approaches use information from the latest ssGBLUP evaluation and rely on the decomposition of GEBV into its components. The two equivalent ssGBLUP models and indirect approaches were tested on a six-trait calving difficulty model using Irish dairy and beef cattle data that include 2.6 million genotyped animals of which about 500,000 were considered as genotyped selection candidates. When using the same computational approaches, the solving phase of the two equivalent ssGBLUP models showed similar requirements for memory and time per iteration. The computational differences between them were due to the preprocessing phase of the genomic information.
Regarding the indirect approaches, compared to GEBV obtained from single-step evaluations including all genotypes, indirect GEBV had correlations higher than 0.99 for all traits while showing little dispersion and level bias. Conclusions: The ssGBLUP predictions for the genotyped selection candidates were accurately approximated using the presented indirect approaches, which are more memory-efficient and computationally faster than solving a full ssGBLUP evaluation. Thus, indirect approaches can be used even on a weekly basis to estimate GEBV for newly genotyped animals, while the full single-step evaluation is done only a few times within a year.
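The abstract does not spell out the components into which the GEBV is decomposed. The sketch below shows one common form of indirect prediction, a direct genomic value computed from SNP effect estimates of the latest evaluation, and may differ from the authors' approaches. Centring genotypes by twice the allele frequency, and the names `direct_genomic_value`, `indirect_gebv` and `other_components`, are assumptions made for illustration.

```python
def direct_genomic_value(genotype, snp_effects, allele_freqs):
    """Sum of centred genotype codes (0/1/2) times estimated SNP effects."""
    if not (len(genotype) == len(snp_effects) == len(allele_freqs)):
        raise ValueError("all inputs must have one entry per SNP")
    # Centring by 2p is an assumption; the source does not state the coding.
    return sum((z - 2.0 * p) * a
               for z, p, a in zip(genotype, allele_freqs, snp_effects))


def indirect_gebv(genotype, snp_effects, allele_freqs, other_components=0.0):
    """Direct genomic value plus any further GEBV components.

    `other_components` is a hypothetical placeholder for terms such as a
    residual polygenic part; it is not defined by the source.
    """
    return direct_genomic_value(genotype, snp_effects, allele_freqs) + other_components


# A candidate genotyped at three SNPs, using effects from a previous evaluation.
candidate_gebv = indirect_gebv([2, 1, 0], [0.5, -0.2, 0.1], [0.5, 0.5, 0.5])
```

Because this reuses stored SNP effects, the cost per new animal is one pass over its genotype, in line with the weekly use the abstract describes.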
Project description:Genome-wide association study (GWAS) summary data have become extremely useful in daily routine data analysis, largely facilitating new methods development and new applications. However, a severe limitation with the current use of GWAS summary data is its exclusive restriction to only linear single nucleotide polymorphism (SNP)-trait association analyses. To further expand the use of GWAS summary data, along with a large sample of individual-level genotypes, we propose a nonparametric method for large-scale imputation of the genetic component of the trait for the given genotypes. The imputed individual-level trait values, along with the individual-level genotypes, make it possible to conduct any analysis as with individual-level GWAS data, including nonlinear SNP-trait associations and predictions. We use the UK Biobank data to highlight the usefulness and effectiveness of the proposed method in three applications that currently cannot be done with only GWAS summary data (for SNP-trait associations): marginal SNP-trait association analysis under non-additive genetic models, detection of SNP-SNP interactions, and genetic prediction of a trait using a nonlinear model of SNPs.
Project description:Background: Scrapie is an infectious prion disease in sheep. Selective breeding for resistant genotypes of the prion protein gene (PRNP) is an effective way to prevent scrapie outbreaks. Genotyping all selection candidates in a population is expensive but existing pedigree records can help infer the probabilities of genotypes in relatives of genotyped animals. Results: We used linear models to predict allele content for the various PRNP alleles found in Icelandic sheep and compiled the available estimates of relative scrapie susceptibility (RSS) associated with PRNP genotypes from the literature. Using the predicted allele content and the genotypic RSS we calculated estimated breeding values (EBV) for RSS. We tested the predictions on simulated data under different scenarios that varied in the proportion of genotyped sheep, genotyping strategy, pedigree recording accuracy, genotyping error rates and assumed heritability of allele content. Prediction of allele content for rare alleles was less successful than for alleles with moderate frequencies. The accuracy of allele content and RSS EBV predictions was not affected by the assumed heritability, but the dispersion of prediction was affected. In a scenario where 40% of rams were genotyped and there were no errors in genotyping or in the recorded pedigree, the accuracy of RSS EBV for ungenotyped selection candidates was 0.49. If only 20% of rams were genotyped, or rams and ewes were genotyped randomly, or there were 10% pedigree errors, or there were 2% genotyping errors, the accuracy decreased by 0.07, 0.08, 0.03 and 0.04, respectively. With empirical data, the accuracy of RSS EBV for ungenotyped sheep was 0.46-0.65. Conclusions: A linear model for predicting allele content for the PRNP gene, combined with estimates of relative susceptibility associated with PRNP genotypes, can provide RSS EBV for scrapie resistance for ungenotyped selection candidates with accuracy up to 0.65.
These RSS EBV can complement selection strategies based on PRNP genotypes, especially in populations where resistant genotypes are rare.
Project description:Environmental factors interact with internal rules of population regulation, sometimes perturbing systems to alternate dynamics through changes in parameter values. Yet, pinpointing when such changes occur in naturally fluctuating populations is difficult. An algorithmic approach that can identify the timing and magnitude of parameter shifts would facilitate understanding of abrupt ecological transitions with potential to inform conservation and management of species. The "Dynamic Shift Detector" is an algorithm to identify changes in parameter values governing temporal fluctuations in populations with nonlinear dynamics. The algorithm examines population time series data for the presence, location, and magnitude of parameter shifts. It uses an iterative approach to fitting subsets of time series data, then ranks the fit of break point combinations using model selection, assigning a relative weight to each break. We examined the performance of the Dynamic Shift Detector with simulations and two case studies. Under low environmental/sampling noise, the break point sets selected by the Dynamic Shift Detector contained the true simulated breaks with 70-100% accuracy. The weighting tool generally assigned breaks intentionally placed in simulated data (i.e., true breaks) with weights averaging >0.8 and those due to sampling error (i.e., erroneous breaks) with weights averaging <0.2. In our case study examining an invasion process, the algorithm identified shifts in population cycling associated with variations in resource availability. The shifts identified for the conservation case study highlight a decline process that generally coincided with changing management practices affecting the availability of hostplant resources. When interpreted in the context of species biology, the Dynamic Shift Detector algorithm can aid management decisions and identify critical time periods related to species' dynamics.
In an era of rapid global change, such tools can provide key insights into the conditions under which population parameters, and their corresponding dynamics, can shift.
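The ranking and weighting step can be sketched as follows. This is a minimal illustration, assuming the model selection criterion is an AIC-type score and that a break's weight is the summed Akaike weight of every candidate model containing it; the abstract does not state the criterion, and the function names and AIC values below are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC scores into relative model weights that sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


def break_weights(models):
    """Weight each break point by the models that include it.

    models: list of (break_points, aic) pairs, where break_points is a
    tuple of time indices at which the parameters are allowed to shift.
    """
    weights = akaike_weights([aic for _, aic in models])
    per_break = {}
    for (breaks, _), weight in zip(models, weights):
        for b in breaks:
            per_break[b] = per_break.get(b, 0.0) + weight
    return per_break


# Made-up scores for four break point combinations fitted to one series.
candidates = [((), 120.0), ((15,), 100.0), ((15, 22), 102.0), ((9,), 118.0)]
weights_by_break = break_weights(candidates)
```

In this made-up example the break at time 15 appears in both well-supported models and so receives a weight close to 1, while the break at time 9 receives almost none.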
Project description:The inverses of the pedigree and genomic relationship matrices (A, G) are required for single-step GBLUP (ssGBLUP). While inverting A is possible for millions of animals at a linear cost, inverting G has a cubic cost and is feasible for at most 150,000 animals using current conventional algorithms. The algorithm for proven and young (APY) provides approximations of the regular ssGBLUP by splitting genotyped animals into core and noncore groups, with computational costs being cubic for core and linear for noncore animals. The data consisted of 9,406,096 animals in the pedigree, 6,243,753 weaning weight phenotypes, and 46,949 genotyped animals from 5 breeds, composites, and animals with missing breed information from New Zealand. Aiming to find a core sample for a multibreed sheep population that can provide evaluations similar to those from the regular ssGBLUP, different core types and core sizes were studied. Core types random, composite, oldest, youngest, the most inbred animals in G (GINB), and in A (AINB) were studied in 5K, 10K, and 20K core sizes (K = 1,000). Romney core was studied in 5K and 10K, and Coopworth-Perendale core was studied in 5K. Correlation and regression coefficient (slope) between GEBV from the non-APY and the APY analyses, as indicators for consistency with non-APY and bias from non-APY, showed a large impact of APY on noncore and a small impact on nongenotyped animals. Breed-based 5K cores resulted in large bias from non-APY even for nongenotyped animals. Random and GINB at 20K core size resulted in the highest consistency with non-APY and the lowest bias from non-APY. However, GINB did not perform as well as Random at lower core sizes. The number of animals from a breed in the core sample was very important for the evaluation of that breed. We observed that cores without Texel or Highlander animals resulted in poor evaluations for those breeds.
When solving the mixed model equations, the smallest core size converged in the fewest iterations within each core type, and the Random core converged in the fewest iterations within each core size. However, APY per se did not necessarily reduce the solving time. Random cores performed the best, as they could give good coverage of the generations and breeds and were representative of the genotyped population. Core size 20K performed better than 5K and 10K, and the optimum core size was found to be 18.8K, according to the eigenvalue decomposition of G.
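A core size can be read off an eigenvalue decomposition by counting the largest eigenvalues needed to explain a chosen share of the total variance in G. The sketch below assumes a 98% threshold, which is a common choice but is not stated in the abstract; the function name and the toy eigenvalues are hypothetical.

```python
def core_size_from_eigenvalues(eigenvalues, fraction=0.98):
    """Number of largest eigenvalues whose sum reaches `fraction` of the total."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


# Toy spectrum: four of the five eigenvalues are needed to reach 98%.
toy_core_size = core_size_from_eigenvalues([50, 30, 15, 4, 1])
```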
Project description:Bayesian methods are widely used in GWAS meta-analysis, but their considerable consumption of computing time and memory poses great challenges for large-scale meta-analyses. In this research, we propose an algorithm named SMetABF to rapidly obtain the optimal approximate Bayes factor (ABF) in GWAS meta-analysis, where shotgun stochastic search (SSS) is introduced to improve the Bayesian GWAS meta-analysis framework, MetABF. Simulation studies confirm that SMetABF performs well in both speed and accuracy, compared to exhaustive methods and MCMC. SMetABF is applied to real GWAS datasets to find several essential loci related to Parkinson's disease (PD) and the results support the underlying relationship between PD and other autoimmune disorders. Developed as an R package and a web tool, SMetABF will become a useful tool to integrate different studies and identify more variants associated with complex traits.
Project description:We propose a method, SDpop, able to infer sex-linkage caused by recombination suppression typical of sex chromosomes. The method is based on the modeling of the allele and genotype frequencies of individuals of known sex in natural populations. It is implemented in a hierarchical probabilistic framework, accounting for different sources of error. It allows statistical testing for the presence or absence of sex chromosomes, and detection of sex-linked genes based on the posterior probabilities in the model. Furthermore, for gametologous sequences, the haplotype and level of nucleotide polymorphism of each copy can be inferred, as well as the divergence between them. We test the method using simulated data, as well as data from both a relatively recent and an old sex chromosome system (the plant Silene latifolia and humans) and show that, for most cases, robust predictions are obtained with 5 to 10 individuals per sex.
Project description:Background: To obtain predictions that are not biased by selection, the conditional mean of the breeding values must be computed given the data that were used for selection. When single nucleotide polymorphism (SNP) effects have a normal distribution, it can be argued that single-step best linear unbiased prediction (SS-BLUP) yields a conditional mean of the breeding values. Obtaining SS-BLUP, however, requires computing the inverse of the dense matrix G of genomic relationships, which will become infeasible as the number of genotyped animals increases. Also, computing G requires the frequencies of SNP alleles in the founders, which are not available in most situations. Furthermore, SS-BLUP is expected to perform poorly relative to variable selection models such as BayesB and BayesC as marker densities increase. Methods: A strategy is presented for Bayesian regression models (SSBR) that combines all available data from genotyped and non-genotyped animals, as in SS-BLUP, but accommodates a wider class of models. Our strategy uses imputed marker covariates for animals that are not genotyped, together with an appropriate residual genetic effect to accommodate deviations between true and imputed genotypes. Under normality, one formulation of SSBR yields results identical to SS-BLUP, but does not require computing G or its inverse and provides richer inferences. At present, Bayesian regression analyses are used with a few thousand genotyped individuals. However, when SSBR is applied to all animals in a breeding program, there will be a 100 to 200-fold increase in the number of animals and an associated 100 to 200-fold increase in computing time. Parallel computing strategies can be used to reduce computing time. In one such strategy, a 58-fold speedup was achieved using 120 cores. Discussion: In SSBR and SS-BLUP, phenotype, genotype and pedigree information are combined in a single step.
Unlike SS-BLUP, SSBR is not limited to normally distributed marker effects; it can be used when marker effects have a t distribution, as in BayesA, or mixture distributions, as in BayesB or BayesCπ. Furthermore, it has the advantage that matrix inversion is not required. We have investigated parallel computing to speed up SSBR analyses so they can be used for routine applications.
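The imputation of marker covariates and the accompanying residual genetic effect are commonly written as below. This follows the usual presentation of single-step Bayesian regression in the literature and is not quoted from this abstract; the subscripts g (genotyped) and n (non-genotyped) and the symbols M, A, α and ε are notation assumed here.

```latex
\begin{aligned}
\hat{\mathbf{M}}_n &= \mathbf{A}_{ng}\mathbf{A}_{gg}^{-1}\mathbf{M}_g, \\
\mathbf{u}_g &= \mathbf{M}_g\boldsymbol{\alpha}, \qquad
\mathbf{u}_n = \hat{\mathbf{M}}_n\boldsymbol{\alpha} + \boldsymbol{\varepsilon}, \\
\operatorname{Var}(\boldsymbol{\varepsilon}) &=
  \left( \mathbf{A}_{nn} - \mathbf{A}_{ng}\mathbf{A}_{gg}^{-1}\mathbf{A}_{gn} \right)\sigma_g^{2}.
\end{aligned}
```

Here ε absorbs the deviation between true and imputed genotypes, and the prior placed on α determines whether the analysis corresponds to SS-BLUP (normal) or to BayesA, BayesB or BayesCπ.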