Dataset Information

Predicting the Number of Bases to Attain Sufficient Coverage in High-Throughput Sequencing Experiments.

ABSTRACT: For many types of high-throughput sequencing experiments, success in downstream analysis depends on attaining sufficient coverage for individual positions in the genome. For example, when identifying single-nucleotide variants de novo, the number of reads supporting a particular variant call determines our confidence in that variant call. If sequenced reads are distributed uniformly along the genome, the coverage of a nucleotide position is easily approximated by a Poisson distribution, with rate equal to average sequencing depth. Unfortunately, as has become well known, high-throughput sequencing data are never uniform. The numerous factors contributing to variation in coverage have resisted attempts at direct modeling and change along with minor adjustments in the underlying technology. We propose a new nonparametric method to predict the portion of a genome that will attain some specified minimum coverage, as a function of sequencing effort, using information from a shallow sequencing experiment from the same library. Simulations show our approach performs well under an array of distributional assumptions that deviate from uniformity. We applied this approach to estimate coverage at varying depths in single-cell whole-genome sequencing data from multiple protocols. These resulted in highly accurate predictions, demonstrating the effectiveness of our approach in analyzing complexity of sequencing libraries and optimizing design of sequencing experiments.
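For reference, the uniform-coverage baseline mentioned in the abstract can be sketched in a few lines. This is only the idealized Poisson approximation that the abstract describes as unrealistic in practice, not the authors' nonparametric method. The function name `poisson_fraction_covered` and its parameters are hypothetical, chosen here for illustration.

```python
import math


def poisson_fraction_covered(mean_depth: float, min_coverage: int) -> float:
    """Expected fraction of positions with coverage >= min_coverage,
    assuming coverage ~ Poisson(mean_depth), i.e. perfectly uniform reads.

    Illustrative baseline only (hypothetical helper, not from the paper).
    """
    if min_coverage <= 0:
        return 1.0
    # Accumulate P(X = i) for i = 0 .. min_coverage - 1, then take the complement.
    term = math.exp(-mean_depth)  # P(X = 0)
    cdf = 0.0
    for i in range(min_coverage):
        cdf += term
        term *= mean_depth / (i + 1)  # P(X = i + 1) from P(X = i)
    return 1.0 - cdf


if __name__ == "__main__":
    # Fraction of positions expected to reach at least 10x at several mean depths.
    for depth in (5.0, 10.0, 20.0, 30.0):
        print(depth, round(poisson_fraction_covered(depth, 10), 4))
```

Real libraries typically fall short of these values at a given depth, because coverage is more dispersed than the Poisson model assumes; that gap is the motivation the abstract gives for predicting coverage from a shallow sequencing run instead.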

SUBMITTER: Deng C

PROVIDER: S-EPMC7398442 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: Background: Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to a need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. Materials and methods: We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. Results: Here, we present two selection statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes. Conclusion: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high-quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad hoc filtering.

