Dataset Information

Statistical guidelines for quality control of next-generation sequencing techniques.

ABSTRACT: More and more next-generation sequencing (NGS) data are made available every day. However, the quality of these data is not always guaranteed. Available quality control tools require profound knowledge to correctly interpret the multiplicity of quality features. Moreover, it is usually difficult to know if quality features are relevant in all experimental conditions. Therefore, the NGS community would highly benefit from condition-specific data-driven guidelines derived from many publicly available experiments, which reflect routinely generated NGS data. In this work, we have characterized well-known quality guidelines and related features in big datasets and concluded that they are too limited for assessing the quality of a given NGS file accurately. Therefore, we present new data-driven guidelines derived from the statistical analysis of many public datasets using quality features calculated by common bioinformatics tools. Thanks to this approach, we confirm the high relevance of genome mapping statistics to assess the quality of the data, and we demonstrate the limited scope of some quality features that are not relevant in all conditions. Our guidelines are available at https://cbdm.uni-mainz.de/ngs-guidelines.

SUBMITTER: Sprang M

PROVIDER: S-EPMC8408346 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Project description:
Background/aims: Recently, next-generation sequencing-based technologies have enabled DNA methylation profiling at high resolution and low cost. Methyl-Seq and Reduced Representation Bisulfite Sequencing (RRBS) are two such technologies that interrogate methylation levels at CpG sites throughout the entire human genome. With the rapid reduction of sequencing costs, these technologies will enable epigenotyping of large cohorts for phenotypic association studies. Existing quantification methods for sequencing-based methylation profiling are simplistic and do not deal with the noise due to the random sampling nature of sequencing and various experimental artifacts. Therefore, there is a need to investigate the statistical issues related to the quantification of methylation levels for these emerging technologies, with the goal of developing an accurate quantification method.

Methods: In this paper, we propose two methods for Methyl-Seq quantification. The first method, the Maximum Likelihood estimate, is both conceptually intuitive and computationally simple. However, this estimate is biased at extreme methylation levels and does not provide variance estimation. The second method, based on a Bayesian hierarchical model, allows variance estimation of methylation levels and provides a flexible framework to adjust for technical bias in the sequencing process.

Results: We compare the previously proposed binary method, the Maximum Likelihood (ML) method, and the Bayesian method. In both simulation and real data analysis of Methyl-Seq data, the Bayesian method offers the most accurate quantification. The ML method is slightly less accurate than the Bayesian method, but both of our proposed methods outperform the original binary method in Methyl-Seq. In addition, we applied these quantification methods to simulation data and show that, with a sequencing depth above 40-300 per cleavage site (the threshold varies across tissue samples), Methyl-Seq offers quantification consistency comparable to that of microarrays.
