Project description:With its capacity for high-resolution data output in one region of interest, chromosome conformation capture combined with high-throughput sequencing (4C-seq) is a state-of-the-art next-generation sequencing technique that provides epigenetic insights and regularly advances current medical research. However, 4C-seq data is complex and prone to biases, and while specialized programs exist, an unbiased, extensive benchmark is still lacking. Furthermore, neither substantial datasets with fully characterized ground truth nor simulation programs for realistic 4C-seq data have been published. We conducted a benchmarking study on 54 4C-seq samples from 12 datasets, including original murine BMM, T-cell, and 416B data, and developed a novel 4C-seq simulation software to allow for more detailed comparisons of 4C-seq algorithms on 50 simulated datasets with 10 to 120 samples each.
Project description:At present, it is accepted that RNA-seq is a more powerful and adaptable technique than hybridization arrays. Nevertheless, because RNA-seq requires more complex data analysis, it has generated a lot of research on algorithms and workflows. This has resulted in an exponential increase in the options at each step of the analysis. Consequently, there is no clear consensus on the appropriate algorithms and pipelines that should be used to analyse RNA-seq data. In the present study, 192 pipelines on 18 samples from 2 human cell lines were evaluated. Absolute gene expression quantification was assessed by non-parametric statistics to measure precision and accuracy. Relative gene expression performance was estimated by testing 19 differential expression methods. These results were contrasted in parallel with the microarray HTA 2.0 data from Affymetrix using the same set of samples. All procedures were validated by qRT-PCR on 32 genes in all samples. In addition, this study proposes a new statistical approach for precision and accuracy evaluation on real RNA-seq data. It also weighs up the advantages and disadvantages of the algorithms and pipelines tested and gives a guide to selecting the appropriate pipeline to analyse RNA-seq and microarray data.
Project description:The Virochip microarray (version 4.0) was used to detect viruses in patients from North America with unexplained influenza-like illness at the onset of the 2009 H1N1 pandemic. We used metagenomics-based technologies (the Virochip microarray) and deep sequencing to analyze nasal swab samples from individuals with 2009 H1N1 infection. This Series includes the Virochip microarray data only.
Project description:Targeted metagenomics, also known as metagenetics, is a high-throughput sequencing application focusing on a nucleotide target in a microbiome to describe its taxonomic content. A wide range of bioinformatics pipelines are available to analyze sequencing outputs, and the choice of an appropriate tool is crucial and not trivial. No standard evaluation method exists for estimating the accuracy of a pipeline for targeted metagenomics analyses. This article proposes an evaluation protocol containing real and simulated targeted metagenomics datasets, and adequate metrics allowing us to study the impact of different variables on the biological interpretation of results. This protocol was used to compare six different bioinformatics pipelines in the basic user context: three common ones (mothur, QIIME and BMP) based on a clustering-first approach and three emerging ones (Kraken, CLARK and One Codex) using an assignment-first approach. This study surprisingly reveals that sequencing errors have a bigger impact on the results than the choice of amplified region. Moreover, increasing sequencing throughput increases richness overestimation, even more so for microbiota of high complexity. Finally, the choice of the reference database has a bigger impact on richness estimation for clustering-first pipelines, and on correct taxa identification for assignment-first pipelines. Using emerging assignment-first pipelines is a valid approach for targeted metagenomics analyses, with a quality of results comparable to popular clustering-first pipelines, even with an error-prone sequencing technology like Ion Torrent. However, those pipelines are highly sensitive to the quality of databases and their annotations, which makes clustering-first pipelines still the only reliable approach for studying microbiomes that are not well described.
Project description:Shotgun metagenomic sequencing comprehensively samples the DNA of a microbial sample. Choosing the best bioinformatics processing package can be daunting due to the wide variety of tools available. Here, we assessed publicly available shotgun metagenomics processing packages/pipelines including bioBakery, Just a Microbiology System (JAMS), Whole metaGenome Sequence Assembly V2 (WGSA2), and Woltka using 19 publicly available mock community samples and a set of five constructed pathogenic gut microbiome samples. Also included is a workflow for labelling bacterial scientific names with NCBI taxonomy identifiers for better resolution in assessing results. The Aitchison distance, a sensitivity metric, and total False Positive Relative Abundance were used for accuracy assessments for all pipelines and mock samples. Overall, bioBakery4 performed the best on most of the accuracy metrics, while JAMS and WGSA2 had the highest sensitivities. Furthermore, bioBakery is commonly used and only requires a basic knowledge of command line usage. This work provides an unbiased assessment of shotgun metagenomics packages and presents results assessing the performance of the packages using mock community sequence data.
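As an illustration of the Aitchison distance named above, the following is a minimal sketch, not the code used in the study. It assumes the standard definition (Euclidean distance between centred log-ratio transformed compositions); the function names and the pseudocount used to handle zero abundances are hypothetical choices made for this example.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centred log-ratio transform of one abundance vector.

    The pseudocount is an assumption of this sketch: it keeps log()
    defined when a taxon has zero abundance.
    """
    shifted = [x + pseudocount for x in composition]
    total = sum(shifted)
    logs = [math.log(x / total) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(a, b, pseudocount=1e-6):
    """Euclidean distance between the CLR transforms of a and b."""
    if len(a) != len(b):
        raise ValueError("both profiles must cover the same taxa")
    ca = clr(a, pseudocount)
    cb = clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Hypothetical expected and observed relative abundances for three taxa.
expected = [0.5, 0.3, 0.2]
observed = [0.6, 0.3, 0.1]
print(aitchison_distance(expected, observed))
```

A distance of zero means the observed profile matches the expected mock community composition; because the metric works on log-ratios, it is insensitive to the overall scale of the abundance vectors when no pseudocount is applied.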
Project description:The proper identification of differentially methylated CpGs is central in most epigenetic studies. The Illumina Human Methylation 450k BeadChip is widely used to quantify DNA methylation; nevertheless, the design of an appropriate analysis pipeline faces severe challenges due to the convolution of biological and technical variability and the presence of a signal bias between Infinium I and II probe design types. Despite recent attempts to investigate how to analyze DNA methylation data with such an array design, it has not been possible to perform a comprehensive comparison between different bioinformatics pipelines due to the lack of appropriate datasets having both large sample size and sufficient number of technical replicates. Here we perform such a comparative analysis, targeting the problems of reducing the technical variability, eliminating the probe design bias and reducing the batch effect by exploiting two unpublished datasets, which included technical replicates and were profiled for DNA methylation on peripheral blood, monocytes or muscle biopsies. We evaluated the performance of different analysis pipelines and demonstrated that a) it is critical to correct for the probe design type, since the amplitude of the measured methylation change depends on the underlying chemistry; b) the effect of different normalization schemes is mixed, and the most effective methods in our hands were quantile normalization and Beta Mixture Quantile dilation (BMIQ); c) it is beneficial to correct for batch effects. In conclusion, our comparative analysis using a comprehensive dataset suggests an efficient pipeline for proper identification of differentially methylated CpGs using the Illumina 450k arrays. DNA samples from peripheral blood or CD14+ monocytes were included in the study. DNA methylation levels were profiled using Illumina 450K arrays.
Specifically, 50 biological sample replicates from peripheral blood (PB) and 36 biological sample replicates from monocytes were randomly assigned to 8 BeadChips with technical replicates and processed in one run (a total of 96 DNA samples). Eight samples were technically replicated in pairs, while one sample was represented in a trio of replicates. Different analysis pipelines were compared; however, the uploaded file refers to the best-scoring pipeline. In our publication, we used this pipeline for all analyses and conclusions.
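The quantile normalization mentioned in the 450k description above can be sketched as follows. This is a generic, simplified illustration and not the study's pipeline: it assumes one list of values per sample with the same probes in the same order, breaks ties by position instead of averaging them, and does not implement BMIQ or the probe-type correction. The function name and the example values are hypothetical.

```python
def quantile_normalize(samples):
    """Force every sample to share the same value distribution.

    samples: list of equal-length lists, one list per sample.
    Returns a new list of lists in the original probe order.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Mean of the values found at each rank across all samples.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest value.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured on three probes.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(raw))
```

After normalization each sample contains the same set of values, so remaining differences between samples reflect only the rank of each probe within its sample.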