Dataset Information

Statistical methods for detecting differentially abundant features in clinical metagenomic samples.

ABSTRACT: Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is the availability of computational tools that can rapidly and accurately compare large datasets generated from complex bacterial communities to identify the features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level.
While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.
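
The two ingredients named in the abstract, Fisher's exact test for sparsely sampled features and false discovery rate control across many features, can be illustrated with a minimal sketch. This is not the Metastats implementation: the 2x2 table layout (pooled feature counts against all remaining counts in each population), the use of the Benjamini-Hochberg adjustment as the FDR procedure, the function names, and the example counts are all assumptions made here for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pooled counts for three sparse features: (count in population 1,
# count in population 2), with hypothetical total counts per population.
total1, total2 = 1000, 1200
features = {"feature_A": (9, 1), "feature_B": (3, 4), "feature_C": (0, 7)}

pvals = [
    # Assumed table layout: feature counts vs. all other counts per population.
    fisher_exact_two_sided(f1, f2, total1 - f1, total2 - f2)
    for f1, f2 in features.values()
]
for name, p, q in zip(features, pvals, benjamini_hochberg(pvals)):
    print(f"{name}: p={p:.4g}, BH-adjusted={q:.4g}")
```

Pooling counts across subjects before applying an exact test is one way to handle features too sparse for a per-subject comparison; features with richer counts would normally be tested with a method that uses the between-subject variation.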

SUBMITTER: White JR

PROVIDER: S-EPMC2661018 | biostudies-literature | 2009 Apr

REPOSITORIES: biostudies-literature

Publications

Statistical methods for detecting differentially abundant features in clinical metagenomic samples.

White JR, Nagarajan N, Pop M

PLoS Computational Biology, 2009 Apr 10, issue 4

PMID: 19360128

Similar Datasets

Project description: BACKGROUND: Microbial longitudinal studies are powerful experimental designs utilized to classify diseases, determine prognosis, and analyze microbial systems dynamics. In longitudinal studies, only identifying differential features between two phenotypes does not provide sufficient information to determine whether a change in the relative abundance is short-term or continuous. Furthermore, sample collection in longitudinal studies suffers from all forms of variability such as a different number of subjects per phenotypic group, a different number of samples per subject, and samples not collected at consistent time points. These inconsistencies are common in studies that collect samples from human subjects. RESULTS: We present MetaLonDA, an R package that is capable of identifying significant time intervals of differentially abundant microbial features. MetaLonDA is flexible such that it can perform differential abundance tests despite inconsistencies associated with sample collection. Extensive experiments on simulated datasets quantitatively demonstrate the effectiveness of MetaLonDA with significant improvement over alternative methods. We applied MetaLonDA to the DIABIMMUNE cohort ( https://pubs.broadinstitute.org/diabimmune ) substantiating significant early lifetime intervals of exposure to Bacteroides and Bifidobacterium in Finnish and Russian infants. Additionally, we established significant time intervals during which novel differentially relative abundant microbial genera may contribute to aberrant immunogenicity and development of autoimmune disease. CONCLUSION: MetaLonDA is computationally efficient and can be run on desktop machines. The identified differentially abundant features and their time intervals have the potential to distinguish microbial biomarkers that may be used for microbial reconstitution through bacteriotherapy, probiotics, or antibiotics.
Moreover, MetaLonDA can be applied to any longitudinal count data such as metagenomic sequencing, 16S rRNA gene sequencing, or RNAseq. MetaLonDA is publicly available on CRAN ( https://CRAN.R-project.org/package=MetaLonDA ).

Project description: Background: Numerous protocols for viral enrichment and genome amplification have been created. However, the direct identification of viral genomes from clinical specimens using next-generation sequencing (NGS) still has its challenges. As a selected viral nucleic acid extraction method may determine the sensitivity and reliability of NGS, it is still valuable to evaluate the extraction efficiency of different extraction kits using clinical specimens directly. Results: In this study, we performed qRT-PCR and viral metagenomic analysis of the extraction efficiency of four commonly used Qiagen extraction kits: QIAamp Viral RNA Mini Kit (VRMK), QIAamp MinElute Virus Spin Kit (MVSK), RNeasy Mini Kit (RMK), and RNeasy Plus Micro Kit (RPMK), using a mixed respiratory clinical sample without any pre-treatment. This sample contained an adenovirus (ADV), influenza virus A (Flu A), human parainfluenza virus 3 (PIV3), human coronavirus OC43 (OC43), and human metapneumovirus (HMPV). The quantity and quality of the viral extracts were significantly different among these kits. The highest threshold cycle (Ct) values for ADV and OC43 were obtained by using the RPMK. The MVSK had the lowest Ct values for ADV and PIV3. The RMK revealed the lowest detectability for HMPV and PIV3. The most effective rate of NGS data at 67.47% was observed with the RPMK. The other three kits ranged between 12.1-26.79% effectiveness rates for the NGS data. Most importantly, compared to the other three kits the highest proportion of non-host reads was obtained by the RPMK. The MVSK performed best with the lowest Ct value of 20.5 in the extraction of ADV, while the RMK revealed the best extraction efficiency by NGS analysis. Conclusions: The evaluation of viral nucleic acid extraction efficiency is different between NGS and qRT-PCR analysis.
The RPMK was most applicable for the metagenomic analysis of viral RNA and enabled more sensitive identification of the RNA virus genome in respiratory clinical samples. In addition, viral RNA extraction kits were also applicable for metagenomic analysis of the DNA virus. Our results highlighted the importance of nucleic acid extraction kit selection, which has a major impact on the yield and number of viral reads by NGS analysis. Therefore, the choice of extraction method for a given viral pathogen needs to be carefully considered.

Project description: Background: Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. Results: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). Conclusions: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity.
The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers.

