Project description:This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.
Project description:Since the initial discovery of a mobilized colistin resistance gene (mcr-1), several other variants have been reported, some of which might have circulated a while beforehand. Publicly available metagenomic data provide an opportunity to reanalyze samples to understand the evolutionary history of recently discovered antimicrobial resistance genes (ARGs). Here, we present a large-scale metagenomic study of 442 Tbp of sequencing reads from 214,095 samples to describe the dissemination and emergence of nine mcr gene variants (mcr-1 to mcr-9). Our results show that the dissemination of each variant is not uniform. Instead, the source and location play a role in the spread. However, the genomic context and the genes themselves remain primarily unchanged. We report evidence of new subvariants occurring in specific environments, such as a highly prevalent and new variant of mcr-9. This work emphasizes the importance of sharing genomic data for the surveillance of ARGs in our understanding of antimicrobial resistance. IMPORTANCE The ever-growing collection of metagenomic samples available in public data repositories has the potential to reveal new details on the emergence and dissemination of mobilized colistin resistance genes. Our analysis of metagenomes deposited online in the last 10 years shows that the environmental distribution of mcr gene variants depends on sampling source and location, possibly leading to the emergence of new variants, although the contig on which the mcr genes were found remained consistent.
Project description:A major goal of metagenomics is to identify and study the entire collection of microbial species in a set of targeted samples. We describe a statistical metagenomic algorithm that simultaneously identifies microbial species and estimates their abundances without using reference genomes. As a trade-off, we require multiple metagenomic samples, usually ≥10 samples, to get highly accurate binning results. Compared to reference-free methods based primarily on k-mer distributions or coverage information, the proposed approach achieves a higher species binning accuracy and is particularly powerful when sequencing coverage is low. We demonstrated the performance of this new method through both simulation and real metagenomic studies. The MetaGen software is available at https://github.com/BioAlgs/MetaGen .
Project description:MotivationPleiotropic SNPs are associated with multiple traits. Such SNPs can help pinpoint biological processes with an effect on multiple traits or point to a shared etiology between traits. We present PolarMorphism, a new method for the identification of pleiotropic SNPs from genome-wide association studies (GWAS) summary statistics. PolarMorphism can be readily applied to more than two traits or whole trait domains. PolarMorphism makes use of the fact that trait-specific SNP effect sizes can be seen as Cartesian coordinates and can thus be converted to polar coordinates r (distance from the origin) and theta (angle with the Cartesian x-axis, in the case of two traits). r describes the overall effect of a SNP, while theta describes the extent to which a SNP is shared. r and theta are used to determine the significance of SNP sharedness, resulting in a P-value per SNP that can be used for further analysis.ResultsWe apply PolarMorphism to a large collection of publicly available GWAS summary statistics enabling the construction of a pleiotropy network that shows the extent to which traits share SNPs. We show how PolarMorphism can be used to gain insight into relationships between traits and trait domains and contrast it with genetic correlation. Furthermore, pathway analysis of the newly discovered pleiotropic SNPs demonstrates that analysis of more than two traits simultaneously yields more biologically relevant results than the combined results of pairwise analysis of the same traits. Finally, we show that PolarMorphism is more efficient and more powerful than previously published methods.Availability and implementationcode: https://github.com/UMCUGenetics/PolarMorphism, results: 10.5281/zenodo.5844193.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:Shared parameter models (SPMs) are a useful approach to addressing bias from informative dropout in longitudinal studies. In SPMs it is typically assumed that the longitudinal outcome process and the dropout time are independent, given random effects and observed covariates. However, this conditional independence assumption is unverifiable. Currently, sensitivity analysis strategies for this unverifiable assumption of SPMs are underdeveloped. In principle, parameters that can and cannot be identified by the observed data should be clearly separated in sensitivity analyses, and sensitivity parameters should not influence the model fit to the observed data. For SPMs this is difficult because it is not clear how to separate the observed data likelihood from the distribution of the missing data given the observed data (i.e., 'extrapolation distribution'). In this article, we propose a new approach for transparent sensitivity analyses for informative dropout that separates the observed data likelihood and the extrapolation distribution, using a typical SPM as a working model for the complete data generating mechanism. For this model, the default extrapolation distribution is a skew-normal distribution (i.e., it is available in a closed form). We propose anchoring the sensitivity analysis on the default extrapolation distribution under the specified SPM and calibrate the sensitivity parameters using the observed data for subjects who drop out. The proposed approach is used to address informative dropout in the HIV Epidemiology Research Study.
Project description:The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: We demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed "binning") algorithm which operates on multiple samples simultaneously, leveraging on the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.
Project description:The study of protein-protein interactions is essential to define the molecular networks that contribute to maintain homeostasis of an organism's body functions. Disruptions in protein interaction networks have been shown to result in diseases in both humans and animals. Monogenic diseases disrupting biochemical pathways such as hereditary coagulopathies (e.g. hemophilia), provided a deep insight in the biochemical pathways of acquired coagulopathies of complex diseases. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia. Similarly, more complex diseases such as different cancers have been shown to result from malfunctions of common proteins pathways. In order to discover, in high throughput, the molecular underpinnings of poorly characterized diseases, we present a statistical method to identify shared protein interaction network(s) between diseases. Integrating (i) a protein interaction network with (ii) disease to protein relationships derived from mining Gene Ontology annotations and the biomedical literature with natural language understanding (PhenoGO), we identified protein-protein interactions that were associated with pairs of diseases and calculated the statistical significance of the occurrence of interactions in the protein interaction knowledgebase. Significant correlations between diseases and shared protein networks were identified and evaluated in this study, demonstrating the high precision of the approach and correct non-trivial predictions, signifying the potential for discovery. In conclusion, we demonstrate that the associations between diseases are directly correlated to their underlying protein-protein interaction networks, possibly providing insight into the underlying molecular mechanisms of phenotypes and biological processes disrupted in related diseases.
Project description:BackgroundSequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus, can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied.ResultsWe studied several dissimilarity measures, including d2, d2* and d2S recently developed from our group, a measure (hereinafter noted as Hao) used in CVTree developed from Hao's group (Qi et al., 2004), measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009), as well as standard lp measures between the frequency vectors, for the comparison of metagenomic samples using sequence signatures. We compared their performance using a series of extensive simulations and three real next-generation sequencing (NGS) metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure d2S can achieve superior performance when comparing metagenomic samples by clustering them into different groups as well as recovering environmental gradients affecting microbial samples. New insights into the environmental factors affecting microbial compositions in metagenomic samples are obtained through the analyses. Our results show that sequence signatures of the mammalian gut are closely associated with diet and gut physiology of the mammals, and that sequence signatures of marine communities are closely related to location and temperature.ConclusionsSequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases. The d2S dissimilarity measure is a good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is quite robust within a range of choices for moderate sequencing depths.
Project description:Covering: up to 2021Metagenomics has yielded massive amounts of sequencing data offering a glimpse into the biosynthetic potential of the uncultivated microbial majority. While genome-resolved information about microbial communities from nearly every environment on earth is now available, the ability to accurately predict biocatalytic functions directly from sequencing data remains challenging. Compared to primary metabolic pathways, enzymes involved in secondary metabolism often catalyze specialized reactions with diverse substrates, making these pathways rich resources for the discovery of new enzymology. To date, functional insights gained from studies on environmental DNA (eDNA) have largely relied on PCR- or activity-based screening of eDNA fragments cloned in fosmid or cosmid libraries. As an alternative, shotgun metagenomics holds underexplored potential for the discovery of new enzymes directly from eDNA by avoiding common biases introduced through PCR- or activity-guided functional metagenomics workflows. However, inferring new enzyme functions directly from eDNA is similar to searching for a 'needle in a haystack' without direct links between genotype and phenotype. The goal of this review is to provide a roadmap to navigate shotgun metagenomic sequencing data and identify new candidate biosynthetic enzymes. We cover both computational and experimental strategies to mine metagenomes and explore protein sequence space with a spotlight on natural product biosynthesis. Specifically, we compare in silico methods for enzyme discovery including phylogenetics, sequence similarity networks, genomic context, 3D structure-based approaches, and machine learning techniques. We also discuss various experimental strategies to test computational predictions including heterologous expression and screening. Finally, we provide an outlook for future directions in the field with an emphasis on meta-omics, single-cell genomics, cell-free expression systems, and sequence-independent methods.
Project description:CRISPR-associated Tn7 transposons (CASTs) co-opt cas genes for RNA-guided transposition. CASTs are exceedingly rare in genomic databases; recent surveys have reported Tn7-like transposons that co-opt Type I-F, I-B, and V-K CRISPR effectors. Here, we expand the diversity of reported CAST systems via a bioinformatic search of metagenomic databases. We discover architectures for all known CASTs, including arrangements of the Cascade effectors, target homing modalities, and minimal V-K systems. We also describe families of CASTs that have co-opted the Type I-C and Type IV CRISPR-Cas systems. Our search for non-Tn7 CASTs identifies putative candidates that include a nuclease dead Cas12. These systems shed light on how CRISPR systems have coevolved with transposases and expand the programmable gene-editing toolkit.