Project description: Background: The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular method is deployed for scrutinizing metagenomic samples from environments such as water, soil, or the human gut. Novel methods: Here, we present novel and, more importantly, highly scalable methods for analyzing phylogenetic placements of metagenomic samples. More specifically, we introduce methods for (a) visualizing differences between samples and their correlation with associated meta-data on the reference phylogeny, (b) clustering similar samples using a variant of the k-means method, and (c) finding phylogenetic factors using an adaptation of the Phylofactorization method. These methods make it possible to interpret metagenomic data in a phylogenetic context, to find patterns in the data, and to identify branches of the phylogeny that are driving these patterns. Results: To demonstrate the scalability and utility of our methods, as well as to provide exemplary interpretations of our methods, we applied them to 3 publicly available datasets comprising 9782 samples with a total of approximately 168 million sequences. The results indicate that new biological insights can be attained via our methods.
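As a rough illustration of clustering samples with k-means, the sketch below runs plain Lloyd's k-means with squared Euclidean distance. It is not the variant the abstract refers to, whose details are not given here. The per-sample vectors (a hypothetical fraction of placement mass on each of three reference branches), the function names, and the choice of initial centroids are all assumptions made for this example.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sqdist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical samples: fraction of placement mass on three branches.
samples = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
# Assumed initialisation: use the first and third sample as centroids.
labels, centroids = kmeans(samples, [samples[0], samples[2]])
```

With these toy vectors the first two samples end up in one cluster and the last two in the other.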
Project description: SUMMARY: We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. AVAILABILITY AND IMPLEMENTATION: Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description: Motivation: Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. Results: We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between pairs of samples, each represented as a k-mer set. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. Availability and implementation: The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data. Supplementary information: Supplementary data are available at Bioinformatics online.
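The Jaccard index between two k-mer sets, on which the model above is based, can be sketched as follows. This is a minimal illustration and not MISA's implementation. The toy sequences, the value of k, the function names, and the modelling of a mixed sample as the union of its constituents' k-mer sets are assumptions made for this example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; taken as 1.0 for two empty sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Toy sequences, made up for illustration only.
reference = "ACGTACGTGA"
organism_1 = "ACGTACGTGA"   # identical to the reference
organism_2 = "TTTTCCCCGG"   # shares no 4-mer with the reference
k = 4

# Assumption: the mixed sample contains every k-mer of both constituents.
mixed = kmers(organism_1, k) | kmers(organism_2, k)

j_ref_org1 = jaccard(kmers(reference, k), kmers(organism_1, k))
j_ref_mixed = jaccard(kmers(reference, k), mixed)
```

In this toy case the reference matches the first organism exactly, yet its Jaccard index against the mixture is lower, because the second organism's k-mers enlarge the union. That is the kind of effect a model relating mixture distances to constituent distances has to account for.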
Project description: Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled reference or aligned queries using kmer-based distances, a scenario that ML cannot handle. APPLES is available publicly at github.com/balabanmetin/apples.
Project description: Motivation: In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do however face certain limitations: The manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results. Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. Availability and implementation: Freely available under GPLv3 at http://github.com/lczech/gappa. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description: Background: Ongoing innovation in phylogenetics and evolutionary biology has been accompanied by a proliferation of software tools, data formats, analytical techniques and web servers. This brings with it the challenge of integrating phylogenetic and other related biological data found in a wide variety of formats, and underlines the need for reusable software that can read, manipulate and transform this information into the various forms required to build computational pipelines. Results: We built a Python software library for working with phylogenetic data that is tightly integrated with Biopython, a broad-ranging toolkit for computational biology. Our library, Bio.Phylo, is highly interoperable with existing libraries, tools and standards, and is capable of parsing common file formats for phylogenetic trees, performing basic transformations and manipulations, attaching rich annotations, and visualizing trees. We unified the modules for working with the standard file formats Newick, NEXUS and phyloXML behind a consistent and simple API, providing a common set of functionality independent of the data source. Conclusions: Bio.Phylo meets a growing need in bioinformatics for working with heterogeneous types of phylogenetic data. By supporting interoperability with multiple file formats and leveraging existing Biopython features, this library simplifies the construction of phylogenetic workflows. We also provide examples of the benefits of building a community around a shared open-source project. Bio.Phylo is included with Biopython, available through the Biopython website, http://biopython.org.
Project description: Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is the availability of computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level.
While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.
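The Fisher's exact test mentioned above for sparsely-sampled features can be sketched on a single 2x2 table in pure Python. This is a generic two-sided test, not Metastats' code. How Metastats builds its contingency tables is not stated in this text, so the layout assumed here (rows are the two populations, columns are counts of one feature versus all other counts) and the function name are assumptions for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical sparse feature: seen 3 times in population 1 (out of 3 counts)
# and never in population 2 (out of 3 counts).
p_sparse = fisher_exact_two_sided(3, 0, 0, 3)
# A perfectly balanced table gives no evidence of a difference.
p_balanced = fisher_exact_two_sided(1, 1, 1, 1)
```

In an analysis with many features, the p-values from such tests would then need a multiple-testing correction, which is where a false discovery rate procedure comes in.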
Project description: Background: Numerous protocols for viral enrichment and genome amplification have been created. However, the direct identification of viral genomes from clinical specimens using next-generation sequencing (NGS) still has its challenges. As a selected viral nucleic acid extraction method may determine the sensitivity and reliability of NGS, it is still valuable to evaluate the extraction efficiency of different extraction kits using clinical specimens directly. Results: In this study, we performed qRT-PCR and viral metagenomic analysis of the extraction efficiency of four commonly used Qiagen extraction kits: QIAamp Viral RNA Mini Kit (VRMK), QIAamp MinElute Virus Spin Kit (MVSK), RNeasy Mini Kit (RMK), and RNeasy Plus Micro Kit (RPMK), using a mixed respiratory clinical sample without any pre-treatment. This sample contained an adenovirus (ADV), influenza virus A (Flu A), human parainfluenza virus 3 (PIV3), human coronavirus OC43 (OC43), and human metapneumovirus (HMPV). The quantity and quality of the viral extracts were significantly different among these kits. The highest threshold cycle (Ct) values for ADV and OC43 were obtained by using the RPMK. The MVSK had the lowest Ct values for ADV and PIV3. The RMK revealed the lowest detectability for HMPV and PIV3. The most effective rate of NGS data at 67.47% was observed with the RPMK. The effective rates of NGS data for the other three kits ranged from 12.1% to 26.79%. Most importantly, compared to the other three kits the highest proportion of non-host reads was obtained by the RPMK. The MVSK performed best with the lowest Ct value of 20.5 in the extraction of ADV, while the RMK revealed the best extraction efficiency by NGS analysis. Conclusions: The evaluation of viral nucleic acid extraction efficiency is different between NGS and qRT-PCR analysis.
The RPMK was most applicable for the metagenomic analysis of viral RNA and enabled more sensitive identification of the RNA virus genome in respiratory clinical samples. In addition, viral RNA extraction kits were also applicable for metagenomic analysis of the DNA virus. Our results highlighted the importance of nucleic acid extraction kit selection, which has a major impact on the yield and number of viral reads by NGS analysis. Therefore, the choice of extraction method for a given viral pathogen needs to be carefully considered.
Project description: Unraveling the transcriptional programs that control how cells divide, differentiate, and respond to their environments requires a precise understanding of transcription factors' (TFs) DNA-binding activities. Calling cards (CC) technology uses transposons to capture transient TF binding events at one instant in time and then read them out at a later time. This methodology can also be used to simultaneously measure transcription factor binding and mRNA expression from single cells (single-cell CC) and to record and integrate TF binding events across time in any cell type of interest without the need for purification. Despite these unique advantages, there has been a lack of dedicated bioinformatics tools for the detailed analysis of CC data. Here, we introduce Pycallingcards, a comprehensive Python module specifically designed for the analysis of single-cell and bulk CC data across multiple species. The package introduces two innovative peak callers, CCcaller and MACCs, enhancing the accuracy and speed of pinpointing TF binding sites from CC data. Pycallingcards offers a fully integrated environment for data visualization, motif finding, and comparative analysis with RNA-seq and ChIP-seq datasets. To illustrate its practical application, we have reanalyzed previously published mouse cortex and glioblastoma datasets. This analysis revealed novel cell-type specific binding sites and potential sex-linked TF regulators, furthering our understanding of TF binding and gene expression relationships. Thus, Pycallingcards, with its user-friendly design and seamless interface with the Python data science ecosystem, stands as a critical tool for advancing the analysis of TF function via CC data.
Project description: Alternative splicing is an important mechanism for increasing protein diversity. However, its functional effects are largely unknown. Here, we present our new software workflow composed of the open-source application AltAnalyze and the Cytoscape plugin DomainGraph. Both programs provide an intuitive and comprehensive end-to-end solution for the analysis and visualization of alternative splicing data from Affymetrix Exon and Gene Arrays at the level of proteins, domains, microRNA binding sites, molecular interactions and pathways. Our software tools include easy-to-use graphical user interfaces, rigorous statistical methods (FIRMA, MiDAS and DABG filtering) and do not require prior knowledge of exon array analysis or programming. They provide new methods for automatic interpretation and visualization of the effects of alternative exon inclusion on protein domain composition and microRNA binding sites. These data can be visualized together with affected pathways and gene or protein interaction networks, allowing a straightforward identification of potential biological effects due to alternative splicing at different levels of granularity. Our programs are available at http://www.altanalyze.org and http://www.domaingraph.de. These websites also include extensive documentation, tutorials and sample data.