Dataset Information

CoMeta: classification of metagenomes using k-mers.

ABSTRACT: The study of environmental samples is developing rapidly. Characterizing the composition of an environment broadens our knowledge of the relationship between species composition and environmental conditions. An important step in determining the composition of a sample is to compare the extracted DNA fragments with sequences derived from known organisms. In the presented paper, we introduce an algorithm called CoMeta (Classification of metagenomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, each group corresponds to a taxon at a chosen taxonomic rank (e.g., phylum, genus); however, the prepared groups may also contain sequences having various functions. In CoMeta, we use an exact method for read classification based on short subsequences (k-mers) and a fast program for indexing large sets of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.
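
The idea described in the abstract (exact k-mer matching against user-prepared groups, starting at the top of the taxonomy tree and descending only into the best-matching group) can be illustrated with a minimal sketch. This is not CoMeta's implementation: the function names, the scoring rule (fraction of the read's k-mers found in a group), the toy sequences, and the group names are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(groups, k):
    """Map each group name to the set of k-mers in its reference sequences."""
    return {
        name: set().union(*(kmers(s, k) for s in seqs))
        for name, seqs in groups.items()
    }


def score(read, group_kmers, k):
    """Fraction of the read's k-mers that occur exactly in the group (assumed scoring rule)."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & group_kmers) / len(read_kmers)


def classify(read, index, k, candidates):
    """Return the best-scoring candidate group, or None if no k-mer matches."""
    best, best_score = None, 0.0
    for name in candidates:
        s = score(read, index[name], k)
        if s > best_score:
            best, best_score = name, s
    return best


def classify_top_down(read, index, children, roots, k):
    """Classify at the top rank first, then only among children of the winner."""
    path, candidates = [], roots
    while candidates:
        best = classify(read, index, k, candidates)
        if best is None:
            break
        path.append(best)
        candidates = children.get(best, [])
    return path


# Hypothetical toy references: two phyla, three genera.
genera = {
    "genusA1": ["ACGTACGTGGA"],
    "genusA2": ["TTGACCAGTAC"],
    "genusB1": ["GGGCCCAAATT"],
}
children = {"phylumA": ["genusA1", "genusA2"], "phylumB": ["genusB1"]}
groups = dict(genera)
for phylum, kids in children.items():
    groups[phylum] = [s for g in kids for s in genera[g]]

K = 4
index = build_index(groups, K)
result = classify_top_down("GACCAGT", index, children, ["phylumA", "phylumB"], K)
print(result)
```

In this sketch the read is compared with two phylum-level groups and then only with the two genera under the winning phylum, rather than with every genus, which is the reduction in comparisons that the abstract attributes to the top-down strategy.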

SUBMITTER: Kawulok J

PROVIDER: S-EPMC4401624 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature

Publications

CoMeta: classification of metagenomes using k-mers.

Kawulok J, Deorowicz S

PLoS ONE, 2015-04-17, issue 4

PMID: 25884504

Similar Datasets

Project description: BACKGROUND: Comparison and classification of metagenome samples is one of the major tasks in the study of microbial communities of natural environments or niches on human bodies. Bioinformatics methods play important roles in this task, including 16S rRNA gene analysis and some alignment-based or alignment-free methods on metagenomic data. Alignment-free methods have the advantage of not depending on known genome annotations and therefore have high potential in studying complicated microbiomes. However, the existing alignment-free methods are all based on unsupervised learning strategies (e.g., PCA or hierarchical clustering). These types of methods are powerful in revealing major similarities and grouping relations between microbiome samples, but cannot be applied to discriminate predefined classes of interest, which might not be the dominating assortment in the data. Supervised classification is needed in the latter scenario, with the goal of classifying samples into predefined classes and finding the features that can discriminate the classes. The effectiveness of supervised classification with alignment-based features on metagenomic data has been shown in some recent studies. The application of alignment-free supervised classification methods to metagenome data has not been well explored yet. RESULTS: We developed a method for this task using k-tuple frequencies as features counted directly from metagenome short reads and the R-SVM (Recursive SVM) for feature selection and classification. We tested our method on a simulation dataset, a real dataset composed of several known genomes, and a real metagenome NGS short reads dataset. Experiments on simulated data showed that the method can classify the classes almost perfectly and can recover major sequence signatures that distinguish the two classes.
On the real human gut metagenome data, the method can discriminate samples of inflammatory bowel disease (IBD) patients from control samples with high accuracy, which cannot be separated when comparing the samples with unsupervised clustering approaches. CONCLUSIONS: The proposed alignment-free supervised classification method can perform well in discriminating metagenomic samples of predefined classes and in selecting characteristic sequence features for the discrimination. This study serves as an example of the feasibility of using metagenome sequence features of microbiomes on human bodies to study specific human health conditions using supervised machine learning methods.

Project description: Microbial communities play key roles in ocean ecosystems through regulation of biogeochemical processes such as carbon and nutrient cycling, food web dynamics, and gut microbiomes of invertebrates, fish, reptiles, and mammals. Assessments of marine microbial diversity are therefore critical to understanding spatiotemporal variations in microbial community structure and function in ocean ecosystems. With recent advances in DNA shotgun sequencing for metagenome samples and computational analysis, it is now possible to access the taxonomic and genomic content of ocean microbial communities to study their structural patterns, diversity, and functional potential. However, existing taxonomic classification tools depend upon manually curated phylogenetic trees, which can create inaccuracies in metagenomes from less well-characterized communities, such as from ocean water. Herein, we explore the utility of deep learning tools (DeepMicrobes and a novel Residual Network architecture) that leverage natural language processing and convolutional neural network architectures to map input sequence data (k-mers) to output labels (taxonomic groups) without reliance on a curated taxonomic tree. We trained both models using metagenomic reads simulated from marine microbial genomes in the MarRef database. The performance of both models (accuracy, precision, and percent microbe predicted) was compared with the standard taxonomic classification tool Kraken2 using 10 complex metagenomic data sets simulated from MarRef. Our results demonstrate that time, compute power, and microbial genomic diversity still pose challenges for machine learning (ML). Moreover, our results suggest that high genome coverage and rectification of class imbalance are prerequisites for a well-trained model, and therefore should be a major consideration in future ML work.
IMPORTANCE Taxonomic profiling of microbial communities is essential to model microbial interactions and inform habitat conservation. This work develops approaches in constructing training/testing data sets from publicly available marine metagenomes and evaluates the performance of machine learning (ML) approaches in read-based taxonomic classification of marine metagenomes. Predictions from two models are used to test accuracy in metagenomic classification and to guide improvements in ML approaches. Our study provides insights on the methods, results, and challenges of deep learning on marine microbial metagenomic data sets. Future machine learning approaches can be improved by rectifying genome coverage and class imbalance in the training data sets, developing alternative models, and increasing the accessibility of computational resources for model training and refinement.

Project description: Background: For a sustainable production of food, research on agricultural soil microbial communities is inevitable. Due to its immense complexity, soil is still some kind of black box. Soil study designs for identifying microbiome members of relevance have various scopes and focus on particular environmental factors. To identify common features of soil microbiomes, data from multiple studies should be compiled and processed. Taxonomic compositions and functional capabilities of microbial communities associated with soils and plants have been identified and characterized in the past few decades. From a fertile Loess-Chernozem-type soil located in Germany, metagenomically assembled genomes (MAGs) classified as members of the phylum Thaumarchaeota/Thermoproteota were obtained. These possibly represent keystone agricultural soil community members encoding functions of relevance for soil fertility and plant health. Their importance for the analyzed microbiomes is corroborated by the fact that they were predicted to contribute to the cycling of nitrogen, feature the genetic potential to fix carbon dioxide and possess genes with predicted functions in plant-growth-promotion (PGP). To expand the knowledge on soil community members belonging to the phylum Thaumarchaeota, we conducted a meta-analysis integrating primary studies on European agricultural soil microbiomes. Results: Taxonomic classification of the selected soil metagenomes revealed the shared agricultural soil core microbiome of European soils from 19 locations. Metadata reporting was heterogeneous between the different studies. According to the available metadata, we separated the data into 68 treatments. The phylum Thaumarchaeota is part of the core microbiome and represents a major constituent of the archaeal subcommunities in all European agricultural soils. At a higher taxonomic resolution, 2074 genera constituted the core microbiome.
We observed that viral genera strongly contribute to variation in taxonomic profiles. By binning of metagenomically assembled contigs, Thaumarchaeota MAGs could be recovered from several European soil metagenomes. Notably, many of them were classified as members of the family Nitrososphaeraceae, highlighting the importance of this family for agricultural soils. The specific Loess-Chernozem Thaumarchaeota MAGs were most abundant in their original soil, but also seem to be of importance in other agricultural soil microbial communities. Metabolic reconstruction of Switzerland_1_MAG_2 revealed its genetic potential i.a. regarding carbon dioxide (CO2) fixation, ammonia oxidation, exopolysaccharide production and a beneficial effect on plant growth. Similar genetic features were also present in other reconstructed MAGs. Three Nitrososphaeraceae MAGs are all most likely members of a so far unknown genus. Conclusions: On a broad view, European agricultural soil microbiomes are similarly structured. Differences in community structure were observable, although analysis was complicated by heterogeneity in metadata recording. Our study highlights the need for standardized metadata reporting and the benefits of networking open data. Future soil sequencing studies should also consider high sequencing depths in order to enable reconstruction of genome bins. Intriguingly, the family Nitrososphaeraceae commonly seems to be of importance in agricultural microbiomes.
