Dataset Information

Massive metagenomic data analysis using abundance-based machine learning.

ABSTRACT:

Background

Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments, and it is widely used to survey the communities of microbial organisms that live in diverse ecosystems. To understand the metagenomic profile of one of the densest interaction spaces for millions of people, the public transit system, the MetaSUB International Consortium has collected and sequenced metagenomes from subways of different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open data analysis challenge that includes, but is not limited to, the identification of unknown samples.

Results

To distinguish the metagenomic profiles of different cities and to predict the origin of unknown samples from those profiles, two machine learning approaches are proposed: a read-based method that builds a taxonomic profile of each sample and predicts from it, and a reduced representation assembly-based method. Among the machine learning techniques tested, random forest showed promising results as a classifier for both approaches. Random forest models developed from read-based taxonomic profiling achieved an accuracy of 91%, with a 95% confidence interval between 80 and 93%. The assembly-based random forest model also reached 90% accuracy. However, both models achieved roughly the same accuracy on the test set, where both failed to predict the most abundant label.
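The abstract does not say how the 95% confidence interval on accuracy was computed or how many samples were held out. As a minimal sketch, assuming a Wilson score interval for a binomial proportion and a hypothetical held-out set of 100 samples, an interval around an observed accuracy could be computed as follows. The function name `wilson_interval` and the counts are illustrative and are not taken from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a classification accuracy.

    correct: number of correctly predicted samples (hypothetical input)
    total:   number of evaluated samples (hypothetical input)
    z:       normal quantile; 1.96 corresponds to a two-sided 95% interval
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total                      # observed accuracy
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Assumed example: 91 of 100 held-out samples assigned to the correct city.
low, high = wilson_interval(91, 100)
print(f"accuracy = 0.91, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval is asymmetric around the point estimate, which is typical when accuracy is close to 100% and the evaluation set is small.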

Conclusion

Our results suggest that both read-based and assembly-based approaches are powerful tools for the analysis of metagenomics data. Moreover, our results suggest that reduced representation assembly-based methods are able to simultaneously provide high-accuracy prediction on available data. Overall, we show that metagenomic samples can be traced back to their location by carefully generating features from the composition of microbes and applying existing machine learning algorithms. The proposed approaches show high prediction accuracy, but their output requires careful inspection before any decision is made, because of sample noise and complexity.

Reviewers

This article was reviewed by Eugene V. Koonin, Jing Zhou and Serghei Mangul.

SUBMITTER: Harris ZN

PROVIDER: S-EPMC6676585 | biostudies-literature | 2019 Aug

REPOSITORIES: biostudies-literature

Publications

Massive metagenomic data analysis using abundance-based machine learning.

Harris ZN, Dhungel E, Mosior M, Ahn TH

Biology Direct, 2019-08-01, issue 1

PMID: 31370905

Similar Datasets

Project description:

Background: The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated the influence of variable technical, analytical and machine learning approaches for result interpretation and novel source prediction.

Results: Comparison between 16S rRNA amplicon and shotgun sequencing approaches as well as metagenomic analytical tools showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed using Kraken2 and Bracken, for taxonomic annotation, had higher detection sensitivity. As classification models are limited to labeling pre-trained origins, we took an alternative approach using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, the prediction errors were much higher in Leave-1-city-out than in 10-fold cross validation; the former realistically forecast the increased difficulty of accurately predicting samples from new origins. This challenge was further confirmed when applying the model to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Due to higher prediction error rates for samples from new origins, we provided an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.

Conclusions: Herein, we highlight the capacity of predicting sample origin accurately with pre-trained origins and the challenge of predicting new origins through both regression and classification models. Overall, this work provides a summary of the impact of sequencing technique, protocol, taxonomic analytical approaches, and machine learning approaches on the use of metagenomics for prediction of sample origin.
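The contrast between Leave-1-city-out and 10-fold cross validation comes down to how samples are split. As a minimal sketch, assuming each sample carries a city label, the two splitting schemes could be written as below. The helper names `leave_one_city_out` and `k_fold` and the toy city labels are hypothetical, and this is not the implementation used in the study.

```python
from collections import defaultdict
from typing import Iterator, Sequence


def leave_one_city_out(cities: Sequence[str]) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held_out_city, train_indices, test_indices).

    Every sample from the held-out city goes to the test split, so the
    model never sees that origin during training.
    """
    by_city: dict[str, list[int]] = defaultdict(list)
    for idx, city in enumerate(cities):
        by_city[city].append(idx)
    for held_out in sorted(by_city):
        train_idx = [i for i, c in enumerate(cities) if c != held_out]
        yield held_out, train_idx, by_city[held_out]


def k_fold(n_samples: int, k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for a simple interleaved k-fold split.

    Samples from the same city usually land in both splits, so the model
    is tested on origins it has already seen.
    """
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


# Toy labels (assumed): six samples from three cities.
cities = ["A", "A", "B", "C", "C", "C"]
for city, train_idx, test_idx in leave_one_city_out(cities):
    print(city, "train:", train_idx, "test:", test_idx)
```

Because the held-out city contributes no training samples, error under this scheme approximates performance on a genuinely new origin, which is consistent with the higher errors reported above.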

Project description:

Background: Combining machine learning (ML) with gait analysis is widely applicable for diagnosing abnormal gait patterns.

Objective: To analyze gait adaptability characteristics in stroke patients, develop ML models to identify individuals with GAD, and select optimal diagnostic models and key classification features.

Methods: This study included 30 stroke patients (mean age 42.69 years, 60% male) and 50 healthy adults (mean age 41.34 years, 58% male). Gait adaptability was assessed using a CMill treadmill on gait adaptation tasks: target stepping, slalom walking, obstacle avoidance, and speed adaptation. The preliminary analysis of variables in both groups was conducted using t-tests and Pearson correlation. Features were extracted from demographics, gait kinematics, and gait adaptability datasets. ML models based on Support Vector Machine, Decision Tree, Multi-layer Perceptron, K-Nearest Neighbors, and AdaCost algorithms were trained to classify individuals with and without GAD. Model performance was evaluated using accuracy (ACC), sensitivity (SEN), F1-score and the area under the receiver operating characteristic (ROC) curve (AUC).

Results: The stroke group showed significantly decreased gait speed (p = 0.000) and step length (SL) (p = 0.000), while the asymmetry of SL (p = 0.000) and ST (p = 0.000) was higher than in the healthy group. Performance on the gait adaptation tasks decreased significantly in slalom walking (p = 0.000), obstacle avoidance (p = 0.000), and speed adaptation (p = 0.000). Gait speed (p = 0.000) and obstacle avoidance (p = 0.000) were significantly correlated with global F-A score in stroke patients. AdaCost demonstrated the best classification performance, with an ACC of 0.85, SEN of 0.80, F1-score of 0.77, and ROC-AUC of 0.75. Obstacle avoidance and gait speed were identified as critical features in this model.

Conclusion: Stroke patients walk slower, with shorter SL and more asymmetry of SL and ST. Their gait adaptability was decreased, particularly in obstacle avoidance and speed adaptation. Faster gait speed and better obstacle avoidance were correlated with better functional mobility. AdaCost identifies individuals with GAD and facilitates clinical decision-making. This supports the future development of user-friendly interfaces and computer-aided diagnosis systems.
