Dataset Information

A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis.

ABSTRACT:

Background

Data errors, including sample swapping and mis-labeling, are inevitable in the process of large-scale omics data generation. Data errors need to be identified and corrected before integrative data analyses where different types of data are merged on the basis of the annotated labels. Data with labeling errors dampen true biological signals. More importantly, data analysis with sample errors could lead to wrong scientific conclusions. We developed a robust probabilistic multi-omics data matching procedure, proMODMatcher, to curate data and identify and correct data annotation and errors in large databases.

Results

Application to simulated datasets suggests that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Application of our proMODMatcher to multi-omics datasets in The Cancer Genome Atlas and International Cancer Genome Consortium identified sample errors in multiple cancer datasets. Our procedure was not only able to identify sample-labeling errors but also to unambiguously identify the source of the errors. Our results demonstrate that these errors should be identified and corrected before integrative analysis.

Conclusions

Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be corrected before integrative analysis.

SUBMITTER: Lee E

PROVIDER: S-EPMC6615984 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis.

Lee Eunjee E Yoo Seungyeul S Wang Wenhui W Tu Zhidong Z Zhu Jun J

GigaScience 20190701 7

<h4>Background</h4>Data errors, including sample swapping and mis-labeling, are inevitable in the process of large-scale omics data generation. Data errors need to be identified and corrected before integrative data analyses where different types of data are merged on the basis of the annotated labels. Data with labeling errors dampen true biological signals. More importantly, data analysis with sample errors could lead to wrong scientific conclusions. We developed a robust probabilistic multi-o ...[more]

PMID: 31289834

Similar Datasets

Project description:ObjectiveTo assess the accuracy of probabilistic deep learning models to discriminate normal eyes and eyes with glaucoma from fundus photographs and visual fields.DesignAlgorithm development for discriminating normal and glaucoma eyes using data from multicenter, cross-sectional, case-control study.Subjects and participantsFundus photograph and visual field data from 1,655 eyes of 929 normal and glaucoma subjects to develop and test deep learning models and an independent group of 196 eyes of 98 normal and glaucoma patients to validate deep learning models.Main outcome measuresAccuracy and area under the receiver-operating characteristic curve (AUC).MethodsFundus photographs and OCT images were carefully examined by clinicians to identify glaucomatous optic neuropathy (GON). When GON was detected by the reader, the finding was further evaluated by another clinician. Three probabilistic deep convolutional neural network (CNN) models were developed using 1,655 fundus photographs, 1,655 visual fields, and 1,655 pairs of fundus photographs and visual fields collected from Compass instruments. Deep learning models were trained and tested using 80% of fundus photographs and visual fields for training set and 20% of the data for testing set. Models were further validated using an independent validation dataset. The performance of the probabilistic deep learning model was compared with that of the corresponding deterministic CNN model.ResultsThe AUC of the deep learning model in detecting glaucoma from fundus photographs, visual fields, and combined modalities using development dataset were 0.90 (95% confidence interval: 0.89-0.92), 0.89 (0.88-0.91), and 0.94 (0.92-0.96), respectively. The AUC of the deep learning model in detecting glaucoma from fundus photographs, visual fields, and both modalities using the independent validation dataset were 0.94 (0.92-0.95), 0.98 (0.98-0.99), and 0.98 (0.98-0.99), respectively. The AUC of the deep learning model in detecting glaucoma from fundus photographs, visual fields, and both modalities using an early glaucoma subset were 0.90 (0.88,0.91), 0.74 (0.73,0.75), 0.91 (0.89,0.93), respectively. Eyes that were misclassified had significantly higher uncertainty in likelihood of diagnosis compared to eyes that were classified correctly. The uncertainty level of the correctly classified eyes is much lower in the combined model compared to the model based on visual fields only. The AUCs of the deterministic CNN model using fundus images, visual field, and combined modalities based on the development dataset were 0.87 (0.85,0.90), 0.88 (0.84,0.91), and 0.91 (0.89,0.94), and the AUCs based on the independent validation dataset were 0.91 (0.89,0.93), 0.97 (0.95,0.99), and 0.97 (0.96,0.99), respectively, while the AUCs based on an early glaucoma subset were 0.88 (0.86,0.91), 0.75 (0.73,0.77), and 0.92 (0.89,0.95), respectively.Conclusion and relevanceProbabilistic deep learning models can detect glaucoma from multi-modal data with high accuracy. Our findings suggest that models based on combined visual field and fundus photograph modalities detects glaucoma with higher accuracy. While probabilistic and deterministic CNN models provided similar performance, probabilistic models generate certainty level of the outcome thus providing another level of confidence in decision making.

Dataset Information

A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis.

Background

Results

Conclusions

Publications

A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets