
Dataset Information


Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data.


ABSTRACT: BACKGROUND: Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification method based on distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen. RESULTS: The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy, with false positive rates ≤5%. Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
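The abstract reports results in terms of positive selection rate, false discovery rate, and false positive rate but does not define them here. The following is a minimal sketch of how such metrics are commonly computed on simulated data where the truth is known. The function names, the index-based inputs, and the exact definitions are assumptions for illustration, not the authors' code.

```python
def selection_metrics(selected, truth):
    """Variable selection accuracy against known true variables.

    Assumed definitions (not taken from the paper's code):
      positive selection rate = true variables selected / all true variables
      false discovery rate    = false variables selected / all selected variables
    """
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    psr = true_pos / len(truth) if truth else 0.0
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return psr, fdr


def outlier_metrics(flagged, outliers, n_samples):
    """Outlier detection accuracy against known mislabeled samples.

    Assumed definitions:
      sensitivity         = outliers flagged / all outliers
      false positive rate = clean samples flagged / all clean samples
    """
    flagged, outliers = set(flagged), set(outliers)
    true_pos = len(flagged & outliers)
    false_pos = len(flagged - outliers)
    n_clean = n_samples - len(outliers)
    sensitivity = true_pos / len(outliers) if outliers else 0.0
    fpr = false_pos / n_clean if n_clean else 0.0
    return sensitivity, fpr
```

For example, `selection_metrics([0, 1, 2, 7], [0, 1, 2, 3])` gives a positive selection rate of 0.75 and a false discovery rate of 0.25, since three of the four true variables are selected and one of the four selected variables is false.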

SUBMITTER: Sun H 

PROVIDER: S-EPMC7646480 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature


Publications

Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data.

Sun Hongwei, Cui Yuehua, Wang Hui, Liu Haixia, Wang Tong

BMC Bioinformatics, 2020-08-14, issue 1



Similar Datasets

| S-EPMC8384175 | biostudies-literature
| S-EPMC9775581 | biostudies-literature
| 2745213 | ecrin-mdr-crc
| S-EPMC5305221 | biostudies-literature
| S-EPMC8122584 | biostudies-literature
| PRJEB42541 | ENA
| S-EPMC3838370 | biostudies-literature
| S-EPMC8981526 | biostudies-literature
| S-EPMC10366886 | biostudies-literature
| S-EPMC10617639 | biostudies-literature