
Dataset Information


Performance comparisons between clustering models for reconstructing NGS results from technical replicates.


ABSTRACT: To improve the performance of individual DNA sequencing results, researchers often use replicates from the same individual and various statistical clustering models to reconstruct a high-performance callset. Here, three technical replicates of genome NA12878 were considered and five model types were compared (consensus, latent class, Gaussian mixture, Kamila-adapted k-means, and random forest) on four performance indicators: sensitivity, precision, accuracy, and F1-score. Compared with using no combination model, i) the consensus model improved precision by 0.1%; ii) the latent class model improved precision by 1% (from 97% to 98%) without compromising sensitivity (98.9%); iii) the Gaussian mixture model and random forest provided callsets with higher precision (both >99%) but lower sensitivity; iv) Kamila increased precision (>99%) while keeping a high sensitivity (98.8%), and showed the best overall performance. According to the precision and F1-score indicators, the compared non-supervised clustering models that combine multiple callsets are able to improve sequencing performance vs. previously used supervised models. Among the models compared, the Gaussian mixture model and Kamila offered non-negligible precision and F1-score improvements. These models may thus be recommended for callset reconstruction (from either biological or technical replicates) for diagnostic or precision medicine purposes.
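
As a rough illustration of the quantities named in the abstract, the following sketch combines three replicate callsets with a simple majority vote and computes sensitivity, precision, accuracy, and F1-score against a truth set. It is not the authors' implementation: the majority-vote rule (`min_support=2`), the variant keys, the toy callsets, and the choice of the candidate-site universe (needed to count true negatives for accuracy) are all assumptions made for illustration.

```python
from collections import Counter


def consensus_callset(replicates, min_support=2):
    """Keep variants reported by at least `min_support` replicates.

    Assumption: a majority vote stands in for the 'consensus' model;
    the abstract does not specify the exact rule.
    """
    counts = Counter(v for calls in replicates for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}


def indicators(called, truth, universe):
    """Compute the four indicators from set overlaps.

    `universe` is the (assumed) set of all candidate sites; it is only
    needed to count true negatives for accuracy.
    """
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(universe - called - truth)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical toy data: variant keys and calls are invented.
truth = {"chr1:100A>G", "chr1:200C>T", "chr2:50G>A", "chr3:10T>C"}
rep1 = {"chr1:100A>G", "chr1:200C>T", "chr2:50G>A", "chr4:5A>T"}
rep2 = {"chr1:100A>G", "chr1:200C>T", "chr5:7G>C"}
rep3 = {"chr1:100A>G", "chr2:50G>A", "chr3:10T>C", "chr4:5A>T"}

universe = truth | rep1 | rep2 | rep3
consensus = consensus_callset([rep1, rep2, rep3])
metrics = indicators(consensus, truth, universe)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On real data, the callsets would typically be parsed from VCF files and the truth set taken from a reference callset for NA12878; those steps are omitted here.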

SUBMITTER: Zhai Y 

PROVIDER: S-EPMC10060969 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature


Publications

Performance comparisons between clustering models for reconstructing NGS results from technical replicates.

Zhai Y, Bardel C, Vallée M, Iwaz J, Roy P

Frontiers in Genetics, 2023-03-16



Similar Datasets

| S-EPMC8086011 | biostudies-literature
2009-02-20 | GSE12118 | GEO
| S-EPMC2262854 | biostudies-literature
| S-EPMC5864873 | biostudies-literature
2014-02-19 | GSE52731 | GEO
2009-02-19 | E-GEOD-12118 | biostudies-arrayexpress
| S-EPMC7667810 | biostudies-literature
2018-04-03 | GSE95155 | GEO
2016-08-16 | GSE81359 | GEO
| S-EPMC5115855 | biostudies-other