Dataset Information

Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance.

ABSTRACT:

Background

More than 25 tools are dedicated to detecting Copy Number Variants (CNVs) from Whole Exome Sequencing (WES) data based on read depth analysis. These tools typically proceed in several steps: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) the actual CNV calling. The essential part of the process is the normalization stage, in which systematic errors and biases are removed and a reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance.
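
To make the role of the reference sample set in normalization concrete, the following is a minimal sketch, not the method of any particular tool. It assumes the read depths have already been scaled for library size, and it expresses each target's depth as a log2 ratio against the median depth of the reference samples. The function name `log2_ratios`, the `pseudocount` parameter and all depth values are hypothetical.

```python
import math
from statistics import median


def log2_ratios(sample_depths, reference_depths, pseudocount=1.0):
    """Per-target log2 ratio of a sample's depth to the reference median.

    sample_depths: read depth of the test sample at each target.
    reference_depths: one list of per-target depths per reference sample.
    pseudocount: hypothetical smoothing term that avoids log2(0).
    """
    ratios = []
    for target_index, depth in enumerate(sample_depths):
        # Median depth of the reference set at this target.
        ref_median = median(ref[target_index] for ref in reference_depths)
        ratios.append(math.log2((depth + pseudocount) / (ref_median + pseudocount)))
    return ratios


# Made-up depths for four targets: target 2 looks like a deletion,
# target 4 like a duplication, the others are close to the reference.
sample = [100, 52, 98, 205]
reference = [
    [100, 100, 100, 100],
    [110, 104, 96, 102],
    [90, 98, 102, 99],
]
ratios = log2_ratios(sample, reference)
```

Ratios near 0 suggest a normal copy number, while clearly negative or positive values hint at deletions or duplications. A reference set that resembles the test sample keeps the ratios at unaffected targets close to 0, which is how it raises the signal-to-noise ratio.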

Methods

We used WES data from the 1000 Genomes project to evaluate how various methods of reference sample set selection affect the CNV calling performance of three state-of-the-art tools: CODEX, CNVkit and exomeCopy. We evaluated two naive solutions (using all samples as the reference set, and random selection) and two clustering methods (k-means with a variable number of clusters, and k nearest neighbours (kNN) with a variable group size) to find the best-performing sample selection method.
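
The two clustering-based selection strategies can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each sample is represented by a hypothetical coverage profile (a short numeric vector), similarity is plain Euclidean distance, and k-means uses a simple deterministic farthest-point initialization. All names and values are invented for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two coverage profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_reference(profiles, sample_id, k):
    """kNN selection: the k samples closest to the sample under test."""
    others = [s for s in profiles if s != sample_id]
    others.sort(key=lambda s: distance(profiles[s], profiles[sample_id]))
    return others[:k]


def kmeans_clusters(profiles, k, iterations=20):
    """Plain Lloyd's k-means; returns a mapping sample id -> cluster index."""
    ids = sorted(profiles)
    # Assumed initialization: start from the first sample, then repeatedly
    # add the sample farthest from the centroids chosen so far.
    centroids = [list(profiles[ids[0]])]
    while len(centroids) < k:
        farthest = max(ids, key=lambda s: min(distance(profiles[s], c) for c in centroids))
        centroids.append(list(profiles[farthest]))
    assignment = {}
    for _ in range(iterations):
        # Assign every sample to its nearest centroid.
        assignment = {
            s: min(range(k), key=lambda c: distance(profiles[s], centroids[c]))
            for s in ids
        }
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [profiles[s] for s in ids if assignment[s] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def kmeans_reference(profiles, sample_id, k):
    """k-means selection: the other members of the sample's own cluster."""
    assignment = kmeans_clusters(profiles, k)
    return [s for s in sorted(profiles)
            if s != sample_id and assignment[s] == assignment[sample_id]]


# Made-up two-dimensional profiles forming two obvious groups.
profiles = {
    "S1": [1.0, 1.0], "S2": [1.1, 0.9], "S3": [0.9, 1.1],
    "S4": [5.0, 5.0], "S5": [5.1, 4.9], "S6": [4.9, 5.2],
}
knn_set = knn_reference(profiles, "S1", 2)
kmeans_set = kmeans_reference(profiles, "S4", 2)
```

In this sketch the kNN selection needs a separate neighbour search for every sample under test, whereas the k-means clustering is computed once and its clusters can be reused for all samples.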

Results and conclusions

The experiments showed that appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that a smart reduction of the reference sample set size may significantly increase the algorithms' precision while having a negligible negative effect on sensitivity. We also observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than the kNN-based solution.
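
As a reminder of how the two reported metrics relate, here is a small sketch that scores a call set against a truth set. The matching rule (any overlap on the same chromosome counts as a hit) is an assumption made for this illustration and may differ from the criterion used in the study. All intervals are invented and do not reproduce the study's results.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) intervals share any position."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def precision_sensitivity(calls, truth):
    """Precision = confirmed calls / all calls;
    sensitivity = detected true CNVs / all true CNVs."""
    confirmed_calls = sum(any(overlaps(c, t) for t in truth) for c in calls)
    detected_truth = sum(any(overlaps(t, c) for c in calls) for t in truth)
    precision = confirmed_calls / len(calls) if calls else 0.0
    sensitivity = detected_truth / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical truth set and two hypothetical call sets.
truth = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
calls_all_samples = [("chr1", 120, 180), ("chr1", 300, 350),
                     ("chr2", 10, 20), ("chr2", 60, 90)]
calls_selected_set = [("chr1", 120, 180), ("chr2", 60, 90)]

p_all, s_all = precision_sensitivity(calls_all_samples, truth)
p_sel, s_sel = precision_sensitivity(calls_selected_set, truth)
```

In this toy case the second call set drops two false positives, so precision rises from 0.5 to 1.0 while sensitivity stays at 2/3. That is the kind of trade-off the abstract describes, shown here with made-up numbers only.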

SUBMITTER: Kusmirek W

PROVIDER: S-EPMC6537193 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature

Publications

Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance.

Kuśmirek W, Szmurło A, Wiewiórka M, Nowak R, Gambin T

BMC Bioinformatics, 2019-05-28, issue 1

PMID: 31138108

Similar Datasets

Training set optimization of genomic prediction by means of EthAcc. | S-EPMC6380617 | biostudies-literature

Set cover-based methods for motif selection. | S-EPMC7703758 | biostudies-literature

Training set optimization under population structure in genomic selection. | S-EPMC4282691 | biostudies-literature

Improved selection of canonical proteins for reference proteomes. | S-EPMC11165316 | biostudies-literature

Plasmodium falciparum CNV-SNP chip optimization | 2011-04-01 | E-GEOD-28287 | biostudies-arrayexpress

Plasmodium falciparum CNV-SNP chip optimization | 2011-04-01 | GSE28287 | GEO

Set-theory based benchmarking of three different variant callers for targeted sequencing. | S-EPMC7791862 | biostudies-literature

A comparison of methods for training population optimization in genomic selection. | S-EPMC9998580 | biostudies-literature

Processing Optimization and Toxicological Evaluation of "Lead-Free" Piezoceramics: A KNN-Based Case Study. | S-EPMC8348597 | biostudies-literature

Optimization of membrane dispersion ethanol precipitation process with a set of temperature control improved equipment. | S-EPMC7643161 | biostudies-literature