
Dataset Information


EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data.


ABSTRACT: The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at the raw data level; (b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) using a heuristic algorithm; (c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries using the local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to those of SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
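
As a rough illustration of step (b), the sketch below groups CNV calls from several callers into regions by greedy reciprocal-overlap clustering. This is not the ensembleCNV heuristic, which the abstract does not specify; the `Call` record, the `assemble_cnvrs` function, the caller labels, and the 0.5 overlap threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One CNV call (hypothetical record layout, not ensembleCNV's format)."""
    sample: str
    caller: str
    chrom: str
    start: int
    end: int  # inclusive coordinate


def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Fraction of overlap relative to the longer of the two intervals."""
    inter = min(a_end, b_end) - max(a_start, b_start) + 1
    if inter <= 0:
        return 0.0
    return min(inter / (a_end - a_start + 1), inter / (b_end - b_start + 1))


def assemble_cnvrs(calls, min_overlap=0.5):
    """Greedily cluster calls into regions (assumed threshold: 0.5)."""
    regions = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for region in regions:
            same_chrom = region["chrom"] == call.chrom
            if same_chrom and reciprocal_overlap(
                region["start"], region["end"], call.start, call.end
            ) >= min_overlap:
                # Extend the region to cover the newly added call.
                region["calls"].append(call)
                region["start"] = min(region["start"], call.start)
                region["end"] = max(region["end"], call.end)
                break
        else:
            # No existing region matched, so this call seeds a new one.
            regions.append({"chrom": call.chrom, "start": call.start,
                            "end": call.end, "calls": [call]})
    return regions


example_calls = [
    Call("s1", "callerA", "chr1", 100, 200),
    Call("s1", "callerB", "chr1", 110, 210),
    Call("s2", "callerA", "chr1", 1000, 1100),
    Call("s3", "callerA", "chr2", 100, 200),
]
regions = assemble_cnvrs(example_calls)
```

A greedy pass like this depends on the order in which calls are visited and ignores CNV type (deletion versus duplication); a real pipeline would need to handle both.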

SUBMITTER: Zhang Z 

PROVIDER: S-EPMC6468244 | biostudies-literature | 2019 Apr

REPOSITORIES: biostudies-literature


Publications

EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data.

Zhongyang Zhang, Haoxiang Cheng, Xiumei Hong, Antonio F Di Narzo, Oscar Franzen, Shouneng Peng, Arno Ruusalepp, Jason C Kovacic, Johan L M Bjorkegren, Xiaobin Wang, Ke Hao

Nucleic Acids Research, 2019-04-01, issue 7



Similar Datasets

| S-EPMC4015297 | biostudies-literature
| S-EPMC2650004 | biostudies-literature
| S-EPMC3146450 | biostudies-literature
| S-EPMC3421116 | biostudies-other
| S-EPMC2784334 | biostudies-literature
| S-EPMC4254366 | biostudies-literature
| S-EPMC4410664 | biostudies-literature
| S-EPMC5430420 | biostudies-literature
| S-EPMC3626776 | biostudies-literature
| S-EPMC3380049 | biostudies-literature