
Dataset Information


The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines.


ABSTRACT: A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNP calls as correct or incorrect. The deep learning algorithms were implemented in Keras. Five algorithms were tested: (i) a basic, naïve algorithm; (ii) the naïve algorithm modified by imposing different weights on the incorrect and correct SNP classes when calculating the loss; and (iii)-(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs so that their number matched 30%, 60%, or 100% of the number of correct SNPs. The training data set comprised data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set comprised data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. For a rare-event classification problem such as detecting incorrect SNPs in NGS data, the most parsimonious naïve model and the model with SNP class weighting gave the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, yielding an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models.
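The three ingredients named in the abstract (class weighting, re-sampling of the rare class with replacement, and the F1 score) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' code. The function names are hypothetical, the inverse-frequency weighting is an assumed heuristic (the abstract does not state which weights were used), and the re-sampling draws the whole minority set anew with replacement, which is only one possible reading of the procedure.

```python
import random


def class_weights(n_correct, n_incorrect):
    """Inverse-frequency class weights (an assumed heuristic, not the
    weights used in the study). Label 0 = correct SNP, 1 = incorrect SNP.
    Each class ends up contributing the same total weight to the loss."""
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}


def oversample_incorrect(incorrect, n_correct, ratio, seed=0):
    """Draw incorrect SNPs with replacement until their number equals
    `ratio` times the number of correct SNPs (e.g. 0.3, 0.6 or 1.0)."""
    rng = random.Random(seed)
    target = round(ratio * n_correct)
    return rng.choices(incorrect, k=target)


def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall for the incorrect class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `class_weights(2_227_995, 46_920)` applies the heuristic to the training-set counts quoted above. A dictionary of this shape is what the `class_weight` argument of Keras `Model.fit` accepts, although whether the authors applied their weights that way is not stated in the abstract.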

SUBMITTER: Kotlarz K 

PROVIDER: S-EPMC7652806 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature


Publications

The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines.

Kotlarz K, Mielczarek M, Suchocki T, Czech B, Guldbrandtsen B, Szyda J

Journal of Applied Genetics, 2020-09-29, issue 4



Similar Datasets

| S-EPMC5545773 | biostudies-other
| S-EPMC9923443 | biostudies-literature
| S-EPMC8511039 | biostudies-literature
| S-EPMC5967816 | biostudies-literature
| S-EPMC8921609 | biostudies-literature
| S-EPMC5799025 | biostudies-literature
| S-EPMC8384043 | biostudies-literature
2022-12-22 | GSE218466 | GEO
| S-EPMC1976263 | biostudies-literature
| S-EPMC7459797 | biostudies-literature