Dataset Information

Recurrent miscalling of missense variation from short-read genome sequence data.

ABSTRACT:

Background

Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation.

Results

We find that sequence variation called from short-read sequence data is subject to recurrent yet intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes; they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2-300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3-5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation.
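
The simulation-based identification described above can be pictured as a tally across rounds. The following is a minimal sketch in Python, not the authors' implementation. It assumes that each round has already been resampled, realigned and recalled by external tools, and that the variants deliberately built into the simulated genome are known. The function names, the (chrom, pos, ref, alt) tuple representation and the `min_rounds` threshold are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical variant representation: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def recurrent_false_positives(
    rounds: Iterable[Iterable[Variant]],
    simulated_truth: Set[Variant],
    min_rounds: int = 2,
) -> Dict[Variant, int]:
    """Count calls that are absent from the simulated genome, per round,
    and keep those seen in at least `min_rounds` rounds (assumed threshold)."""
    counts: Counter = Counter()
    for calls in rounds:
        # Use a set so a variant is counted at most once per round.
        counts.update(set(calls) - simulated_truth)
    return {v: n for v, n in counts.items() if n >= min_rounds}


def filter_calls(observed: Iterable[Variant], recurrent: Dict[Variant, int]) -> List[Variant]:
    """Drop observed calls that match a recurrent simulated false positive."""
    return [v for v in observed if v not in recurrent]


# Toy data, invented for illustration only.
true_variant: Variant = ("chr1", 100, "A", "G")
fp_recurrent: Variant = ("chr2", 500, "C", "T")
fp_sporadic: Variant = ("chr3", 900, "G", "A")

simulation_rounds = [
    [true_variant, fp_recurrent],
    [true_variant, fp_recurrent, fp_sporadic],
    [true_variant],
]

recurrent = recurrent_false_positives(simulation_rounds, {true_variant}, min_rounds=2)
cleaned = filter_calls([true_variant, fp_recurrent, fp_sporadic], recurrent)
```

Because the miscalls are described as intermittent, a variant need not appear in every round to be flagged. A threshold below the total number of rounds reflects that, although the appropriate value is left open here.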

Conclusion

Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. The variant miscalls that arise are highly sequence-intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.

SUBMITTER: Field MA

PROVIDER: S-EPMC6631443 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

Publications

Recurrent miscalling of missense variation from short-read genome sequence data.

Field MA, Burgio G, Chuah A, Al Shekaili J, Hassan B, Al Sukaiti N, Foote SJ, Cook MC, Andrews TD

BMC Genomics, 2019-07-16, Suppl 8

PMID: 31307400

Similar Datasets

Discovery and genotyping of structural variation from long-read haploid genome sequence data.
| S-EPMC5411763 | biostudies-literature

ABySS: a parallel assembler for short read sequence data.
| S-EPMC2694472 | biostudies-literature

Non-referenced genome assembly from epigenomic short-read data.
| S-EPMC4622496 | biostudies-literature

Evaluation of whole-genome sequence data analysis approaches for short- and long-read sequencing of Mycobacterium tuberculosis.
| S-EPMC8743536 | biostudies-literature

A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data.
| S-EPMC5777982 | biostudies-literature

Paragraph: a graph-based structural variant genotyper for short-read sequence data.
| S-EPMC6921448 | biostudies-literature

Copy number variant detection in inbred strains from short read sequence data.
| S-EPMC2820678 | biostudies-literature

Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding.
| S-EPMC2752135 | biostudies-literature

Determining Streptococcus suis serotype from short-read whole-genome sequencing data.
| S-EPMC4957933 | biostudies-literature

Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies.
| S-EPMC8206509 | biostudies-literature