Dataset Information

Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences.


ABSTRACT: Despite continuous updates of the human reference genome, hundreds of unresolved gaps remain, accounting for about 5% of the total sequence length. Given the availability of whole-genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb of novel sequence to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had a high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.
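
As a rough illustration of the quantities the abstract reports, the sketch below locates gaps in a sequence string and reproduces the two percentages from the stated counts. It is not the authors' pipeline: it only assumes the common convention that assembly gaps appear as runs of `N` in a FASTA sequence, and the helper names `find_gaps` and `gap_fraction` and the toy sequence are hypothetical.

```python
import re

def find_gaps(sequence, min_len=1):
    """Return 0-based, half-open (start, end) intervals for runs of N.

    Assumption: gaps are encoded as consecutive 'N' (or 'n') characters.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]

def gap_fraction(sequence):
    """Fraction of the sequence length that falls inside N runs."""
    if not sequence:
        return 0.0
    gap_bases = sum(end - start for start, end in find_gaps(sequence))
    return gap_bases / len(sequence)

# Toy sequence (hypothetical), not real reference data.
toy = "ACGTNNNNNACGTACGTNNACGT"
gaps = find_gaps(toy)

# Counts taken from the abstract: 132 of 783 gaps closed, 43 of 132 polymorphic.
pct_closed = round(100 * 132 / 783, 1)
pct_polymorphic = round(100 * 43 / 132, 1)

print(gaps, gap_fraction(toy), pct_closed, pct_polymorphic)
```

A real analysis would read chromosome sequences from a FASTA file and would also need the alignment step that anchors assembly contigs to both flanks of each gap, which this sketch does not attempt.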

SUBMITTER: Zhao T 

PROVIDER: S-EPMC7407462 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature

Publications

Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences.

Zhao Tingting, Duan Zhongqu, Genchev Georgi Z, Lu Hui

G3 (Bethesda, Md.), 2020-08-05, issue 8


Similar Datasets

S-EPMC2718494 | biostudies-literature
GSE9075 | GEO | 2007-12-23
S-EPMC6796347 | biostudies-literature
S-EPMC3651407 | biostudies-literature
S-EPMC442158 | biostudies-literature
S-EPMC8478193 | biostudies-literature
S-EPMC3784957 | biostudies-literature
S-EPMC4091766 | biostudies-literature
S-EPMC4470636 | biostudies-literature
S-EPMC4192373 | biostudies-literature