Dataset Information

Towards a reference genome that captures global genetic diversity.


ABSTRACT: The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.

SUBMITTER: Wong KHY 

PROVIDER: S-EPMC7599213 | biostudies-literature | 2020 Oct

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC6382770 | biostudies-literature
| S-EPMC10868604 | biostudies-literature
| S-EPMC6056549 | biostudies-literature
| S-EPMC4750478 | biostudies-literature
| S-EPMC10197727 | biostudies-literature
| S-EPMC11019364 | biostudies-literature
| S-EPMC5079299 | biostudies-literature
| S-EPMC5123046 | biostudies-literature
| S-EPMC5123671 | biostudies-literature
| S-EPMC4490614 | biostudies-literature