
Dataset Information


Efficient detection and assembly of non-reference DNA sequences with synthetic long reads.


ABSTRACT: Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. A lion's share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (aka linked-read) technologies have great potential for the characterization of NRSs. While synthetic long reads require less input DNA than long-read datasets, they are algorithmically more challenging to use. Except for computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose a novel integrated alignment-based and local assembly-based algorithm, Novel-X, that uses the barcode information encoded in synthetic long reads to improve the detection of such events without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many non-reference sequences that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16 691 NRS insertions of size > 300 bp (total length 18.2 Mb). Many of them are population specific or may have a functional impact.

SUBMITTER: Meleshko D 

PROVIDER: S-EPMC9561269 | biostudies-literature | 2022 Oct

REPOSITORIES: biostudies-literature


Publications

Efficient detection and assembly of non-reference DNA sequences with synthetic long reads.

Dmitry Meleshko, Rui Yang, Patrick Marks, Stephen Williams, Iman Hajirasouliha

Nucleic Acids Research, 2022 Oct 1, issue 18



Similar Datasets

S-EPMC6612831 | biostudies-literature
S-EPMC9700288 | biostudies-literature
S-EPMC3592409 | biostudies-literature
S-EPMC9946810 | biostudies-literature
S-EPMC8193415 | biostudies-literature
S-EPMC11601376 | biostudies-literature
S-EPMC6877557 | biostudies-literature
S-EPMC4460631 | biostudies-literature
S-EPMC7889865 | biostudies-literature
S-EPMC4154752 | biostudies-literature