Dataset Information

RResolver: efficient short-read repeat resolution within ABySS.


ABSTRACT:

Background

De novo genome assembly is essential to modern genomics studies. Because it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k - 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value involves a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values provide good connectivity in regions with lower coverage and higher values increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes.
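The graph construction described above can be illustrated with a minimal Python sketch. It is not ABySS code: the function name `build_de_bruijn` and the toy reads are hypothetical, and the sketch ignores reverse complements, sequencing errors, and coverage, all of which a real assembler must handle. It only shows nodes as k-mers and edges between nodes whose suffix and prefix overlap by k - 1 bases.

```python
def build_de_bruijn(reads, k):
    """Return a dict mapping each k-mer node to its sorted successor nodes.

    An edge u -> v exists when the last k - 1 bases of u equal the
    first k - 1 bases of v and both k-mers occur in the reads.
    """
    # Collect every k-mer that appears in any read; these are the nodes.
    nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

    graph = {}
    for node in nodes:
        suffix = node[1:]  # the k - 1 bases a successor must start with
        graph[node] = sorted(suffix + base for base in "ACGT" if suffix + base in nodes)
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous walks."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "CGTAG"]  # hypothetical reads for illustration
    g = build_de_bruijn(toy_reads, k=3)
    for node in sorted(g):
        print(node, "->", g[node])
    print("branching:", branching_nodes(g))
```

In this toy input the node GTA has two successors (TAC and TAG), which is the kind of ambiguity that stops unambiguous merging and that a larger k, where coverage allows, can sometimes remove.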

Results

Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%.
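As a rough illustration of the idea of checking path support against a Bloom filter of reads, the following Python sketch may help. It is an assumption-laden toy, not the RResolver implementation: the `BloomFilter` class, the function names, the use of SHA-256 as a hash, the window size `r`, and the `min_support` threshold are all hypothetical choices made for this example. The abstract only states that a Bloom filter of sequencing reads is used to evaluate path support at branching points and that paths with insufficient support are removed.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def index_reads(reads, r):
    """Insert every length-r substring of every read into a Bloom filter."""
    bloom = BloomFilter()
    for read in reads:
        for i in range(len(read) - r + 1):
            bloom.add(read[i:i + r])
    return bloom


def path_support(path_seq, bloom, r):
    """Count the length-r windows of a candidate path found in the filter."""
    return sum(path_seq[i:i + r] in bloom for i in range(len(path_seq) - r + 1))


def supported_paths(paths, bloom, r, min_support):
    """Keep only candidate paths whose support reaches the threshold."""
    return [p for p in paths if path_support(p, bloom, r) >= min_support]


if __name__ == "__main__":
    # Hypothetical reads sharing the repeat ACGT with different flanks.
    reads = ["AAACGTTT", "CCACGTGG"]
    bloom = index_reads(reads, r=6)
    # Candidate walks through the repeat: two real, two chimeric.
    candidates = ["AAACGTTT", "AAACGTGG", "CCACGTGG", "CCACGTTT"]
    print(supported_paths(candidates, bloom, r=6, min_support=3))
```

Because the window size `r` is larger than the repeat, windows spanning both flanks exist only for paths that actually occur in the reads, so the chimeric walks score lower. A Bloom filter can report false positives, so a real tool would need a threshold that tolerates them.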

Conclusions

RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, it lets all downstream ABySS algorithms work with a more accurate and less complex representation of the genome. The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver.

SUBMITTER: Nikolic V 

PROVIDER: S-EPMC9215042 | biostudies-literature | 2022 Jun

REPOSITORIES: biostudies-literature

Publications

RResolver: efficient short-read repeat resolution within ABySS.

Nikolić V, Afshinfard A, Chu J, Wong J, Coombe L, Nip KM, Warren RL, Birol I

BMC Bioinformatics, 2022-06-21, issue 1


Similar Datasets

| S-EPMC2694472 | biostudies-literature
| S-EPMC5411769 | biostudies-literature
| S-EPMC3044310 | biostudies-literature
| S-EPMC9174224 | biostudies-literature
| S-EPMC2656530 | biostudies-literature
| S-EPMC2902515 | biostudies-literature
2008-10-24 | GSE13322 | GEO
| S-EPMC3626529 | biostudies-literature
| S-EPMC7850483 | biostudies-literature
| S-EPMC10809544 | biostudies-literature