Dataset Information

Comparative genome analysis using sample-specific string detection in accurate long reads.

ABSTRACT:

Motivation

Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants).

Results

We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('sample-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach enables comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data).
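To make the idea of a sample-specific string concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a deliberate simplification: fixed-length substrings (k-mers) compared by plain set difference, with reads given as in-memory strings over A, C, G and T. The function names and the parameter `k` are hypothetical and do not come from the PingPong code base.

```python
# Illustrative only: a naive, fixed-length stand-in for sample-specific strings.
# All names below are hypothetical; this is not the PingPong implementation.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(reads, k):
    """Collect every k-mer in the reads, keeping one canonical strand per k-mer."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Reads can come from either strand, so store the smaller of the
            # k-mer and its reverse complement.
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def sample_specific_kmers(target_reads, control_reads, k):
    """Return k-mers seen in the target reads but never in the control reads."""
    return canonical_kmers(target_reads, k) - canonical_kmers(control_reads, k)


if __name__ == "__main__":
    control = ["ACGTACGGTCA"]
    target = ["ACGTACCGTCA"]  # differs from the control read by one base
    for kmer in sorted(sample_specific_kmers(target, control, k=5)):
        print(kmer)
```

Only k-mers overlapping the differing base can appear in the output, which is the sense in which such strings point at variation without mapping reads to a reference. A set-based approach like this would not scale to whole-genome long-read data and ignores sequencing errors, so it should be read as a conceptual aid only.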

Availability and implementation

Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

SUBMITTER: Khorsand P

PROVIDER: S-EPMC9710709 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature


Publications

Comparative genome analysis using sample-specific string detection in accurate long reads.

Parsoa Khorsand, Luca Denti, Paola Bonizzoni, Rayan Chikhi, Fereydoun Hormozdiari

Bioinformatics Advances. 2021-05-31; 1


PMID: 36700094

Similar Datasets

Accurate isoform discovery with IsoQuant using long reads. | S-EPMC10344776 | biostudies-literature

Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads. | S-EPMC10996739 | biostudies-literature

Hap10: reconstructing accurate and long polyploid haplotypes using linked reads. | S-EPMC7302376 | biostudies-literature

Fast and accurate de novo genome assembly from long uncorrected reads. | S-EPMC5411768 | biostudies-literature

Analyzing rare mutations in metagenomes assembled using long and accurate reads. | S-EPMC9808630 | biostudies-literature

Accurate indel prediction using paired-end short reads. | S-EPMC3614465 | biostudies-other

Accurate self-correction of errors in long reads using de Bruijn graphs. | S-EPMC5351550 | biostudies-literature

Fast and accurate mapping of long reads to complete genome assemblies with VerityMap. | S-EPMC9808623 | biostudies-literature

Haplotype threading: accurate polyploid phasing from long reads. | S-EPMC7504856 | biostudies-literature

Whole-genome haplotyping using long reads and statistical methods. | S-EPMC4073643 | biostudies-literature