
Dataset Information


scanPAV: a pipeline for extracting presence-absence variations in genome pairs.


ABSTRACT: Recent advances in genome sequencing have resulted in an exponential increase in the number of sequenced human and non-human genomes. The ever-increasing number of assemblies generated by novel de novo pipelines and strategies demands new software to evaluate assembly quality and completeness. One way to determine the completeness of an assembly is to detect its presence-absence variations (PAVs) with respect to a reference, where PAVs between two assemblies are defined as the sequences present in one assembly but entirely missing in the other. Beyond assembly error or technology bias, PAVs can also reveal real genome polymorphism, a consequence of species or individual evolution, or horizontal transfer from viruses and bacteria. We present scanPAV, a pipeline for pairwise assembly comparison that identifies and extracts sequences present in one assembly but not the other. In this note we use the GRCh38 reference assembly to assess the completeness of six human genome assemblies from various assembly strategies and sequencing technologies, including Illumina short reads, 10X Genomics linked-reads, PacBio and Oxford Nanopore long reads, and Bionano optical maps. We also discuss the PAV polymorphism of seven Tasmanian devil whole-genome assemblies of normal animal tissues and devil facial tumour 1 (DFT1) and 2 (DFT2) samples, and the identification of bacterial sequences as contamination in some of the tumour assemblies. The pipeline is available under the MIT License at https://github.com/wtsi-hpag/scanPAV. Contact: zemin.ning@sanger.ac.uk, francesca.giordano@sanger.ac.uk. A supplementary note is available at Bioinformatics online.
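To make the PAV definition in the abstract concrete, here is a minimal toy sketch in Python. It is not the scanPAV pipeline or its algorithm: it uses exact k-mer lookup on tiny in-memory sequences as a stand-in for a real alignment-based comparison. The function names (`kmers`, `absent_sequences`), the parameters `k` and `min_missing`, and the example sequences are all hypothetical, and reverse-complement matches are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def absent_sequences(query, reference, k=5, min_missing=1.0):
    """Toy PAV call: names of query sequences missing from the reference.

    query, reference: dicts mapping sequence name -> sequence string.
    A query sequence is reported when the fraction of its distinct k-mers
    that never occur in the reference is at least min_missing
    (1.0 means "entirely missing"). Illustrative assumption only.
    """
    # Pool the k-mers of every reference sequence.
    ref_kmers = set()
    for seq in reference.values():
        ref_kmers |= kmers(seq, k)

    absent = []
    for name, seq in query.items():
        q = kmers(seq, k)
        if not q:
            continue  # sequence shorter than k: nothing to compare
        missing = sum(1 for km in q if km not in ref_kmers) / len(q)
        if missing >= min_missing:
            absent.append(name)
    return absent


# Hypothetical example data: ctg1 is contained in chr1, ctg2 is not.
assembly_a = {"ctg1": "ACGTACGTTAGC", "ctg2": "GGGGCCCCAAAA"}
assembly_b = {"chr1": "TTACGTACGTTAGCAA"}

present_in_a_only = absent_sequences(assembly_a, assembly_b)
present_in_b_only = absent_sequences(assembly_b, assembly_a)
```

Running the comparison in both directions mirrors the pairwise nature of PAVs: a sequence counts as a PAV only with respect to the assembly it is compared against.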

SUBMITTER: Giordano F 

PROVIDER: S-EPMC6129304 | biostudies-other | 2018 Mar

REPOSITORIES: biostudies-other


Publications

scanPAV: a pipeline for extracting presence-absence variations in genome pairs.

Francesca Giordano, Maximilian R. Stammnitz, Elizabeth P. Murchison, Zemin Ning

Bioinformatics (Oxford, England), 2018-09-01, issue 17



Similar Datasets

| S-EPMC8990847 | biostudies-literature
| S-EPMC7615310 | biostudies-literature
| S-EPMC6882866 | biostudies-literature
| S-EPMC10390606 | biostudies-literature
| S-EPMC7514148 | biostudies-literature
| S-EPMC7372895 | biostudies-literature
| S-EPMC6058306 | biostudies-literature
| S-EPMC7653742 | biostudies-literature
| S-EPMC2727387 | biostudies-literature
| S-EPMC8156620 | biostudies-literature