Dataset Information

The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies.

ABSTRACT: BACKGROUND: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories. RESULTS: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem. CONCLUSIONS: The ExonMatchSolver pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred.
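
To give a feel for what a paralog-to-contig assignment looks like as a 0/1 optimization, here is a minimal toy sketch. It is not the ExonMatchSolver formulation: the abstract does not state the objective or the constraints, so everything below is an assumption made for illustration. The sketch assumes each contig is assigned to at most one paralog, that no exon of a paralog may be covered by two contigs, and that the total hit score is maximized. The contig names, exon sets and scores are hypothetical, and exhaustive enumeration stands in for an ILP solver.

```python
from itertools import product

# Hypothetical toy data (made-up numbers, not from the paper):
# scores[contig][paralog] = score of the exon hits on that contig
# when compared against that paralog.
scores = {
    "contig1": {"A": 90, "B": 40},
    "contig2": {"A": 35, "B": 80},
    "contig3": {"A": 60, "B": 55},
}
# Hypothetical: which exons of the gene each contig carries.
exons = {
    "contig1": {1, 2},
    "contig2": {1, 2},
    "contig3": {3},
}


def best_assignment(scores, exons, paralogs):
    """Enumerate every assignment of contigs to paralogs (or to None).

    Assumed constraints: a contig goes to at most one paralog, and each
    (paralog, exon) pair is covered by at most one contig.
    Returns (best total score, {contig: paralog or None}).
    """
    contigs = sorted(scores)
    best_score, best = -1, None
    for choice in product([None, *paralogs], repeat=len(contigs)):
        used = set()  # (paralog, exon) pairs already covered
        total = 0
        feasible = True
        for contig, paralog in zip(contigs, choice):
            if paralog is None:
                continue  # contig left unassigned
            pairs = {(paralog, exon) for exon in exons[contig]}
            if used & pairs:  # an exon of this paralog is covered twice
                feasible = False
                break
            used |= pairs
            total += scores[contig][paralog]
        if feasible and total > best_score:
            best_score, best = total, dict(zip(contigs, choice))
    return best_score, best


if __name__ == "__main__":
    print(best_assignment(scores, exons, ["A", "B"]))
```

With this toy input, contig1 and contig2 carry the same exons and therefore cannot both be assigned to the same paralog, while contig3 can complete either gene model. The enumeration visits (number of paralogs + 1) to the power of the number of contigs candidates, which is only workable for tiny inputs; that growth is one reason an ILP solver is the more practical choice for an NP-complete problem of this kind.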

SUBMITTER: Indrischek H

PROVIDER: S-EPMC4765045 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature

Publications

The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies.

Indrischek H, Wieseke N, Stadler PF, Prohaska SJ

Algorithms for Molecular Biology: AMB, 2016-02-24

PMID: 26913054

Similar Datasets

Phables: from fragmented assemblies to high-quality bacteriophage genomes.
| S-EPMC10563150 | biostudies-literature

Phables: from fragmented assemblies to high-quality bacteriophage genomes.
| S-EPMC10104058 | biostudies-literature

High-Quality Genome-Scale Models From Error-Prone, Long-Read Assemblies.
| S-EPMC7688782 | biostudies-literature

BESST--efficient scaffolding of large fragmented assemblies.
| S-EPMC4262078 | biostudies-literature

Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ).
| S-EPMC4298792 | biostudies-literature

Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies.
| S-EPMC6267036 | biostudies-literature

Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies.
| S-EPMC1160117 | biostudies-literature

msmsEval: tandem mass spectral quality assignment for high-throughput proteomics.
| S-EPMC1803797 | biostudies-literature

Ten new high-quality genome assemblies for diverse bioenergy sorghum genotypes.
| S-EPMC9846640 | biostudies-literature

High-quality draft assemblies of mammalian genomes from massively parallel sequence data.
| S-EPMC3029755 | biostudies-literature