Dataset Information

CDSbank: taxonomy-aware extraction, selection, renaming and formatting of protein-coding DNA or amino acid sequences.

ABSTRACT:

Background

Protein-coding DNA sequences and their corresponding amino acid sequences are routinely used to study relationships between sequence, structure, function, and evolution. The rapidly growing size of sequence databases increases the power of such comparative analyses, but it also makes it harder to prepare high-quality sequence data sets with control over redundancy, quality, completeness, formatting, and labeling. Software tools exist for some individual steps in this process, but manual intervention remains a common and time-consuming necessity.

Description

CDSbank is a database that stores both the protein-coding DNA sequence (CDS) and the amino acid sequence for each protein annotated in GenBank. CDSbank also stores GenBank feature annotation, a flag to indicate incomplete 5' and 3' ends, full taxonomic data, and a heuristic to rank the scientific interest of each species. This rich information allows fully automated data set preparation with a level of sophistication that aims to meet or exceed manual processing. Defaults ensure ease of use for typical scenarios while allowing great flexibility when needed. Access is via a free web server at http://hazeslab.med.ualberta.ca/CDSbank/.
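As a rough illustration of how the stored completeness flags and taxonomy could drive automated selection, the following Python sketch filters a list of records by taxon and by completeness. The `CdsRecord` class, its field names, and `select_records` are hypothetical and invented for this example; they do not reflect CDSbank's actual schema or interface, which is accessed through the web server.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CdsRecord:
    """Hypothetical record layout; not CDSbank's real schema."""
    accession: str
    cds: str                    # protein-coding DNA sequence
    protein: str                # corresponding amino acid sequence
    lineage: Tuple[str, ...]    # taxonomic lineage, root first, species last
    partial_5prime: bool = False  # True if the 5' end is incomplete
    partial_3prime: bool = False  # True if the 3' end is incomplete


def select_records(records: List[CdsRecord], taxon: str,
                   require_complete: bool = True) -> List[CdsRecord]:
    """Keep records whose lineage contains `taxon`.

    When `require_complete` is True, records flagged as incomplete
    at either end are dropped.
    """
    selected = []
    for rec in records:
        if taxon not in rec.lineage:
            continue
        if require_complete and (rec.partial_5prime or rec.partial_3prime):
            continue
        selected.append(rec)
    return selected


# Toy data with made-up accessions, for illustration only.
records = [
    CdsRecord("REC1", "ATGGCTTAA", "MA", ("Eukaryota", "Mammalia", "Homo sapiens")),
    CdsRecord("REC2", "GCTTAA", "A", ("Eukaryota", "Mammalia", "Mus musculus"),
              partial_5prime=True),
    CdsRecord("REC3", "ATGGGTTAA", "MG", ("Bacteria", "Escherichia coli")),
]
mammal_complete = select_records(records, "Mammalia")
```

Here `mammal_complete` holds only the complete mammalian record; passing `require_complete=False` would also keep the 5'-truncated one.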

Conclusions

CDSbank presents a user-friendly web server to download, filter, format, and name large sequence data sets. Common usage scenarios can be accessed via pre-programmed default choices, while optional sections give full control over the processing pipeline. Particular strengths are the ability to extract protein-coding DNA sequences just as easily as amino acid sequences, full access to taxonomy for labeling and filtering, awareness of incomplete sequences, and the ability to take one protein sequence and extract all synonymous CDS or identical protein sequences in other species. Finally, CDSbank can also create labeled property files to, for instance, annotate or re-label phylogenetic trees.
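To make the notion of synonymous CDS concrete (different DNA sequences that encode the same protein), here is a minimal Python sketch that translates each CDS with the standard genetic code and groups labels by the resulting protein. It is an illustration only, not CDSbank's implementation; the function names and the toy sequences are assumptions made for this example.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS; unknown codons become 'X', one trailing stop is dropped."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


def group_synonymous(labelled_cds):
    """Map each protein sequence to the labels of all CDS that encode it."""
    groups = defaultdict(list)
    for label, cds in labelled_cds:
        groups[translate(cds)].append(label)
    return dict(groups)


# Toy input with made-up labels: the first two CDS differ at a synonymous site.
toy = [
    ("species_a", "ATGGCTTAA"),
    ("species_b", "ATGGCCTAA"),
    ("species_c", "ATGGGTTAA"),
]
groups = group_synonymous(toy)
```

In this toy input, `species_a` and `species_b` land in the same group because GCT and GCC both encode alanine, while `species_c` encodes a different protein.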

SUBMITTER: Hazes B

PROVIDER: S-EPMC3942066 | biostudies-literature | 2014 Feb

REPOSITORIES: biostudies-literature

Publications

CDSbank: taxonomy-aware extraction, selection, renaming and formatting of protein-coding DNA or amino acid sequences.

Hazes Bart B

BMC Bioinformatics, 2014 Feb 28

PMID: 24580755

Similar Datasets

CodonTest: modeling amino acid substitution preferences in coding sequences.
| S-EPMC2924240 | biostudies-literature

qNABpredict: Quick, accurate, and taxonomy-aware sequence-based prediction of content of nucleic acid binding amino acids.
| S-EPMC9798252 | biostudies-literature

TADA: taxonomy-aware dataset aggregator.
| S-EPMC10733731 | biostudies-literature

Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles.
| S-EPMC2842053 | biostudies-literature

NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents.
| S-EPMC5106001 | biostudies-literature

Taxonomy-aware feature engineering for microbiome classification.
| S-EPMC6003080 | biostudies-literature

Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences.
| S-EPMC2373523 | biostudies-literature

COMIT: identification of noncoding motifs under selection in coding sequences.
| S-EPMC3091326 | biostudies-literature

Distinguishing proteins from arbitrary amino acid sequences.
| S-EPMC4302309 | biostudies-literature

The role of long non-coding RNAs in genome formatting and expression.
| S-EPMC4413816 | biostudies-literature