Dataset Information

NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents.

ABSTRACT: The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. In order to accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have developed into a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
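The maximum-entropy idea in the abstract can be illustrated with a minimal sketch. This is not the NullSeq package or its API; it is an assumption-laden illustration in which the amino acid sequence is held fixed and each synonymous codon is weighted by exp(lam * GC count), with lam chosen by bisection so that the expected GC fraction matches the target. All function names (`codon_weights`, `expected_gc`, `solve_lambda`, `random_cds`) are hypothetical.

```python
import math
import random

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = {}
for codon, aa in zip(CODONS, AMINO):
    CODON_TABLE.setdefault(aa, []).append(codon)


def gc_count(codon):
    """Number of G or C bases in a codon (0 to 3)."""
    return sum(base in "GC" for base in codon)


def codon_weights(aa, lam):
    """Synonymous codons for `aa` with probabilities proportional to exp(lam * GC)."""
    codons = CODON_TABLE[aa]
    weights = [math.exp(lam * gc_count(c)) for c in codons]
    total = sum(weights)
    return codons, [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence sampled for `protein` at `lam`."""
    total = 0.0
    for aa in protein:
        codons, probs = codon_weights(aa, lam)
        total += sum(p * gc_count(c) for c, p in zip(codons, probs))
    return total / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, iterations=200):
    """Bisection on lam; expected GC is non-decreasing in lam.

    Assumes target_gc lies within the range attainable for this protein;
    otherwise the result saturates near one of the bounds.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def random_cds(protein, target_gc, rng=random):
    """Sample one random coding sequence encoding `protein` near `target_gc`."""
    lam = solve_lambda(protein, target_gc)
    picked = []
    for aa in protein:
        codons, probs = codon_weights(aa, lam)
        picked.append(rng.choices(codons, weights=probs)[0])
    return "".join(picked)
```

Under these assumptions, calling `random_cds("MKLVAGSTPR" * 20, 0.5)` returns a 600-nucleotide sequence that translates back to the input protein and whose GC fraction equals 0.5 in expectation; individual samples fluctuate around the target. Setting lam to zero recovers uniform synonymous codon choice, which is the unconstrained maximum-entropy case.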

SUBMITTER: Liu SS

PROVIDER: S-EPMC5106001 | biostudies-literature | 2016 Nov

REPOSITORIES: biostudies-literature

Publications

NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents.

Sophia S Liu, Adam J Hockenberry, Andrea Lancichinetti, Michael C Jewett, Luís A N Amaral

PLoS Computational Biology, 2016 Nov 11, issue 11

PMID: 27835644

Similar Datasets

CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents. | S-EPMC7127556 | biostudies-literature

Modeling compositional dynamics based on GC and purine contents of protein-coding sequences. | S-EPMC2989939 | biostudies-literature

CodonTest: modeling amino acid substitution preferences in coding sequences. | S-EPMC2924240 | biostudies-literature

In Silico Engineering of Synthetic Binding Proteins from Random Amino Acid Sequences. | S-EPMC6348295 | biostudies-literature

An integrated Java tool for generating amino acid sequence alignments with mapped secondary structure elements. | S-EPMC4327748 | biostudies-literature

CodonShuffle: a tool for generating and analyzing synonymously mutated sequences. | S-EPMC5014483 | biostudies-literature

IMGT/Collier-de-Perles: a two-dimensional visualization tool for amino acid domain sequences. | S-EPMC3621776 | biostudies-literature

Protein tolerance to random amino acid change. | S-EPMC438954 | biostudies-literature

CDSbank: taxonomy-aware extraction, selection, renaming and formatting of protein-coding DNA or amino acid sequences. | S-EPMC3942066 | biostudies-literature