Dataset Information

Sma3s: a three-step modular annotator for large sequence datasets.

ABSTRACT: Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.
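The progressive, three-module design and the sensitivity/specificity figures mentioned in the abstract can be illustrated with a small sketch. This is not the Sma3s implementation: the function names, the toy lookup tables, and the rule that a query is passed to the next module only when the previous one returned nothing are all assumptions made for illustration, since the abstract does not spell out these details.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical module type: takes a query ID and returns a set of
# annotation terms (empty when the module cannot annotate the query).
Module = Callable[[str], Set[str]]


def annotate_progressively(queries: List[str],
                           modules: List[Module]) -> Dict[str, Set[str]]:
    """Try each module in order; keep the first non-empty result (assumed rule)."""
    results: Dict[str, Set[str]] = {}
    for query in queries:
        results[query] = set()
        for module in modules:
            terms = module(query)
            if terms:
                results[query] = terms
                break
    return results


def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Standard definitions: TP / (TP + FN) and TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Toy stand-ins for the three kinds of evidence named in the abstract.
close_homologues = {"seq1": {"kinase"}}
orthologues = {"seq1": {"transferase"}, "seq2": {"transporter"}}
enriched_terms = {"seq3": {"membrane"}}

modules: List[Module] = [
    lambda q: close_homologues.get(q, set()),
    lambda q: orthologues.get(q, set()),
    lambda q: enriched_terms.get(q, set()),
]

annotations = annotate_progressively(["seq1", "seq2", "seq3", "seq4"], modules)
sens, spec = sensitivity_specificity(tp=85, fp=15, tn=85, fn=15)
```

In this toy run, `seq1` is annotated by the first module, `seq2` by the second, `seq3` by the third, and `seq4` stays unannotated. The counts passed to `sensitivity_specificity` are made-up numbers chosen only to show the arithmetic.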

SUBMITTER: Munoz-Merida A

PROVIDER: S-EPMC4131829 | biostudies-literature | 2014 Aug

REPOSITORIES: biostudies-literature

Publications

Sma3s: a three-step modular annotator for large sequence datasets.

Muñoz-Mérida A, Viguera E, Claros MG, Trelles O, Pérez-Pulido AJ

DNA Research: an international journal for rapid publication of reports on genes and genomes. 2014 Feb 5; issue 4

PMID: 24501397

Similar Datasets

Scaling statistical multiple sequence alignment to large datasets.
| S-EPMC5123300 | biostudies-literature

CPGAVAS2, an integrated plastome sequence annotator and analyzer.
| S-EPMC6602467 | biostudies-literature

SHEPHARD: a modular and extensible software architecture for analyzing and annotating large protein datasets.
| S-EPMC10423030 | biostudies-literature

Alignment-Annotator web server: rendering and annotating sequence alignments.
| S-EPMC4086088 | biostudies-literature

LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins.
| S-EPMC7203729 | biostudies-literature

Modular Two-Step Route to Sulfondiimidamides.
| S-EPMC9264364 | biostudies-literature

TIMPs of parasitic helminths - a large-scale analysis of high-throughput sequence datasets.
| S-EPMC3679795 | biostudies-literature

DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets.
| S-EPMC9621593 | biostudies-literature

CRMnet: A deep learning model for predicting gene expression from large regulatory sequence datasets.
| S-EPMC10043243 | biostudies-literature

K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets.
| S-EPMC5320600 | biostudies-literature