Dataset Information

Uniclust databases of clustered and deeply annotated protein sequences and alignments.

ABSTRACT: We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.

SUBMITTER: Mirdita M

PROVIDER: S-EPMC5614098 | biostudies-literature | 2017 Jan

REPOSITORIES: biostudies-literature

Publications

Uniclust databases of clustered and deeply annotated protein sequences and alignments.

Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M

Nucleic Acids Research, 2016-11-28, issue D1

PMID: 27899574

Similar Datasets

Searching databases of conserved sequence regions by aligning protein multiple-alignments.
| S-EPMC146152 | biostudies-other

Clustal Omega for making accurate alignments of many protein sequences.
| S-EPMC5734385 | biostudies-literature

KinFin: Software for Taxon-Aware Analysis of Clustered Protein Sequences.
| S-EPMC5633385 | biostudies-literature

MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments.
| S-EPMC7218316 | biostudies-literature

Collection of Annotated Acinetobacter Genome Sequences.
| S-EPMC10019184 | biostudies-literature

Vespucci: a system for building annotated databases of nascent transcripts.
| S-EPMC3936758 | biostudies-literature

Annotating RNA motifs in sequences and alignments.
| S-EPMC4333381 | biostudies-literature

DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches.
| S-EPMC102675 | biostudies-literature

An ensemble approach for large-scale identification of protein-protein interactions using the alignments of multiple sequences.
| S-EPMC5354898 | biostudies-literature

Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments.
| S-EPMC8459479 | biostudies-literature