Dataset Information

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

ABSTRACT: The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

SUBMITTER: Li W

PROVIDER: S-EPMC7779008 | biostudies-literature | 2021 Jan

REPOSITORIES: biostudies-literature

Publications

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F

Nucleic Acids Research, 2021 Jan 1, issue D1

PMID: 33270901

Similar Datasets

RefSeq: an update on prokaryotic genome annotation and curation. | S-EPMC5753331 | biostudies-literature

RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes. | S-EPMC10767926 | biostudies-literature

NCBI prokaryotic genome annotation pipeline. | S-EPMC5001611 | biostudies-literature

RefSeq curation and annotation of stop codon recoding in vertebrates. | S-EPMC6344875 | biostudies-literature

NCBI RefSeq: reference sequence standards through 25 years of curation and annotation. | S-EPMC11701664 | biostudies-literature

RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates. | S-EPMC4551939 | biostudies-literature

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. | S-EPMC5860143 | biostudies-literature

MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation. | S-EPMC4828917 | biostudies-literature

Mouse genome annotation by the RefSeq project. | S-EPMC4602073 | biostudies-literature

EcoGene-RefSeq: EcoGene tools applied to the RefSeq prokaryotic genomes. | S-EPMC3712216 | biostudies-literature