
Dataset Information


Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.


ABSTRACT: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

SUBMITTER: O'Leary NA 

PROVIDER: S-EPMC4702849 | biostudies-literature | 2016 Jan

REPOSITORIES: biostudies-literature


Publications

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

O'Leary Nuala A, Wright Mathew W, Brister J Rodney, Ciufo Stacy, Haddad Diana, McVeigh Rich, Rajput Bhanu, Robbertse Barbara, Smith-White Brian, Ako-Adjei Danso, Astashyn Alexander, Badretdin Azat, Bao Yiming, Blinkova Olga, Brover Vyacheslav, Chetvernin Vyacheslav, Choi Jinna, Cox Eric, Ermolaeva Olga, Farrell Catherine M, Goldfarb Tamara, Gupta Tripti, Haft Daniel, Hatcher Eneida, Hlavina Wratko, Joardar Vinita S, Kodali Vamsi K, Li Wenjun, Maglott Donna, Masterson Patrick, McGarvey Kelly M, Murphy Michael R, O'Neill Kathleen, Pujar Shashikant, Rangwala Sanjida H, Rausch Daniel, Riddick Lillian D, Schoch Conrad, Shkeda Andrei, Storz Susan S, Sun Hanzhen, Thibaud-Nissen Francoise, Tolstoy Igor, Tully Raymond E, Vatsan Anjana R, Wallin Craig, Webb David, Wu Wendy, Landrum Melissa J, Kimchi Avi, Tatusova Tatiana, DiCuccio Michael, Kitts Paul, Murphy Terence D, Pruitt Kim D

Nucleic Acids Research, 2015-11-08, D1



Similar Datasets

| S-EPMC2686572 | biostudies-literature
| S-EPMC29787 | biostudies-literature
| S-EPMC5001611 | biostudies-literature
| S-EPMC3245000 | biostudies-literature
| S-EPMC5824777 | biostudies-literature
| S-EPMC4602073 | biostudies-literature
| S-EPMC4502323 | biostudies-literature
| S-EPMC3965069 | biostudies-literature
| S-EPMC3965018 | biostudies-literature
| S-EPMC8016462 | biostudies-literature