Dataset Information

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

ABSTRACT: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

SUBMITTER: O'Leary NA

PROVIDER: S-EPMC4702849 | biostudies-literature | 2016 Jan

REPOSITORIES: biostudies-literature


Publications

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

O'Leary Nuala A, Wright Mathew W, Brister J Rodney, Ciufo Stacy, Haddad Diana, McVeigh Rich, Rajput Bhanu, Robbertse Barbara, Smith-White Brian, Ako-Adjei Danso, Astashyn Alexander, Badretdin Azat, Bao Yiming, Blinkova Olga, Brover Vyacheslav, Chetvernin Vyacheslav, Choi Jinna, Cox Eric, Ermolaeva Olga, Farrell Catherine M, Goldfarb Tamara, Gupta Tripti, Haft Daniel, Hatcher Eneida, Hlavina Wratko, Joardar Vinita S, Kodali Vamsi K, Li Wenjun, Maglott Donna, Masterson Patrick, McGarvey Kelly M, Murphy Michael R, O'Neill Kathleen, Pujar Shashikant, Rangwala Sanjida H, Rausch Daniel, Riddick Lillian D, Schoch Conrad, Shkeda Andrei, Storz Susan S, Sun Hanzhen, Thibaud-Nissen Francoise, Tolstoy Igor, Tully Raymond E, Vatsan Anjana R, Wallin Craig, Webb David, Wu Wendy, Landrum Melissa J, Kimchi Avi, Tatusova Tatiana, DiCuccio Michael, Kitts Paul, Murphy Terence D, Pruitt Kim D

Nucleic Acids Research, 2015 Nov 8, issue D1


PMID: 26553804

Similar Datasets

Project description:

Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.

Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.

Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.

