Dataset Information

Loose ends: almost one in five human genes still have unresolved coding status.

ABSTRACT: Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases is annotated differently across the three sets. We have carried out an in-depth investigation of the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggest that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.
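
The fractions quoted in the title and abstract can be checked from the gene counts given above. The sketch below is only an illustration: the variable and function names are hypothetical, and it assumes that the title's "almost one in five" pools the 2764 disputed genes with the further 1470 doubtful ones, which the abstract implies but does not state outright.

```python
# Counts quoted in the abstract.
TOTAL_CODING_GENES = 22210      # coding genes listed across the three reference databases
DISPUTED_GENES = 2764           # coding in some reference sets, not coding in others
DOUBTFUL_AGREED_GENES = 1470    # coding in all three sets, but with non-coding characteristics


def share(count: int, total: int = TOTAL_CODING_GENES) -> float:
    """Return count as a fraction of total."""
    return count / total


disputed_share = share(DISPUTED_GENES)
# Assumption: the title's figure pools both groups of genes.
unresolved_share = share(DISPUTED_GENES + DOUBTFUL_AGREED_GENES)

print(f"disputed:   {disputed_share:.1%} (about 1 in {1 / disputed_share:.1f})")
print(f"unresolved: {unresolved_share:.1%} (about 1 in {1 / unresolved_share:.1f})")
```

Under that assumption the disputed genes come to roughly 12% of the total (about one in eight) and the pooled set to roughly 19% (almost one in five), matching the abstract and the title respectively.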

SUBMITTER: Abascal F

PROVIDER: S-EPMC6101605 | biostudies-literature | 2018 Aug

REPOSITORIES: biostudies-literature

Publications

Loose ends: almost one in five human genes still have unresolved coding status.

Federico Abascal, David Juan, Irwin Jungreis, Manolis Kellis, Laura Martinez, Maria Rigau, Jose Manuel Rodriguez, Jesus Vazquez, Michael L Tress

Nucleic Acids Research, 2018 Aug 1, issue 14

PMID: 29982784

Similar Datasets

V(D)J recombination in zebrafish: Normal joining products with accumulation of unresolved coding ends and deleted signal ends.
| S-EPMC1785108 | biostudies-literature

Androgen Deprivation Therapy-Linked Cardiovascular Disease Risk: Still Unresolved.
| S-EPMC4836804 | biostudies-literature

Tying up the Loose Ends: A Mathematically Knotted Protein.
| S-EPMC8182377 | biostudies-literature

The GC-content at the 5' ends of human protein-coding genes is undergoing mutational decay.
| S-EPMC11323403 | biostudies-literature

Loose Ends in the Cortinarius Phylogeny: Five New Myxotelamonoid Species Indicate a High Diversity of These Ectomycorrhizal Fungi with South American Nothofagaceae.
| S-EPMC8148173 | biostudies-literature

Unresolved orthology and peculiar coding sequence properties of lamprey genes: the KCNA gene family as test case.
| S-EPMC3141671 | biostudies-literature

Tying up loose ends: the N-degron and C-degron pathways of protein degradation.
| S-EPMC7458402 | biostudies-literature

An atlas of human long non-coding RNAs with accurate 5' ends.
| S-EPMC6857182 | biostudies-literature

Electronic nicotine delivery systems (ENDS): not still ready to put on END.
| S-EPMC7399423 | biostudies-literature