Dataset Information

Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.

ABSTRACT: Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.
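
The four features named in the abstract (intron-absence, frame-disruption, polyadenylation, truncation) can be read as independent lines of evidence collected for each candidate region. The following is a minimal sketch of that idea, not the authors' pipeline: the `Candidate` record, its field names, the rule that intron-absence is mandatory, and the `min_evidence` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical record for one genomic region aligned to a known gene."""
    name: str
    has_introns: bool        # alignment is interrupted by intron-like gaps
    frame_disruptions: int   # frameshifts plus premature stop codons
    has_polya_tail: bool     # A-rich tract found downstream of the 3' end
    coding_coverage: float   # fraction of the parent coding region covered


def evidence_for(c: Candidate) -> List[str]:
    """List which of the four abstract-level features the candidate shows."""
    evidence = []
    if not c.has_introns:
        evidence.append("intron-absence")
    if c.frame_disruptions > 0:
        evidence.append("frame-disruption")
    if c.has_polya_tail:
        evidence.append("polyadenylation")
    if c.coding_coverage < 1.0:
        evidence.append("truncation")
    return evidence


def looks_processed(c: Candidate, min_evidence: int = 2) -> bool:
    """Assumed rule: intron-absence is required, plus enough total evidence."""
    found = evidence_for(c)
    return "intron-absence" in found and len(found) >= min_evidence


if __name__ == "__main__":
    hit = Candidate("cand1", has_introns=False, frame_disruptions=2,
                    has_polya_tail=True, coding_coverage=0.94)
    print(hit.name, evidence_for(hit), looks_processed(hit))
```

Treating the features as a list of evidence rather than a single boolean keeps the reason for each call visible, which is useful when thresholds need tuning.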

SUBMITTER: Zhang Z

PROVIDER: S-EPMC403796 | biostudies-literature | 2003 Dec

REPOSITORIES: biostudies-literature

Publications

Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.

Zhang Z, Harrison PM, Liu Y, Gerstein M

Genome Research, 2003 Dec 1, issue 12

PMID: 14656962

Similar Datasets

Project description: Processed pseudogenes result from reverse-transcribed mRNAs. In general, because processed pseudogenes lack promoters, they are nonfunctional from the moment they are inserted into the genome. Subsequently, they freely accumulate substitutions, insertions and deletions. Moreover, the ancestral structure of a processed pseudogene can be easily inferred from the sequence of its functional homologous gene. Owing to these characteristics, processed pseudogenes are good neutral markers for studying genome evolution. Interest in these markers has recently increased, particularly for aiding gene prediction in genome annotation, functional genomics and genome evolution analysis (patterns of substitution). For these reasons, we have developed a method to annotate processed pseudogenes in complete genomes. To make them useful to different fields of research, we annotated them and stored them in a nucleic acid database. In this work, we screened both the mouse and human complete genomes from ENSEMBL to find processed pseudogenes generated from functional genes with introns. We used a conservative detection method in order to minimize the rate of false positives. Among the detected sequences, some still have a conserved open reading frame and some overlap gene locations. We designated all reverse-transcribed sequences as retroelements and, more strictly, designated as processed pseudogenes all retroelements not falling into these two categories (having a conserved open reading frame or overlapping gene locations). We annotated 5823 retroelements (5206 processed pseudogenes) in the human genome and 3934 (3428 processed pseudogenes) in the mouse genome. Compared with previous estimates, the total number of processed pseudogenes is underestimated, but the aim of this procedure was to generate a high-quality dataset.
To facilitate the use of processed pseudogenes in studying genome structure and evolution, DNA sequences from processed pseudogenes, and their functional reverse transcribed homologs, are now stored in a nucleic acid database, HOPPSIGEN. HOPPSIGEN can be browsed on the PBIL (Pole Bioinformatique Lyonnais) World Wide Web server (http://pbil.univ-lyon1.fr/) or fully downloaded for local installation.
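
The naming rule in this description (every reverse-transcribed sequence is a retroelement; a processed pseudogene is a retroelement with neither a conserved open reading frame nor an overlapping gene location) and the reported counts can be checked with a short sketch. The function names and label strings below are illustrative assumptions; only the four counts come from the description.

```python
from typing import Dict, Tuple


def classify(has_conserved_orf: bool, overlaps_gene: bool) -> str:
    """Apply the description's rule to one retroelement (labels are illustrative)."""
    if has_conserved_orf or overlaps_gene:
        return "retroelement (excluded from processed pseudogenes)"
    return "processed pseudogene"


def summarize(retroelements: int, processed: int) -> Tuple[int, float]:
    """Return (retroelements excluded by the rule, fraction kept as pseudogenes)."""
    return retroelements - processed, processed / retroelements


# Counts as reported in the project description: (retroelements, processed pseudogenes)
REPORTED: Dict[str, Tuple[int, int]] = {
    "human": (5823, 5206),
    "mouse": (3934, 3428),
}

if __name__ == "__main__":
    for species, (total, processed) in REPORTED.items():
        excluded, kept = summarize(total, processed)
        print(f"{species}: {excluded} excluded, {kept:.1%} kept")
```

By subtraction, 617 human and 506 mouse retroelements fall into one of the two excluded categories.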

Project description: Pseudogenes, in the case of protein-coding genes, are gene copies that have lost the ability to code for a protein; they are typically identified through annotation of disabled, decayed or incomplete protein-coding sequences. Processed pseudogenes (PPsigs) are made through mRNA retrotransposition. There is overwhelming genomic evidence for thousands of human PPsigs and also dozens of human processed genes that comprise complete retrotransposed copies of other genes. Here, we survey for an intermediate entity, the transcribed processed pseudogene (TPPsig), which is disabled but nonetheless transcribed. TPPsigs may affect expression of paralogous genes, as observed in the case of the mouse makorin1-p1 TPPsig. To elucidate their role, we identified human TPPsigs by mapping expressed sequences onto PPsigs and, reciprocally, extracting TPPsigs from known mRNAs. We consider only those PPsigs that are homologous to either non-mammalian eukaryotic proteins or protein domains of known structure, and require detection of identical coding-sequence disablements in both the expressed and genomic sequences. Oligonucleotide microarray data provide further expression verification. Overall, we find 166-233 TPPsigs (approximately 4-6% of PPsigs). Proteins/transcripts with the highest numbers of homologous TPPsigs generally have many homologous PPsigs and are abundantly expressed. TPPsigs are significantly over-represented near both the 5' and 3' ends of genes; this suggests that TPPsigs can be formed through gene-promoter co-option, or intrusion into untranslated regions. However, roughly half of the TPPsigs are located away from genes in the intergenic DNA and thus may be co-opting cryptic promoters of undesignated origin.
Furthermore, TPPsigs are unlike other PPsigs and processed genes in the following ways: (i) they do not show a significant tendency to either deposit on or originate from the X chromosome; (ii) only 5% of human TPPsigs have potential orthologs in mouse. This latter finding indicates that the vast majority of TPPsigs are lineage specific, which is likely linked to the well-documented, extensive lineage-specific SINE/LINE activity. The list of TPPsigs is available at http://www.biology.mcgill.ca/faculty/harrison/tppg/bppg.tov or http://pseudogene.org.
