Dataset Information

Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22.


ABSTRACT: We have developed an initial approach for annotating and surveying pseudogenes in the human genome. We search human genomic DNA for regions that are similar to known protein sequences and contain obvious disablements (i.e., mid-sequence stop codons or frameshifts), while ensuring minimal overlap with annotations of known genes. Pseudogenes can be divided into "processed" and "nonprocessed"; the former are reverse transcribed from mRNA (and therefore have no intron structure), whereas the latter presumably arise from genomic duplications. We annotate putative processed pseudogenes based on whether there is a continuous span of homology that is >70% of the length of the closest matching human protein (i.e., with introns removed), or whether there is evidence of polyadenylation. We have applied our approach to chromosomes 21 and 22, the first parts of the human genome completely sequenced, finding 190 new pseudogene annotations beyond the 264 reported by the sequencing centers. In total, on chromosomes 21 and 22, there are 189 processed pseudogenes, 195 nonprocessed pseudogenes, and, additionally, 70 pseudogenic immunoglobulin gene segments. (Detailed assignments are available at http://bioinfo.mbb.yale.edu/genome/pseudogene or http://genecensus.org/pseudogene.) By extrapolation, we predict that there could be up to approximately 20,000 pseudogenes in the whole human genome, with a little more than half of them processed. We have determined the main populations and clusters of pseudogenes on chromosomes 21 and 22. There are notable excesses of pseudogenes relative to genes near the centromeres of both chromosomes, indicating the existence of pseudogenic "hot-spots" in the genome. We have looked at the distribution of InterPro families and Gene Ontology (GO) functional categories in our pseudogenes. 
Overall, the families in both processed and nonprocessed pseudogene populations occur according to a power-law distribution similar to that found for the occurrence of gene families, with a few big families and many small ones. The processed population is, in particular, enriched in highly expressed ribosomal-protein sequences (approximately 20%), which appear fairly evenly distributed across the chromosomes. We compared processed pseudogenes of different evolutionary ages, observing a high degree of similarity between "ancient" and "modern" subpopulations. This may be attributable to the consistently high expression of ribosomal proteins over evolutionary time. Finally, we find that the chromosome 22 pseudogene population is dominated by immunoglobulin segments, which have a greater rate of disablement per amino acid than the other pseudogene populations and are also substantially more diverged.
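
The annotation rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Candidate` fields, the `classify` function, and the label strings are hypothetical names introduced here. It also assumes that any disabled candidate failing both "processed" criteria is labelled nonprocessed, which the abstract implies but does not state outright. Only the 70% threshold and the reported counts come from the abstract.

```python
from dataclasses import dataclass

# From the abstract: a continuous homology span covering >70% of the
# closest matching human protein suggests a processed pseudogene.
COVERAGE_THRESHOLD = 0.70


@dataclass
class Candidate:
    """Hypothetical record for one candidate genomic region."""
    span_length: int          # longest continuous homology span, in residues
    protein_length: int       # length of the closest matching human protein
    has_polya_evidence: bool  # evidence of polyadenylation
    has_disablement: bool     # mid-sequence stop codon or frameshift


def classify(c: Candidate) -> str:
    """Label a candidate using the criteria summarized in the abstract."""
    if not c.has_disablement:
        # Assumption: regions without an obvious disablement are not annotated.
        return "not_pseudogene"
    coverage = c.span_length / c.protein_length
    if coverage > COVERAGE_THRESHOLD or c.has_polya_evidence:
        return "processed"
    # Assumption: everything else with a disablement is treated as nonprocessed.
    return "nonprocessed"


# Counts reported in the abstract for chromosomes 21 and 22.
REPORTED_COUNTS = {"processed": 189, "nonprocessed": 195, "immunoglobulin_segments": 70}
TOTAL_ANNOTATIONS = sum(REPORTED_COUNTS.values())
```

The reported totals are internally consistent: the three populations sum to 454, which equals the 190 new annotations plus the 264 reported by the sequencing centers.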

SUBMITTER: Harrison PM 

PROVIDER: S-EPMC155275 | biostudies-literature | 2002 Feb

REPOSITORIES: biostudies-literature

Publications

Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22.

Harrison PM, Hegyi H, Balasubramanian S, Luscombe NM, Bertone P, Echols N, Johnson T, Gerstein M

Genome Research, 2002-02-01, issue 2


Similar Datasets

| S-EPMC122450 | biostudies-literature
| S-EPMC122594 | biostudies-literature
| S-EPMC463270 | biostudies-literature
| S-EPMC353210 | biostudies-literature
| S-EPMC187539 | biostudies-literature
2009-08-13 | GSE17600 | GEO
2009-07-28 | GSE17358 | GEO
2010-05-15 | E-GEOD-17358 | biostudies-arrayexpress
| S-EPMC149191 | biostudies-literature
2012-02-02 | GSE35475 | GEO