Dataset Information

Systematic identification of pseudogenes through whole genome expression evidence profiling.

ABSTRACT: The identification of pseudogenes is an integral part of genome annotation because of their abundance and their impact on the experimental analysis of functional genes. Most computational annotation systems are not optimized for systematic pseudogene recognition and often annotate pseudogenes as functional genes; users then propagate these errors to subsequent analyses and interpretations. To validate gene annotations and to identify pseudogenes that are potentially mis-annotated, we developed a novel approach based on whole genome profiling of existing transcript and protein sequences. This method has two important features: (i) it detects processed and non-processed pseudogenes equally well and (ii) it can identify transcribed pseudogenes. Applying this method to the human Ensembl gene predictions, we discovered that 2011 Ensembl genes (9% of the total) in the known and novel categories might be pseudogenes based on expression evidence. Of these, 1200 genes have no existing evidence of transcription, and 811 genes have transcription evidence but contain significant translation disruption. Approximately 40% of the 2011 identified pseudogenes have a multi-exon structure, which marks them as non-processed pseudogenes. We have demonstrated the power of whole genome profiling of expression sequences to improve the accuracy of gene annotations.
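
The two evidence categories in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `GeneEvidence` record, its boolean flags, and the category labels are hypothetical names chosen for illustration, and the toy genes are invented. Only the counts 1200, 811 and 2011 come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene summary of expression evidence (not from the paper)."""
    gene_id: str
    has_transcript_evidence: bool  # assumed flag: some transcript sequence matches the gene
    translation_disrupted: bool    # assumed flag: frameshift or premature stop in the coding region
    exon_count: int


def classify(gene: GeneEvidence) -> str:
    """Assign a gene to one of the two candidate classes named in the abstract."""
    if not gene.has_transcript_evidence:
        return "candidate_pseudogene_untranscribed"
    if gene.translation_disrupted:
        return "candidate_pseudogene_transcribed"
    return "supported_gene"


def structure(gene: GeneEvidence) -> str:
    """A multi-exon structure is read as non-processed; single-exon is left undetermined."""
    return "non_processed" if gene.exon_count > 1 else "undetermined"


# Invented toy input, for illustration only.
toy_genes = [
    GeneEvidence("toy1", False, False, 1),
    GeneEvidence("toy2", True, True, 4),
    GeneEvidence("toy3", True, False, 7),
]
summary = Counter(classify(g) for g in toy_genes)

# Consistency check on the figures reported in the abstract.
REPORTED_UNTRANSCRIBED = 1200
REPORTED_TRANSCRIBED_DISRUPTED = 811
REPORTED_TOTAL = 2011
counts_are_consistent = (
    REPORTED_UNTRANSCRIBED + REPORTED_TRANSCRIBED_DISRUPTED == REPORTED_TOTAL
)
```

In this sketch a gene without transcript evidence is flagged before translation is examined, which mirrors the order in which the abstract reports the two groups. The actual criteria used in the paper may differ.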

SUBMITTER: Yao A

PROVIDER: S-EPMC1636364 | biostudies-literature | 2006

REPOSITORIES: biostudies-literature

Publications

Systematic identification of pseudogenes through whole genome expression evidence profiling.

Alison Yao, Rosane Charlab, Peter Li

Nucleic Acids Research, 2006-08-31, issue 16

PMID: 16945953

Similar Datasets

Project description: Background: Transposable elements (TEs) are DNA sequences that are able to move from their location in the genome by cutting or copying themselves to another locus. As such, they are increasingly recognized as impacting all aspects of genome function. With the dramatic reduction in cost of DNA sequencing, it is now possible to resequence whole genomes in order to systematically characterize novel TE mobilization in a particular individual. However, this task is made difficult by the inherently repetitive nature of TE sequences, which in some eukaryotes compose over half of the genome sequence. Currently, only a few software tools dedicated to the detection of TE mobilization using next-generation-sequencing are described in the literature. They often target specific TEs for which annotation is available, and are only able to identify families of closely related TEs, rather than individual elements. Results: We present TE-Tracker, a general and accurate computational method for the de-novo detection of germ line TE mobilization from re-sequenced genomes, as well as the identification of both their source and destination sequences. We compare our method with the two classes of existing software: specialized TE-detection tools and generic structural variant (SV) detection tools. We show that TE-Tracker, while working independently of any prior annotation, bridges the gap between these two approaches in terms of detection power. Indeed, its positive predictive value (PPV) is comparable to that of dedicated TE software while its sensitivity is typical of a generic SV detection tool. TE-Tracker demonstrates the benefit of adopting an annotation-independent, de novo approach for the detection of TE mobilization events. We use TE-Tracker to provide a comprehensive view of transposition events induced by loss of DNA methylation in Arabidopsis. TE-Tracker is freely available at http://www.genoscope.cns.fr/TE-Tracker. Conclusions: We show that TE-Tracker accurately detects both the source and destination of novel transposition events in re-sequenced genomes. Moreover, TE-Tracker is able to detect all potential donor sequences for a given insertion, and can identify the correct one among them. Furthermore, TE-Tracker produces significantly fewer false positives than common SV detection programs, thus greatly facilitating the detection and analysis of TE mobilization events.

Project description: Pseudogenes are copies of genes that cannot produce a protein. They can be detected from disruptions to their apparent coding sequence, caused by frameshifts and premature stop codons. They are classed as either processed pseudogenes (made by reverse transcription from an mRNA) or duplicated pseudogenes, arising from duplication in the genomic DNA and subsequent disablement. Historically, there is anecdotal evidence that the fruit fly (Drosophila melanogaster) has few pseudogenes. Investigators have linked this to a high deletion rate of genomic DNA, for which there is evidence from genetic experiments on genome size. Here, we apply a homology-based pipeline that was developed previously to identify pseudogenes in other eukaryotic genomes, to the fruit fly, so as to derive the first complete survey of its pseudogene population. We find approximately 100 pseudogenes, with at least a sixth of these as candidate processed pseudogenes. This gives a much lower proportion of pseudogenes (compared with the size of the proteome) than in the genomes of other eukaryotes for which data are available (human, nematode and budding yeast). Closest matching proteins to Drosophila pseudogenes are significantly longer than the average protein in its proteome (up to approximately 60% more than the average protein's length), in contrast to the situation in the three other eukaryotic genomes. This may be due to the persistence of fragments of longer genes. In the fly pseudogene population, we found most pseudogenes for serine proteases (which are more abundant in the Drosophila lineage compared with the other eukaryotes), immunoglobulin-motif-containing proteins and cytochromes P450. Data on the sequences and positions of the putative pseudogenes are available at: http://www.pseudogene.org/fly. The detection of a small number of pseudogenes in the Drosophila genome and the higher mean length for the closest matching proteins to pseudogenes (possibly because remnants of genes encoding longer proteins are more likely to persist) are further evidence for a high deletion rate of genomic DNA in the fruit fly. The data are useful for molecular evolution study in Drosophila.

Project description: Background: We have developed a gene expression assay (Whole-Genome DASL), capable of generating whole-genome gene expression profiles from degraded samples such as formalin-fixed, paraffin-embedded (FFPE) specimens. Methodology/principal findings: We demonstrated a similar level of sensitivity in gene detection between matched fresh-frozen (FF) and FFPE samples, with the number and overlap of probes detected in the FFPE samples being approximately 88% and 95% of that in the corresponding FF samples, respectively; 74% of the differentially expressed probes overlapped between the FF and FFPE pairs. The WG-DASL assay is also able to detect 1.3-1.5-fold and 1.5-2-fold changes in intact and FFPE samples, respectively. The dynamic range for the assay is approximately 3 logs. Comparing the WG-DASL assay with an in vitro transcription-based labeling method yielded fold-change correlations of R² approximately 0.83, while fold-change comparisons with quantitative RT-PCR assays yielded R² approximately 0.86 and R² approximately 0.55 for intact and FFPE samples, respectively. Additionally, the WG-DASL assay yielded high self-correlations (R² > 0.98) with low intact RNA inputs ranging from 1 ng to 100 ng; reproducible expression profiles were also obtained with 250 pg total RNA (R² approximately 0.92), with approximately 71% of the probes detected in 100 ng total RNA also detected at the 250 pg level. When FFPE samples were assayed, 1 ng total RNA yielded self-correlations of R² approximately 0.80, while still maintaining a correlation of R² approximately 0.75 with standard FFPE inputs (200 ng). Conclusions/significance: Taken together, these results show that the WG-DASL assay provides a reliable platform for genome-wide expression profiling in archived materials. It also possesses utility within clinical settings where only limited quantities of samples may be available (e.g. microdissected material) or when minimally invasive procedures are performed (e.g. biopsied specimens).
