Project description: Purpose: To demonstrate that the results of gene expression and splicing analyses vary considerably depending on the reference genome used for mapping. Methods: We mapped and analyzed submitted RNA reads using different tools and reference genomes to evaluate the influence of the reference genome on differentially expressed gene (DEG) and alternative splicing analyses. Results: We observed that these differences in transcriptome analysis are due, in part, to single nucleotide polymorphisms between the sequenced individual and each respective reference genome, as well as to annotation differences between the reference genomes that exist even between syntenic orthologs. Conclusion: We conclude that, even between two closely related genomes of similar quality, using the reference genome most closely related to the species being sampled significantly improves transcriptome analysis.
Project description: Background: RNA-seq based on short reads generated by next-generation sequencing technologies has become the main approach to studying differential gene expression. Until now, the main applications of this technique have been to study the variation of gene expression in a whole organism, tissue or cell type under different conditions or at different developmental stages. However, RNA-seq also has great potential for use in evolutionary studies that investigate gene expression divergence in closely related species. Since the more reliable statistical methods for inferring differential gene expression are based on raw read count data, the reference genomes of the species to be compared need to be highly comparable. Results: We show that the published genomes and annotations of the three closely related Drosophila species, D. melanogaster, D. simulans and D. mauritiana, have limitations for inter-specific gene expression studies. This is due to missing gene models in at least one of the genome annotations, unclear orthology assignments and significant length differences of gene models between the species. We propose that published reference genomes should be re-annotated before they are used as references for RNA-seq experiments, in order to include as many genes as possible and to account for a potential length bias. To this end, we present a straightforward reciprocal re-annotation pipeline that allows reliable comparison of expression for nearly all genes annotated in D. melanogaster. We carried out an RNA-seq experiment in combination with quantitative real-time PCR to confirm that the newly generated gene sets do not result in the high number of false positives observed with references that still show a clear length difference of gene models between species. Conclusions: We conclude that our reciprocal re-annotation of previously published genomes facilitates the analysis of significantly more genes in an inter-specific differential gene expression study. We propose that the established pipeline can easily be applied to re-annotate other genomes of closely related animals and plants to improve comparative expression analyses.
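The length bias mentioned in this description can be illustrated with a small sketch. It is not part of the described re-annotation pipeline; the read counts, gene model lengths and the helper name `reads_per_kilobase` are made up for illustration. It assumes, as a simplification, that the expected raw read count of a gene scales with the length of its gene model.

```python
def reads_per_kilobase(read_count: int, gene_length_bp: int) -> float:
    """Scale a raw read count by gene model length (reads per 1,000 bp)."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return read_count / (gene_length_bp / 1000)


# Hypothetical ortholog pair: the gene model is annotated as 3,000 bp in one
# species and 2,000 bp in the other, with the same underlying expression level.
ortholog = {
    "species_a": {"reads": 600, "length_bp": 3000},
    "species_b": {"reads": 400, "length_bp": 2000},
}

# Comparing raw counts suggests a 1.5-fold expression difference ...
raw_ratio = ortholog["species_a"]["reads"] / ortholog["species_b"]["reads"]

# ... which disappears once the gene model lengths are taken into account.
rpk_a = reads_per_kilobase(**{"read_count": 600, "gene_length_bp": 3000})
rpk_b = reads_per_kilobase(**{"read_count": 400, "gene_length_bp": 2000})

print(raw_ratio, rpk_a, rpk_b)
```

In this toy case the apparent fold change comes entirely from the annotation, which is the kind of false positive that gene models of comparable length are meant to avoid when count-based methods are used.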
Project description: Gene expression was quantified in 4 naive corneas from BALB/c mice and 4 naive corneas from C57BL/6N mice, without intervention, by RNA-seq of total RNA with the Ovation Kit for model organisms. To avoid false positive differential expression arising from better alignment of the reads from C57BL/6 mice to the reference, which represents a closely related strain, while retaining the applicability of the standard reference genome annotation, two pseudogenomes were generated by incorporating the known variants into the reference, and reads were aligned to the resulting genomes. BAM files were then converted back to the standard reference with Lapels, which converts genome coordinates and adjusts CIGAR strings. After this conversion, expression can again be quantified with respect to the standard gene model (here Ensembl version 94).
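The pseudogenome idea can be sketched in a few lines. This is a minimal illustration, not the tooling used in the study: the function name `apply_snps`, the toy sequence and the variant list are hypothetical, and only single-nucleotide substitutions are handled. Insertions and deletions would shift coordinates, which is why a coordinate conversion step such as the one described above is needed in practice.

```python
def apply_snps(reference: str, snps: dict[int, tuple[str, str]]) -> str:
    """Return a copy of `reference` with SNPs substituted in.

    `snps` maps a 1-based position to a (reference_allele, alternate_allele)
    pair. The reference allele is checked so that a mismatched variant list
    fails loudly instead of silently producing a wrong sequence.
    """
    seq = list(reference)
    for pos, (ref_allele, alt_allele) in snps.items():
        observed = seq[pos - 1]
        if observed != ref_allele:
            raise ValueError(
                f"position {pos}: expected {ref_allele!r}, found {observed!r}"
            )
        seq[pos - 1] = alt_allele
    return "".join(seq)


# Toy reference and strain-specific variants (made-up values).
reference = "ACGTACGT"
strain_snps = {2: ("C", "T"), 7: ("G", "A")}

pseudogenome = apply_snps(reference, strain_snps)
print(pseudogenome)  # ATGTACAT
```

Because only substitutions are applied here, the pseudogenome has the same length as the reference and positions map one to one.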
Project description: With the creation of accurate, chromosome-scale genomes, the next challenge facing the genomics community is the accurate identification of transcriptional units, distinguishing them from aberrant transcriptional noise. This has proven to be a challenge because annotation by traditional means, such as short-read RNA-seq followed by transcriptome assembly, is prone to generating in-silico artifacts. To address this issue, we took advantage of epigenomic data in the form of ChIP-seq to annotate plant genomes in an unbiased manner, identify potential annotation issues and identify novel genes. Histone modifications appear in the genome in a reproducible and predictable manner, making them an ideal resource for annotation. Trimethylation of histone 3 lysine 4 (H3K4me3) and acetylation of histone 3 lysine 56 (H3K56ac) are well documented to coincide with initiation of transcription by polymerase II (Pol II) at promoter sequences. These initiation marks, paired with marks deposited across the gene body during transcriptional elongation, such as histone 3 lysine 36 tri-methylation (H3K36me3) and histone 3 lysine 4 mono-methylation (H3K4me1), offer a framework for identifying complete transcriptional units. We leveraged these data on a genome-wide scale, allowing for identification of annotations discordant with empirical data. In total, 13,159 potential annotation issues were found in Zea mays across three different tissues, which were corroborated using complementary RNA-based approaches. Upon correction and validation, genes were extended by an average of 2,128 base pairs, and the length of newly discovered genes was 1,962 base pairs. Application of this method to five additional plant genomes revealed a variety of novel gene annotations, including 13,836 in Asparagus officinalis, 2,724 in Setaria viridis, 2,446 in Sorghum bicolor, 8,631 in Glycine max, and 2,585 in Phaseolus vulgaris.
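As a rough illustration of how gene-body marks might expose a truncated annotation, the sketch below flags annotated genes whose overlapping mark domain extends beyond the annotated boundaries. It is not the authors' method: the function name `flag_extension_candidates`, the half-open interval convention, the 200 bp threshold and all coordinates are assumptions made for the example, and strand is ignored.

```python
def flag_extension_candidates(
    genes: dict[str, tuple[int, int]],
    domains: list[tuple[int, int]],
    min_extension_bp: int = 200,  # hypothetical threshold
) -> dict[str, int]:
    """Map gene name -> bases by which an overlapping domain exceeds the gene.

    `genes` and `domains` use half-open (start, end) coordinates on one
    chromosome. A gene is reported when an overlapping domain (for example an
    elongation-mark domain) reaches at least `min_extension_bp` beyond the
    annotated gene boundaries.
    """
    flagged: dict[str, int] = {}
    for name, (g_start, g_end) in genes.items():
        for d_start, d_end in domains:
            if not (d_start < g_end and g_start < d_end):
                continue  # domain does not overlap this gene
            extra = max(0, g_start - d_start) + max(0, d_end - g_end)
            if extra >= min_extension_bp:
                flagged[name] = max(flagged.get(name, 0), extra)
    return flagged


# Made-up annotation and mark domains.
genes = {"geneA": (1000, 3000), "geneB": (5000, 7000)}
domains = [(1000, 4500), (5000, 6900)]

candidates = flag_extension_candidates(genes, domains)
print(candidates)  # {'geneA': 1500}
```

Here `geneA` is flagged because the mark domain continues 1,500 bp past its annotated end, while `geneB` is fully consistent with its domain. Any such candidate would still need independent support, as the description notes for the RNA-based corroboration.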
Project description: The model organism Encyclopedia of DNA Elements project (modENCODE) has produced a comprehensive annotation of D. melanogaster transcript models based on an enormous amount of high-throughput experimental data. However, some transcribed elements may not be functional, and technical artifacts may lead to erroneous inference of transcription. Inter-species comparison lends confidence to predicted annotations, since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function. We have performed RNA-Seq and CAGE-Seq experiments on more than 80 samples from multiple tissues and stages of 15 Drosophila species, including 8 previously unsequenced genomes. We have found strikingly conserved sequence, expression, and splicing for the vast majority of transcript models in the modENCODE annotation (e.g. 99% of coding sequence (CDS) exons, 88% of untranslated region (UTR) exons, and 87% of splicing events), indicating that the transcriptome annotation is of very high quality. We also describe dynamic transcriptome evolution within the Drosophila genus, including conserved promoter structure, labile positions of transcription start sites, and rapidly evolving RNA-editing events. We demonstrate how this phylogenetic approach to DNA element validation will prove useful in the annotation of other high-priority genomes, especially genomes that are less compact than that of Drosophila (e.g. the vast majority of vertebrate genomes).