Project description:The corneal endothelium maintains corneal transparency; consequently, damage to this endothelium by a number of pathological conditions results in severe vision loss. Publicly available expression databases of human tissues are useful for investigating the pathogenesis of diseases and for developing new therapeutic modalities; however, databases for ocular tissues, and especially the corneal endothelium, remain scarce. Here, we have generated a transcriptome dataset from the ribosomal RNA-depleted total RNA from the corneal endothelium of eyes from seven Caucasians without ocular diseases. The results of principal component analysis and the correlation coefficients (ranging from 0.87 to 0.96) suggested high homogeneity among the samples in our RNA-Seq dataset, as well as sufficient amount and quality. The expression profile of tissue-specific marker genes indicated only limited, if any, contamination by other layers of the cornea, while the Smirnov-Grubbs test confirmed the absence of outlier samples. The dataset presented here should be useful for investigating the function/dysfunction of the cornea, as well as for extended transcriptome analyses integrated with expression data for non-coding RNAs.
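The sample-homogeneity check described above rests on pairwise correlation coefficients between samples. A minimal sketch of that kind of calculation follows, assuming Pearson correlation (the abstract does not name the coefficient) and using made-up expression values; the function and sample names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(samples):
    """Return {(sample_a, sample_b): r} for every unordered pair of samples."""
    return {
        (a, b): pearson(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Hypothetical per-gene expression values; NOT data from the study.
samples = {
    "sample_1": [1.0, 2.0, 3.0, 4.0],
    "sample_2": [2.0, 4.0, 6.0, 8.0],
    "sample_3": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(samples)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

In practice the vectors would hold normalized (often log-transformed) expression values for all genes, and the range of the resulting coefficients would be reported as in the abstract.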
Project description:Experimental single-cell approaches are becoming widely used for many purposes, including investigation of the dynamic behaviour of developing biological systems. Consequently, a large number of computational methods for extracting dynamic information from such data have been developed. One example is RNA velocity analysis, in which spliced and unspliced RNA abundances are jointly modeled in order to infer a 'direction of change' and thereby a future state for each cell in the gene expression space. Naturally, the accuracy and interpretability of the inferred RNA velocities depend crucially on the correctness of the estimated abundances. Here, we systematically compare five widely used quantification tools, in total yielding thirteen different quantification approaches, in terms of their estimates of spliced and unspliced RNA abundances in five experimental droplet scRNA-seq data sets. We show that there are substantial differences between the quantifications obtained from different tools, and identify typical genes for which such discrepancies are observed. We further show that these abundance differences propagate to the downstream analysis, and can have a large effect on estimated velocities as well as the biological interpretation. Our results highlight that abundance quantification is a crucial aspect of the RNA velocity analysis workflow, and that both the definition of the genomic features of interest and the quantification algorithm itself require careful consideration.
Project description:Barcode swapping results in the mislabelling of sequencing reads between multiplexed samples on patterned flow-cell Illumina sequencing machines. This may compromise the validity of numerous genomic assays; however, the severity and consequences of barcode swapping remain poorly understood. We have used two statistical approaches to robustly quantify the fraction of swapped reads in two plate-based single-cell RNA-sequencing datasets. We found that approximately 2.5% of reads were mislabelled between samples on the HiSeq 4000, which is lower than previous reports. We observed no correlation between the swapped fraction of reads and the concentration of free barcode across plates. Furthermore, we have demonstrated that barcode swapping may generate complex but artefactual cell libraries in droplet-based single-cell RNA-sequencing studies. To eliminate these artefacts, we have developed an algorithm to exclude individual molecules that have swapped between samples in 10x Genomics experiments, allowing the continued use of cutting-edge sequencing machines for these assays.
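The abstract above does not spell out how swapped molecules are excluded. The sketch below shows one plausible heuristic, not the authors' algorithm: a molecule, identified by its cell barcode, UMI and gene, that appears in several multiplexed samples is kept only in the sample holding a dominant share of its reads and is discarded otherwise. The function name, the tuple layout and the 0.8 threshold are assumptions made for illustration.

```python
from collections import defaultdict


def remove_swapped(molecules, min_fraction=0.8):
    """Filter molecules that appear in more than one sample.

    molecules: list of (sample, cell_barcode, umi, gene, reads) tuples,
    with at most one entry per sample for a given molecule.
    min_fraction: hypothetical threshold on the share of reads that the
    best-supported sample must hold for the molecule to be kept.
    """
    groups = defaultdict(list)
    for sample, cell, umi, gene, reads in molecules:
        groups[(cell, umi, gene)].append((sample, reads))

    kept = []
    for (cell, umi, gene), observations in groups.items():
        total = sum(reads for _, reads in observations)
        best_sample, best_reads = max(observations, key=lambda obs: obs[1])
        # Keep the molecule only where one sample clearly dominates.
        if best_reads / total >= min_fraction:
            kept.append((best_sample, cell, umi, gene, best_reads))
    return kept


# Made-up records for illustration only.
molecules = [
    ("A", "AAAC", "UMI1", "GeneX", 95),
    ("B", "AAAC", "UMI1", "GeneX", 5),   # likely swapped copy, dropped
    ("A", "AAAG", "UMI2", "GeneY", 10),
    ("B", "AAAG", "UMI2", "GeneY", 10),  # ambiguous, both copies dropped
    ("B", "AAAT", "UMI3", "GeneZ", 7),   # seen in one sample only, kept
]
kept = remove_swapped(molecules)
print(kept)
```

The rationale for this kind of filter is that the same cell barcode, UMI and gene combination is very unlikely to arise independently in two samples, so a shared combination points to swapping.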
Project description:Transposable elements (TEs), also known as "jumping genes", are repetitive sequences capable of changing their location within the genome. They are key players in many different biological processes in health and disease. Therefore, reliable quantification of their expression as transcriptional units is crucial to distinguish their independent expression from the transcription of their sequences as part of canonical transcripts. TE quantification faces several difficulties, the most important being low read mappability: the repetitive nature of TEs prevents unambiguous mapping of reads originating from their sequences. A large fraction of TE fragments localizes within introns, which led to the hypothesis that intron retention (IR) can be an additional source of bias, potentially affecting accurate TE quantification. IR occurs when introns, normally removed by the splicing machinery, are maintained in mature transcripts. IR is a widespread mechanism affecting many different genes with cell type-specific patterns. We hypothesized that, in an RNA-seq experiment, reads derived from retained introns can introduce a bias in the detection of overlapping, independent TE RNA expression. In this study we performed a meta-analysis using public RNA-seq data from lymphoblastoid cell lines and show that IR can impact TE quantification using established tools with default parameters. Reads mapped on intronic TEs were indeed associated with the expression of TEs and influenced their correct quantification as independent transcriptional units. We confirmed these results using additional independent datasets, demonstrating that this bias does not appear in samples where IR is not present and that differential TE expression does not affect IR quantification.
We concluded that IR causes the over-quantification of intronic TEs and differential IR might be confused with differential TEs expression. Our results should be taken into account for a correct quantification of TEs expression from RNA-seq data, especially in samples in which IR is abundant.
Project description:Sequencing chromatin-associated RNA using libraries from the chromatin fraction makes it possible to characterize RNA processing driven by disassociated subunits. Here, we present an experimental strategy and computational pipeline for processing chromatin-associated RNA-seq data to detect and quantify readthrough transcripts. We describe steps for constructing degron mouse embryonic stem cells, detecting readthrough genes, data processing, and data analysis. This protocol can be adapted to various biological scenarios and other types of nascent RNA-seq, such as TT-seq. For complete details on the use and execution of this protocol, please refer to Li et al. (2023).
Project description:Chromatin in higher eukaryotic nuclei is extensively bound by various RNA species. We recently developed a method for in situ capture of global RNA interactions with DNA by deep sequencing (GRID-seq) of fixed permeabilized nuclei that allows identification of the entire repertoire of chromatin-associated RNAs in an unbiased manner. The experimental design of GRID-seq is related to those of two recently published strategies (MARGI (mapping RNA-genome interactions) and ChAR-seq (chromatin-associated RNA sequencing)), which also use a bivalent linker to ligate RNA and DNA in proximity. Importantly, however, GRID-seq also implements a combined experimental and computational approach to control nonspecific RNA-DNA interactions that are likely to occur during library construction, which is critical for accurate interpretation of detected RNA-DNA interactions. GRID-seq typically finds both coding and non-coding RNAs (ncRNAs) that interact with tissue-specific promoters and enhancers, especially super-enhancers, from which a global promoter-enhancer connectivity map can be deduced. Here, we provide a detailed protocol for GRID-seq that includes nuclei preparation, chromatin fragmentation, RNA and DNA in situ ligation with a bivalent linker, PCR amplification and high-throughput sequencing. To further enhance the utility of GRID-seq, we include a pipeline for data analysis, called GridTools, into which key steps such as background correction and inference of genomic element proximity are integrated. For researchers experienced in molecular biology with minimal bioinformatics skills, the protocol typically takes 4-5 d from cell fixation to library construction and 2-3 d for data processing.
Project description:Background: Ribosomal proteins (RPs) have about 2000 pseudogenes in the human genome. While anecdotal reports of RP pseudogene transcription exist, it is unclear to what extent these pseudogenes are transcribed. RP pseudogene transcription is difficult to identify in microarrays because of potential cross-hybridization between transcripts from the parent genes and pseudogenes. Transcriptome sequencing (RNA-seq) has recently provided an opportunity to ascertain the transcription of pseudogenes. A challenge for pseudogene expression discovery in RNA-seq data lies in the difficulty of uniquely identifying reads mapped to pseudogene regions, which are typically also similar to the parent genes. Results: Here we developed a specialized pipeline for pseudogene transcription discovery. We first construct a "composite genome" that includes the entire human genome sequence as well as mRNA sequences of real ribosomal protein genes. We then map all sequence reads to the composite genome and retain only exact matches. Moreover, we restrict our analysis to strictly defined mappable regions and calculate RPKM values as a measurement of pseudogene transcription levels. We report evidence for the transcription of RP pseudogenes in 16 human tissues. By analyzing the Human Body Map 2.0 study RNA-sequencing data using our pipeline, we identified one RP pseudogene (PGOHUM-249508) that is transcribed with RPKM 170 in thyroid. Moreover, three other RP pseudogenes are transcribed with RPKM > 10, a level similar to that of the normal RP genes, in white blood cell, kidney, and testes, respectively. Furthermore, an additional thirteen RP pseudogenes have RPKM > 5, corresponding to the 20th to 30th percentile among all genes. 
Unlike ribosomal protein genes that are constitutively expressed in almost all tissues, RP pseudogenes are differentially expressed, suggesting that they may contribute to tissue-specific biological processes. Conclusions: Using a specialized bioinformatics method, we identified the transcription of ribosomal protein pseudogenes in human tissues using RNA-seq data.
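The pipeline above reports transcription levels as RPKM (reads per kilobase of region per million mapped reads). A minimal sketch of the standard RPKM formula follows; the function name and the input numbers are hypothetical, and using the mappable length as the region length is an assumption based on the abstract's restriction to mappable regions.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads.

    RPKM = read_count * 1e9 / (region_length_bp * total_mapped_reads)
    """
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


# Hypothetical numbers, not values from the study:
# 50 exact-match reads on a 1,000 bp mappable region, 10 million mapped reads.
example_rpkm = rpkm(read_count=50, region_length_bp=1000, total_mapped_reads=10_000_000)
print(example_rpkm)
```

Dividing by region length and by sequencing depth makes values comparable across regions of different size and across libraries of different depth, which is what allows thresholds such as RPKM > 5 or RPKM > 10 to be applied across tissues.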
Project description:Summary: We have developed an RNA-Seq analysis workflow for single-ended Illumina reads, termed RseqFlow. This workflow includes a set of analytic functions, such as quality control for sequencing data, signal tracks of mapped reads, calculation of expression levels, identification of differentially expressed genes and coding SNP calling. This workflow is formalized and managed by the Pegasus Workflow Management System, which maps the analysis modules onto available computational resources, automatically executes the steps in the appropriate order and supervises the whole running process. RseqFlow is available as a Virtual Machine with all the necessary software, which eliminates any complex configuration and installation steps. Availability and implementation: http://genomics.isi.edu/rnaseq Contact: wangying@xmu.edu.cn; knowles@med.usc.edu; deelman@isi.edu; tingchen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:By measuring messenger RNA levels for all genes in a sample, RNA-seq provides an attractive option to characterize the global changes in transcription. RNA-seq is becoming a widely used platform for gene expression profiling. However, real transcription signals in RNA-seq data are confounded with measurement and sequencing errors and other random biological/technical variation. To extract biologically useful information about the transcription process from RNA-seq data, we propose to use a second-order ordinary differential equation (ODE) to model the data. We use differential principal analysis to develop statistical methods for estimating the location-varying coefficients of the ODE. We validate the accuracy with which the ODE model fits the RNA-seq data by prediction analysis and 5-fold cross validation. To further evaluate the performance of the ODE model for RNA-seq data analysis, we used the location-varying coefficients of the second-order ODE as features to classify normal and tumor cells. We demonstrate that even using the ODE model for a single gene we can achieve high classification accuracy. We also conduct response analysis to investigate how the transcription process responds to the perturbation of external signals and identify dozens of genes that are related to cancer.
Project description:Recent advances in high-throughput RNA sequencing (RNA-seq) have enabled tremendous leaps forward in our understanding of bacterial transcriptomes. However, computational methods for analysis of bacterial transcriptome data have not kept pace with the large and growing data sets generated by RNA-seq technology. Here, we present new algorithms, specific to bacterial gene structures and transcriptomes, for analysis of RNA-seq data. The algorithms are implemented in an open source software system called Rockhopper that supports various stages of bacterial RNA-seq data analysis, including aligning sequencing reads to a genome, constructing transcriptome maps, quantifying transcript abundance, testing for differential gene expression, determining operon structures and visualizing results. We demonstrate the performance of Rockhopper using 2.1 billion sequenced reads from 75 RNA-seq experiments conducted with Escherichia coli, Neisseria gonorrhoeae, Salmonella enterica, Streptococcus pyogenes and Xenorhabdus nematophila. We find that the transcriptome maps generated by our algorithms are highly accurate when compared with focused experimental data from E. coli and N. gonorrhoeae, and we validate our system's ability to identify novel small RNAs, operons and transcription start sites. Our results suggest that Rockhopper can be used for efficient and accurate analysis of bacterial RNA-seq data, and that it can aid with elucidation of bacterial transcriptomes.