Project description:We present 'Threshold-seq,' a new approach for determining thresholds in deep-sequencing datasets of short RNA transcripts. Threshold-seq addresses the critical question of how many reads need to support a short RNA molecule in a given dataset before it can be considered different from 'background.' The proposed scheme is easy to implement and incorporate into existing pipelines. Source code of Threshold-seq is freely available as an R package at: http://cm.jefferson.edu/threshold-seq/. Contact: isidore.rigoutsos@jefferson.edu. Supplementary data are available at Bioinformatics online.
Project description:Manduca sexta is a large lepidopteran insect widely used as a model to study the biochemistry of insect physiological processes. As a part of its genome project, over 50 cDNA libraries have been analyzed to profile gene expression in different tissues and life stages. While the RNA-seq data were used to study genes related to cuticle structure, chitin metabolism and immunity, a vast amount of the information has not yet been mined for understanding the basic molecular biology of this model insect. In fact, the basic features of these data, such as composition of the RNA-seq reads and lists of library-correlated genes, are unclear. From an extended view of all insects, clear-cut tempospatial expression data are rarely seen in the largest group of animals, including Drosophila and mosquitoes, mainly due to their small sizes. We obtained the transcriptome data, analyzed the raw reads in relation to the assembled genome, and generated heatmaps for clustered genes. Library characteristics (tissues, stages), number of mapped bases, and sequencing methods affected the observed percentages of genome transcription. While up to 40% of the reads were not mapped to the genome in the initial Cufflinks gene modeling, we identified the causes for the mapping failure and reduced the number of non-mappable reads to <8%. Similarities between libraries, measured based on library-correlated genes, clearly identified differences among tissues or life stages. We calculated gene expression levels and analyzed the most abundantly expressed genes in the libraries. Furthermore, we analyzed tissue-specific gene expression and identified 18 groups of genes with distinct expression patterns. We performed a thorough analysis of the 67 RNA-seq datasets to characterize new genomic features of M. sexta. Integrated knowledge of gene functions and expression features will facilitate future functional studies in this biochemical model insect.
Project description:Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. Polyester is freely available from Bioconductor (http://bioconductor.org/). Contact: jtleek@gmail.com. Supplementary data are available at Bioinformatics online.
Project description:Single-cell RNA sequencing (scRNA-seq) technologies offer numerous opportunities for revealing novel and potentially unexpected biological discoveries. scRNA-seq clustering helps elucidate cell-to-cell heterogeneity and uncover cell subgroups and cell dynamics at the group level. Two important aspects of scRNA-seq data analysis are introduced and discussed in the present review: relevant datasets and analytical tools. In particular, we reviewed popular scRNA-seq datasets and discussed scRNA-seq clustering models including K-means clustering, hierarchical clustering, and consensus clustering. Seven state-of-the-art scRNA-seq clustering methods were compared on five publicly available datasets. Two primary evaluation metrics, the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI), were used to evaluate these methods. Although unsupervised models can effectively cluster scRNA-seq data, these methods also face challenges. Some suggestions were provided for future research directions.
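The two evaluation metrics named in the preceding description, ARI and NMI, can be illustrated with a minimal sketch. This is not the review's own code: the function names `adjusted_rand_index` and `normalized_mutual_information` are hypothetical, the toy labels are made up for illustration, and NMI is normalized here by the arithmetic mean of the two entropies, which is one common convention among several.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same cells (1.0 = identical partitions)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: partitions trivially agree
    # Pair counts from the contingency table and from its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(labels_true, labels_pred):
    """NMI using arithmetic-mean normalization (an assumed convention)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_a = Counter(labels_true)
    count_b = Counter(labels_pred)
    # Mutual information from joint and marginal label frequencies.
    mi = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    denom = (entropy_a + entropy_b) / 2
    if denom == 0:
        return 1.0  # both labelings consist of a single cluster
    return mi / denom


# Toy example: predicted clusters match the reference up to a relabeling.
reference = ["T cell", "T cell", "B cell", "B cell"]
predicted = [1, 1, 0, 0]
ari = adjusted_rand_index(reference, predicted)
nmi = normalized_mutual_information(reference, predicted)
```

Both metrics ignore the cluster names themselves and compare only how cells are grouped, which is why a relabeled but otherwise identical clustering scores as a perfect match.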
Project description:RNA-sequencing (RNA-seq) is a widely used approach for accessing the transcriptome in biomedical research. Studies frequently include multiple samples taken from the same individual at various time points or under different conditions, so correct assignment of those samples to each participant is of great importance. Here, we propose taking advantage of typing the highly polymorphic genes from the human leukocyte antigen (HLA) complex in order to verify the correct allocation of RNA-seq samples to individuals. We introduce RNA2HLA, a novel quality control (QC) tool for performing study-wide HLA-typing for RNA-seq data and thereby identifying samples from a common source. RNA2HLA allows precise allocation and grouping of RNA samples based on their HLA types. Strikingly, RNA2HLA revealed wrongly assigned samples from publicly available datasets and thereby demonstrated the importance of this tool for the quality control of RNA-seq studies. In addition, our tool successfully extracts HLA alleles at four-digit resolution and can be used to perform massive HLA-typing from RNA-seq-based studies, which will serve multiple research purposes beyond sample QC.
Project description:Although the number of RNA-Seq datasets deposited publicly has increased over the past few years, incomplete annotation of the associated metadata limits their potential use. Because of the importance of RNA splicing in diseases and biological processes, we constructed a database called SFMetaDB by curating datasets related to RNA splicing factors. Our effort focused on the RNA-Seq datasets in which splicing factors were knocked down, knocked out or over-expressed, leading to 75 datasets corresponding to 56 splicing factors. These datasets can be used in differential alternative splicing analysis for the identification of the potential targets of these splicing factors and in other functional studies. Surprisingly, only ~15% of all the splicing factors have been studied by loss- or gain-of-function experiments using RNA-Seq. In particular, splicing factors with domains from a few dominant Pfam domain families have not been studied. This suggests a significant gap that needs to be addressed to fully elucidate the splicing regulatory landscape. Indeed, there are already mouse models available for ~20 of the unstudied splicing factors, and it can be a fruitful research direction to study these splicing factors in vitro and in vivo using RNA-Seq.
Project description:BACKGROUND: With the recent proliferation of single-cell RNA-Seq experiments, several methods have been developed for unsupervised analysis of the resulting datasets. These methods often rely on unintuitive hyperparameters and do not explicitly address the subjectivity associated with clustering. RESULTS: In this work, we present DendroSplit, an interpretable framework for analyzing single-cell RNA-Seq datasets that addresses both the clustering interpretability and clustering subjectivity issues. DendroSplit offers a novel perspective on the single-cell RNA-Seq clustering problem motivated by the definition of "cell type", allowing us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. We analyze several landmark single-cell datasets, demonstrating both the method's efficacy and computational efficiency. CONCLUSION: DendroSplit offers a clustering framework that is comparable to existing methods in terms of accuracy and speed but is novel in its emphasis on interpretability. We provide the full DendroSplit software package at https://github.com/jessemzhang/dendrosplit .
Project description:Trans-splicing mechanisms have been documented in many lineages that are widely distributed phylogenetically, including dinoflagellates. The spliced leader (SL) sequence itself is conserved in dinoflagellates, although its gene sequences and arrangements have diversified within or across different species. In this study, we present 18 Fugacium kawagutii SL genes identified from stranded RNA-seq reads. These genes typically have a single SL but can contain several partial SLs with lengths ranging from 103 to 292 bp. Unexpectedly, we find the SL gene transcripts contain sequences upstream of the canonical SL, suggesting that generation of mature transcripts will require additional modifications following trans-splicing. We have also identified 13 SL-like genes whose expression levels and length are comparable to Dino-SL genes. Lastly, introns in these genes were identified and a new site for Sm-protein binding was proposed. Overall, this study provides a strategy for fast identification of SL genes and identifies new sequences of F. kawagutii SL genes to supplement our understanding of trans-splicing.
Project description:Long non-coding RNAs (lncRNAs) are emerging as important regulatory molecules in developmental, physiological, and pathological processes. However, the precise mechanisms and functions of most lncRNAs remain largely unknown. Recent advances in high-throughput sequencing of immunoprecipitated RNAs after cross-linking (CLIP-Seq) provide powerful ways to identify biologically relevant protein-lncRNA interactions. In this study, by analyzing millions of RNA-binding protein (RBP) binding sites from 117 CLIP-Seq datasets generated by 50 independent studies, we identified 22,735 RBP-lncRNA regulatory relationships. We found that a single lncRNA is generally bound and regulated by one or multiple RBPs, the combination of which may coordinately regulate gene expression. We also revealed the expression correlation of these interaction networks by mining expression profiles of over 6000 normal and tumor samples from 14 cancer types. Our combined analysis of CLIP-Seq data and genome-wide association study data discovered hundreds of disease-related single nucleotide polymorphisms residing in the RBP binding sites of lncRNAs. Finally, we developed interactive web implementations to provide visualization, analysis, and downloading of the aforementioned large-scale datasets. Our study represents an important step in the identification and analysis of RBP-lncRNA interactions and shows that these interactions may play crucial roles in cancer and genetic diseases.
Project description:Dilated cardiomyopathy (DCM) is one of the most common causes of heart failure. Several studies have used RNA-sequencing (RNA-seq) to profile differentially expressed genes (DEGs) associated with DCM. In this study, we aimed to profile gene expression signatures and identify novel genes associated with DCM through a quantitative meta-analysis of three publicly available RNA-seq studies using human left ventricle tissues from 41 DCM cases and 21 control samples. Our meta-analysis identified 789 DEGs including 581 downregulated and 208 upregulated genes. Several DCM-related genes previously reported, including MYH6, CKM, NKX2-5 and ATP2A2, were among the top 50 DEGs. Our meta-analysis also identified 39 new DEGs that were not detected using those individual RNA-seq datasets. Some of those genes, including PTH1R, ADAM15 and S100A4, confirmed previous reports of associations with cardiovascular functions. Using DEGs from this meta-analysis, the Ingenuity Pathway Analysis (IPA) identified five activated toxicity pathways, including failure of heart as the most significant pathway. Among the upstream regulators, SMARCA4 was downregulated and prioritized by IPA as the top affected upstream regulator for several DCM-related genes. To our knowledge, this study is the first to perform a transcriptomic meta-analysis for clinical DCM using RNA-seq datasets. Overall, our meta-analysis successfully identified a core set of genes associated with DCM.