Project description:Circular RNA (circRNA) is mainly generated by the splice donor of a downstream exon joining to an upstream splice acceptor, a phenomenon known as backsplicing. It has been reported that circRNAs can function as microRNA (miRNA) sponges, transcriptional regulators, or potential biomarkers. The availability of large volumes of non-polyadenylated transcriptome data has facilitated the genome-wide identification of thousands of circRNAs. Several circRNA detection tools and pipelines have recently been developed, and users need practical guidelines on them, including a comprehensive and unbiased comparison. Here, we provide an improved and easy-to-use circRNA read simulator that produces reads mimicking backsplicing and supporting the circRNAs deposited in CircBase. Moreover, we compared the performance of 11 circRNA detection tools on both simulated and real datasets. We assessed their performance using precision, sensitivity, F1 score, and area under the curve (AUC). We conclude that no single method dominated on all of these metrics. Among the state-of-the-art tools, CIRI, CIRCexplorer, and KNIFE achieved a better balance between precision and sensitivity and compared favorably to the other methods.
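The precision, sensitivity, and F1 score named in this description can be illustrated with a minimal sketch. It assumes each circRNA is identified by a hypothetical (chromosome, start, end) tuple and that a simulated truth set is available; the function name and data layout are illustrative and are not taken from the study.

```python
def evaluate_calls(predicted, truth):
    """Return (precision, sensitivity, f1) for predicted circRNAs against a truth set.

    Both arguments are iterables of hashable circRNA identifiers,
    here assumed to be (chromosome, start, end) tuples.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # called and present in the truth set
    fp = len(predicted - truth)   # called but absent from the truth set
    fn = len(truth - predicted)   # present in the truth set but not called

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return precision, sensitivity, f1


# Toy example with made-up coordinates.
truth = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90), ("chr3", 10, 20)}
predicted = {("chr1", 100, 200), ("chr1", 300, 400), ("chr5", 1, 2)}
precision, sensitivity, f1 = evaluate_calls(predicted, truth)
```

F1 is the harmonic mean of precision and sensitivity, so a tool scores well on it only when the two are balanced.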
Project description:CircRNAs are novel members of the non-coding RNA family. CircRNAs have been known to exist for several decades; however, their widespread abundance has only recently become appreciated. Annotation of circRNAs depends on sequencing reads that span the backsplice junction and therefore map non-linearly to the genome. Several pipelines have been developed to specifically identify these non-linear reads and consequently predict the landscape of circRNAs based on deep sequencing datasets. Here, we use common RNA-seq datasets to scrutinize and compare the output from five different algorithms (circRNA_finder, find_circ, CIRCexplorer, CIRI, and MapSplice) and evaluate the levels of bona fide and false positive circRNAs based on RNase R resistance. By this approach, we observe surprisingly dramatic differences between the algorithms, specifically regarding the highly expressed circRNAs and the circRNAs derived from proximal splice sites. Collectively, this study emphasizes that circRNA annotation should be handled with care and that several algorithms should ideally be combined to achieve reliable predictions.
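One simple way to combine several algorithms, as recommended above, is majority-style voting. The sketch below is an assumption, not the procedure used in the study: it keeps a circRNA only when at least `min_tools` algorithms report it. The tool names are used only as dictionary keys, and the circRNA identifiers are hypothetical.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_tools=2):
    """Return the circRNAs reported by at least `min_tools` algorithms.

    calls_by_tool maps a tool name to an iterable of circRNA identifiers.
    """
    votes = Counter()
    for calls in calls_by_tool.values():
        votes.update(set(calls))  # set() so each tool votes once per circRNA
    return {circ for circ, n in votes.items() if n >= min_tools}


# Toy example with made-up identifiers.
calls = {
    "find_circ": {"circA", "circB"},
    "CIRI": {"circB", "circC"},
    "CIRCexplorer": {"circA", "circB"},
}
reliable = consensus_calls(calls, min_tools=2)
```

Raising `min_tools` trades sensitivity for fewer false positives.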
Project description:RNA, like DNA and proteins, can undergo modifications. To date, over 170 RNA modifications have been identified, leading to the emergence of a new research area known as epitranscriptomics. RNA editing is the most frequent RNA modification in mammalian transcriptomes, and two types have been identified: (1) the most frequent, adenosine to inosine (A-to-I); and (2) the less frequent, cytidine to uridine (C-to-U) RNA editing. Unlike other epitranscriptomic marks, RNA editing can be readily detected from RNA sequencing (RNA-seq) data without any chemical conversion of RNA before sequencing library preparation. Furthermore, analyzing RNA editing patterns from transcriptomic data provides an additional layer of information about the epitranscriptome. As the significance of epitranscriptomics, particularly RNA editing, gains recognition in various fields of biology and medicine, there is a growing interest in detecting RNA editing sites (RES) by analyzing RNA-seq data. To meet this interest, several bioinformatic tools are available. However, each tool has its advantages and disadvantages, which makes it difficult for bench scientists and clinicians to choose the most appropriate one. Here, we have benchmarked bioinformatic tools for detecting RES from RNA-seq data. We provide a comprehensive view of each tool and its performance using previously published RNA-seq data, and we offer recommendations on the most appropriate tool for future studies.
Project description:We have benchmarked six of the most recent software tools for cancer CNV calling on their detection accuracy, sensitivity, and reproducibility. We also explored the consistency of CNV calling across different orthogonal technologies, including optical mapping and microarrays. Using consensus results from six CNV callers and confirmation from three orthogonal methods, we established a high-confidence CNV call set for the reference sample.
Project description:Background: While RNA is often created by linear splicing during transcription, recent studies have found that non-canonical splicing sometimes occurs. Non-canonical splicing joins the 3' and 5' ends and forms so-called circular RNA. It is now believed that circular RNA plays important biological roles, such as affecting susceptibility to some diseases. During the past several years, multiple experimental methods have been developed to enrich circular RNA while degrading linear RNA. Although several useful software tools for circular RNA detection have been developed as well, these tools are based on read mapping and may miss many circular RNAs. Also, existing tools are slow on large data because of their dependence on read mapping. Method: In this paper, we present a new computational approach, named CircMarker, that is based on k-mers rather than read mapping for circular RNA detection. CircMarker takes advantage of transcriptome annotation files to create the k-mer table for circular RNA detection. Results: Empirical results show that CircMarker outperforms existing tools in circular RNA detection on accuracy and efficiency in many simulated and real datasets. Conclusions: We develop a new circular RNA detection method called CircMarker based on k-mer analysis. Our results on both simulated and real data demonstrate that CircMarker runs much faster and can find more circular RNA, with higher consensus-based sensitivity and a high accuracy ratio, compared with existing tools.
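The general idea of a junction k-mer table can be sketched as follows. This is a simplified illustration and not CircMarker's actual algorithm: it assumes exon sequences have already been extracted from an annotation, builds the k-mers that straddle a hypothetical backsplice junction (3' end of a downstream exon joined to the 5' start of an upstream exon), and counts how many of them occur in a read. All function names and sequences are hypothetical.

```python
def junction_kmers(donor_exon, acceptor_exon, k):
    """Return the k-mers that span the junction donor_exon 3' end -> acceptor_exon 5' start."""
    if k < 2:
        raise ValueError("k must be at least 2 to span a junction")
    flank = k - 1
    left = donor_exon[-flank:]      # last k-1 bases of the downstream (donor) exon
    right = acceptor_exon[:flank]   # first k-1 bases of the upstream (acceptor) exon
    joined = left + right
    cut = len(left)                 # index of the first base after the junction
    kmers = set()
    for i in range(len(joined) - k + 1):
        if i < cut < i + k:         # window holds bases from both sides of the junction
            kmers.add(joined[i:i + k])
    return kmers


def count_junction_hits(read, kmers, k):
    """Count the windows of length k in `read` that match a junction k-mer."""
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in kmers)


# Toy example with made-up exon sequences.
table = junction_kmers("AAAACCCC", "GGGGTTTT", k=4)
backsplice_read = "ACCCCGGGGT"   # crosses the donor -> acceptor junction
linear_read = "GGGGTTTTAAAACCCC"  # exons in linear order, no backsplice
```

Because the lookup is a set membership test per window, no read alignment is needed, which is the source of the speed advantage that k-mer approaches claim over mapping-based tools.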
Project description:Introduction: As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks. Methods: In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes are used to assess the effects of fragment length, low viral content, phage taxonomy, robustness to eukaryotic contamination, and computational resource usage. Simulated metagenomes are used to assess the effects of sequencing and assembly quality on tool performance. Finally, real human gut metagenomes and viromes are used to assess the differences and similarities in the phage communities predicted by the tools. Results: We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%. Discussion: Importantly, the benchmark datasets developed in this study are publicly available and reusable, so that newly developed tools can be compared in the future.
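Between-tool agreement of the kind reported above can be computed along the following lines. The description does not say how overlap was defined, so this sketch assumes the Jaccard index (shared contigs divided by contigs called by either tool); the function names, tool keys, and contig IDs are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections of contig IDs (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(calls):
    """Map each unordered pair of tools to the Jaccard index of their phage calls."""
    return {
        (t1, t2): jaccard(calls[t1], calls[t2])
        for t1, t2 in combinations(sorted(calls), 2)
    }


def fraction_flagged(calls, all_contigs):
    """Fraction of contigs marked as phage by at least one tool."""
    all_contigs = set(all_contigs)
    if not all_contigs:
        return 0.0
    flagged = set().union(*calls.values()) & all_contigs
    return len(flagged) / len(all_contigs)


# Toy example with made-up contig IDs.
contigs = [f"c{i}" for i in range(1, 11)]
calls = {
    "toolA": {"c1", "c2", "c3"},
    "toolB": {"c2", "c3", "c4"},
    "toolC": {"c5"},
}
overlaps = pairwise_overlap(calls)
max_overlap = max(overlaps.values())
any_tool = fraction_flagged(calls, contigs)
```

A different overlap definition, such as shared calls divided by the smaller call set, would give different percentages from the same predictions.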
Project description:Alternatively spliced transcript isoforms are commonly observed in higher eukaryotes. The expression levels of these isoforms are key for understanding normal functions in healthy tissues and the progression of disease states. However, accurate quantification of expression at the transcript level is limited with current RNA-seq technologies because of, for example, limited read length and the cost of deep sequencing. A large number of tools have been developed to tackle this problem, and we performed a comprehensive evaluation of these tools using both experimental and simulated RNA-seq datasets. We found that recently developed alignment-free tools are both fast and accurate. The accuracy of all methods was mainly influenced by the complexity of gene structures, and caution must be taken when interpreting quantification results for short transcripts. Using TP53 gene simulation, we discovered that both sequencing depth and the relative abundance of different isoforms affect quantification accuracy. Conclusions: Our comprehensive evaluation helps data analysts make an informed choice when selecting computational tools for isoform quantification.