Project description:Background: The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, which is the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation. Results: For this study, we leveraged publicly available data from sample NA12878 generated on Oxford Nanopore and sample NA24385 on Pacific Biosciences platforms. We employed state-of-the-art sequence alignment tools including GraphMap2, long-read aligner (LRA), Minimap2, CoNvex Gap-cost alignMents for Long Reads (NGMLR), and Winnowmap2. Minimap2 and Winnowmap2 were computationally lightweight enough for use at scale, while GraphMap2 was not. NGMLR took a long time and required many resources, but produced alignments each time. LRA was fast, but only worked on Pacific Biosciences data. The tools disagreed widely on which reads to leave unaligned, affecting the final genome coverage and the number of discoverable breakpoints. No alignment tool independently resolved all large structural variants (1,001-100,000 base pairs) present in the Database of Genomic Variants (DGV) for sample NA12878 or the truthset for NA24385. Conclusions: These results suggest a combined approach is needed for long-read sequencing (LRS) alignments for human genomics. Specifically, leveraging alignments from three tools will be more effective in generating a complete picture of genomic variability. It should be best practice to use an analysis pipeline that generates alignments with both Minimap2 and Winnowmap2 as they are lightweight and yield different views of the genome. Depending on the question at hand, the data available, and the time constraints, NGMLR and LRA are good options for a third tool.
If computational resources and time are not a factor for a given case or experiment, NGMLR will provide another view, and another chance to resolve a case. LRA, while fast, did not work on the nanopore data on our cluster, but PacBio results were promising in that those computations completed faster than Minimap2. Due to its significant burden on computational resources and slow run time, GraphMap2 is not an ideal tool for exploration of a whole human genome generated on a long-read sequencing platform.
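The argument for combining aligners can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the aligner labels (`aligner_A` to `aligner_C`), the variant identifiers, and the detected sets are toy values, not results from the study. It only shows how the union of per-tool variant sets can recover more of a truth set than any single tool.

```python
# Toy illustration: recall of a truth set per aligner versus the union of all aligners.
# All identifiers and sets below are hypothetical, not data from the study.

truth = {"sv1", "sv2", "sv3", "sv4", "sv5"}

# Structural variants resolved from each (hypothetical) aligner's output.
calls = {
    "aligner_A": {"sv1", "sv2", "sv3"},
    "aligner_B": {"sv2", "sv3", "sv4"},
    "aligner_C": {"sv1", "sv5"},
}


def recall(found, truth_set):
    """Fraction of truth-set variants present in `found`."""
    return len(found & truth_set) / len(truth_set)


per_tool = {name: recall(found, truth) for name, found in calls.items()}

# Combined approach: a variant counts as resolved if any aligner supports it.
combined = set().union(*calls.values())
combined_recall = recall(combined, truth)
missed_by_all = truth - combined

for name, value in per_tool.items():
    print(f"{name}: recall = {value:.2f}")
print(f"combined: recall = {combined_recall:.2f}, missed by all = {sorted(missed_by_all)}")
```

In this toy case no single set covers the truth set, while the union does; with real data the union would also accumulate each tool's false positives, which this sketch does not model.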
Project description:MicroRNAs are vital gene expression regulators, extensively studied worldwide. The large-scale characterization of miRNAomes is possible using next-generation sequencing (NGS). This technology offers great opportunities, but these cannot be fully exploited without proper and comprehensive bioinformatics analysis. This may be achieved by the use of reliable dedicated software; however, different programs may generate divergent results, leading to additional discrepancies. Thus, the aim of this study was to compare three bioinformatic algorithms dedicated to NGS-based microRNA profiling and validate them using an alternative method, namely RT-qPCR. The comparison analysis revealed differences in the number and sets of identified miRNAs. The qPCR confirmed the expression of the investigated microRNAs. The correlation analysis of NGS and qPCR measurements showed strong and significant coefficients for a subset of the tested miRNAs, including those detected by all three algorithms. Single miRNA variants (isomiRs) showed different levels of correlation with the qPCR data. The obtained results revealed the good performance of all tested programs, despite the observed differences. Moreover, they implied that some specific miRNAs may be differentially estimated using NGS technology and the qPCR method, regardless of the bioinformatics software used. These discrepancies may stem from many factors, including the composition of the isomiR profile, their abundance, length, and investigated species. In conclusion, in this study, we shed light on the bioinformatics aspects of miRNAome profiling, elucidating its complexity and pinpointing potential features influencing validation. Thus, qPCR validation results should be open to interpretation when not fully concordant with NGS results until further, additional analyses are conducted.
Project description:Background: Reference sequences play a vital role in next-generation sequencing (NGS), impacting mapping quality during genome analyses. However, reference genomes usually do not represent the full range of genetic diversity of a species as a result of geographical divergence and independent demographic events of different populations. For the mitochondrial genome (mitogenome), which occurs in high copy numbers in cells and is strictly maternally inherited, an optimal reference sequence has the potential to make mitogenome alignment both more accurate and more efficient. In this study, we used three different types of reference sequences for mitogenome mapping, i.e., the commonly used reference sequence (CU-ref), the breed-specific reference sequence (BS-ref) and the sample-specific reference sequence (SS-ref), and compared the accuracy of mitogenome alignment and SNP calling among them, for the purpose of proposing the optimal reference sequence for mitochondrial DNA (mtDNA) analyses of specific populations. Results: Four pigs, representing three different breeds, were sequenced at high throughput, and the reads were mapped to the reference sequences mentioned above. Aligning reads to a BS-ref gave the largest mapping ratio and the deepest coverage without increased running time. Next, single nucleotide polymorphism (SNP) calling was carried out by 18 detection strategies with the three tools SAMtools, VarScan and GATK with different parameters, using the BAM results from mapping to the BS-ref.
The results showed that all 18 strategies achieved the same high specificity and sensitivity, which suggested a high accuracy of mitogenome alignment by the BS-ref, since the outcome depended little on the choice of SNP calling tool or parameters. Conclusions: This study showed that different reference sequences representing different genetic relationships to sample reads influenced mitogenome alignment, with the breed-specific reference sequences being optimal for mitogenome analyses, which provides a refined processing perspective for NGS data.
Project description:High-throughput sequencing technologies have developed rapidly in recent years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
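A short sketch of the evaluation setup described above: five mappers combined with ten callers give 50 pipelines, and each resulting variant set can be scored by sensitivity and specificity against a gold standard. The tool labels (`mapper_1`, `caller_1`, ...), the candidate positions, and the variant sets are placeholders assumed for illustration, not the tools or data of the study.

```python
from itertools import product

# Placeholder labels: the abstract names only some of the tools, so generic names are used.
mappers = [f"mapper_{i}" for i in range(1, 6)]    # five read mappers
callers = [f"caller_{j}" for j in range(1, 11)]   # ten variant callers
pipelines = list(product(mappers, callers))       # every mapper/caller combination


def evaluate(called, truth, candidates):
    """Return (sensitivity, specificity) for one pipeline's variant set.

    called     -- positions reported as variants by the pipeline
    truth      -- positions that are true variants (gold standard)
    candidates -- all positions considered, variant or not
    """
    tp = len(called & truth)               # true variants that were called
    fp = len(called - truth)               # called, but not true variants
    fn = len(truth - called)               # true variants that were missed
    tn = len(candidates - called - truth)  # correctly left uncalled
    return tp / (tp + fn), tn / (tn + fp)


# Toy example with ten candidate positions (hypothetical values).
candidates = set(range(1, 11))
truth = {1, 2, 3, 4}
called = {1, 2, 3, 5}
sens, spec = evaluate(called, truth, candidates)

print(len(pipelines), "pipelines")
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In a real benchmark the `called` set would be filled once per entry of `pipelines`, so that all 50 combinations are scored against the same gold standard.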
Project description:RNA, like DNA and proteins, can undergo modifications. To date, over 170 RNA modifications have been identified, leading to the emergence of a new research area known as epitranscriptomics. RNA editing is the most frequent RNA modification in mammalian transcriptomes, and two types have been identified: (1) the most frequent, adenosine to inosine (A-to-I); and (2) the less frequent, cytosine to uracil (C-to-U) RNA editing. Unlike other epitranscriptomic marks, RNA editing can be readily detected from RNA sequencing (RNA-seq) data without any chemical conversions of RNA before sequencing library preparation. Furthermore, analyzing RNA editing patterns from transcriptomic data provides an additional layer of information about the epitranscriptome. As the significance of epitranscriptomics, particularly RNA editing, gains recognition in various fields of biology and medicine, there is a growing interest in detecting RNA editing sites (RES) by analyzing RNA-seq data. To cope with this increased interest, several bioinformatic tools are available. However, each tool has its advantages and disadvantages, which makes the choice of the most appropriate tool difficult for bench scientists and clinicians. Here, we have benchmarked bioinformatic tools to detect RES from RNA-seq data. We provide a comprehensive view of each tool and its performance using previously published RNA-seq data to suggest recommendations on the most appropriate tool for use in future studies.
Project description:RNA-Seq, a deep sequencing technique, promises to be a potential successor to microarrays for studying the transcriptome. One of many aspects of transcriptomics that are of interest to researchers is gene expression estimation. With rapid development in RNA-Seq, there are numerous tools available to estimate gene expression, each producing different results. However, we do not know which of these tools produces the most accurate gene expression estimates. In this study we have addressed this issue using Cufflinks, IsoEM, HTSeq, and RSEM to quantify RNA-Seq expression profiles. Comparing results of these quantification tools, we observe that RNA-Seq relative expression estimates correlate with RT-qPCR measurements in the range of 0.85 to 0.89, with HTSeq exhibiting the highest correlation. However, in terms of root-mean-square deviation of RNA-Seq relative expression estimates from RT-qPCR measurements, we find HTSeq to produce the greatest deviation. Therefore, we conclude that, though Cufflinks, RSEM, and IsoEM might not correlate as well as HTSeq with RT-qPCR measurements, they may produce expression values with higher accuracy.
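The distinction drawn above, that a tool can correlate best with RT-qPCR while also deviating most from it, can be shown with a minimal sketch. The expression values and the labels `tool_a` and `tool_b` are invented for illustration and do not correspond to any of the tools or measurements in the study; Pearson correlation is assumed here, since the abstract does not state which coefficient was used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rmsd(x, y):
    """Root-mean-square deviation between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical relative expression values for five genes.
qpcr = [1.0, 2.0, 3.0, 4.0, 5.0]
tool_a = [3.0, 4.0, 5.0, 6.0, 7.0]  # tracks qPCR exactly, but shifted by a constant offset
tool_b = [1.2, 1.8, 3.3, 3.9, 5.1]  # slightly noisy, but close to the qPCR values

for name, estimates in (("tool_a", tool_a), ("tool_b", tool_b)):
    print(f"{name}: r = {pearson(qpcr, estimates):.3f}, RMSD = {rmsd(qpcr, estimates):.3f}")
```

Correlation is insensitive to a constant offset or scale, so `tool_a` has the higher correlation even though its values lie further from the qPCR measurements; RMSD captures that distance directly.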
Project description:Noncoding small RNAs (sRNAs) packaged in bacterial outer membrane vesicles (OMVs) function as novel mediators of interspecies communication. While the role of bacterial sRNAs in enhancing virulence is well established, the role of sRNAs in the interaction between OMVs from phytopathogenic bacteria and their host plants remains unclear. In this study, we employ RNA sequencing to characterize differentially packaged sRNAs in OMVs of the phytopathogen Xanthomonas oryzae pv. oryzicola (Xoc). Our candidate sRNA (Xosr001) was abundant in OMVs and involved in the regulation of OsJMT1 to impair host stomatal immunity. Xoc loads Xosr001 into OMVs, which are specifically transferred into the mechanical tissues of rice leaves. Xosr001 suppresses OsJMT1 transcript accumulation in vivo, leading to a reduction in MeJA accumulation in rice leaves. Furthermore, the application of synthesized Xosr001 sRNA to the leaves of OsJMT1-HA-OE transgenic line results in the suppression of OsJMT1 expression by Xosr001. Notably, the OsJMT1-HA-OE transgenic line exhibited attenuated stomatal immunity and disease susceptibility upon infection with ΔXosr001 compared to Xoc. These results suggest that Xosr001 packaged in Xoc OMVs functions to suppress stomatal immunity in rice.
Project description:Whole genome bisulfite sequencing is currently at the forefront of epigenetic analysis, facilitating the nucleotide-level resolution of 5-methylcytosine (5mC) on a genome-wide scale. Specialized software has been developed to accommodate the unique difficulties in aligning such sequencing reads to a given reference, building on the knowledge acquired from model organisms such as human or Arabidopsis thaliana. As the field of epigenetics expands its purview to non-model plant species, new challenges arise which bring into question the suitability of previously established tools. Herein, nine short-read aligners are evaluated: Bismark, BS-Seeker2, BSMAP, BWA-meth, ERNE-BS5, GEM3, GSNAP, Last and segemehl. Precision-recall of simulated alignments, in comparison to real sequencing data obtained from three natural accessions, reveals on balance that BWA-meth and BSMAP are able to make the best use of the data during mapping. The influence of difficult-to-map regions, characterized by deviations in sequencing depth over repeat annotations, is evaluated in terms of the mean absolute deviation of the resulting methylation calls in comparison to a realistic methylome. Downstream methylation analysis is responsive to the handling of multi-mapping reads relative to mapping quality (MAPQ), and potentially susceptible to bias arising from the increased sequence complexity of densely methylated reads.
Project description:Background: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Results: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. Conclusion: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.