Project description:RNA sequencing (RNA-seq) is a widely used high-throughput method for characterizing transcriptomic dynamics across space and time. However, typical RNA-seq data analysis pipelines depend on either a sequenced genome or reference transcripts. This constraint makes RNA-seq challenging to use for species that lack both. To solve this problem, we developed CRSP, an RNA-seq pipeline that integrates a strategy based on multiple comparative species and does not depend on a specific sequenced genome or reference transcripts. Benchmarking suggests that CRSP quantifies gene expression levels with high accuracy.
Project description:Next-generation sequencing methods, such as RNA-seq, have permitted the exploration of gene expression in a range of organisms which have been studied in ecological contexts but lack a sequenced genome. However, the efficacy and accuracy of RNA-seq annotation methods using reference genomes from related species have yet to be robustly characterized. Here we conduct a comprehensive power analysis employing RNA-seq data from Drosophila melanogaster in conjunction with 11 additional genomes from related Drosophila species to compare annotation methods and quantify the impact of evolutionary divergence between transcriptome and the reference genome. Our analyses demonstrate that, regardless of the level of sequence divergence, direct genome mapping (DGM), where transcript short reads are aligned directly to the reference genome, significantly outperforms the widely used de novo and guided assembly-based methods in both the quantity and accuracy of gene detection. Our analysis also reveals that DGM recovers a more representative profile of Gene Ontology functional categories, which are often used to interpret emergent patterns in genomewide expression analyses. Lastly, analysis of available primate RNA-seq data demonstrates the applicability of our observations across diverse taxa. Our quantification of annotation accuracy and reduced gene detection associated with sequence divergence thus provides empirically derived guidelines for the design of future gene expression studies in species without sequenced genomes.
Project description:Microsatellites, also known as simple sequence repeats (SSRs), are the preferred type of marker for many genetic applications. In conjunction with the ongoing development of next-generation sequencing, several bioinformatic tools have been developed for identifying SSRs from genomic or transcriptomic sequences. Although these tools are handy for generating polymorphic SSRs, their application almost always depends on an existing reference genome or on self-assembly of a reference genome. With this in mind, we propose a pipeline for developing polymorphic SSRs that can be applied to species without reference genomes. Using a species without a reference genome (black Amur bream; Megalobrama terminalis Richardson, 1846) as a model, our pipeline effectively discovered polymorphic SSRs. Under different R parameter settings of a reference-free single nucleotide polymorphism (SNP) caller (ebwt2InDel), 258, 208, 102, and 11 polymorphic SSRs were mined, respectively. To quantify the accuracy of the polymorphic SSRs detected by our pipeline, we tested 25 SSRs in PCR experiments. Amplification succeeded with all primers, and most SSRs (23 SSRs, 92%) were polymorphic. Across 36 black Amur bream individuals, we obtained an average of 3.36 alleles per locus (range: 1 to 11). This demonstrates the effectiveness of our pipeline in identifying polymorphic SSRs and designing primers for SSR genotyping. Ultimately, our pipeline can effectively mine polymorphic SSRs for species without reference genomes, complementing SSR mining approaches based on reference genomes and helping to resolve biological issues that accompany these methods. Supplementary information: The online version contains supplementary material available at 10.1007/s13205-022-03313-0.
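To illustrate what SSR detection involves at its simplest, the sketch below scans a single sequence for perfect microsatellites with a regular expression. It is not the pipeline described above, which works without a reference and relies on ebwt2InDel; the function name `find_ssrs`, the motif length range, and the minimum repeat count are assumptions chosen only for illustration.

```python
import re

def find_ssrs(seq, min_motif=2, max_motif=6, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect SSRs in seq.

    Hypothetical helper: thresholds are illustrative defaults,
    not values taken from the abstract.
    """
    seq = seq.upper()
    # Group 2 is the motif (shortest candidate tried first);
    # it must then repeat at least (min_repeats - 1) more times.
    pattern = re.compile(
        r"(([ACGT]{%d,%d}?)\2{%d,})" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(2)
        # Skip motifs made of a single base (homopolymer runs).
        if len(set(motif)) == 1:
            continue
        hits.append((m.start(), motif, len(m.group(1)) // len(motif)))
    return hits

if __name__ == "__main__":
    print(find_ssrs("TTGACACACACACGGT"))
```

A scan like this only finds repeats; deciding whether a locus is polymorphic requires comparing repeat counts across individuals, which is the part the abstract validates by PCR.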
Project description:The use of reference DNA standards generated from cancer cell lines sequenced in the Cancer Genome Project to establish the sensitivity, specificity, accuracy and reproducibility of the WTSI GCLP sequencing pipeline
Project description:Metagenomic data compression is increasingly important as metagenomic projects face larger data volumes per sample and growing numbers of samples. Reference-based compression is a promising way to obtain a high compression ratio. However, existing microbial reference genome databases are not suitable for direct use as compression references because of their large size and redundancy, and different metagenomic cohorts often differ in microbial composition. We present a novel pipeline that generates simplified and tailored reference genomes for large metagenomic cohorts, enabling reference-based compression of metagenomic data. We constructed customized reference genomes, ranging from 2.4 to 3.9 GB, for 29 real metagenomic datasets and evaluated their compression performance. Reference-based compression achieved a compression ratio of over 20 for human whole-genome data and up to 33.8 for all samples, a 4.5-fold improvement over standard Gzip compression. Our method provides new insights into reference-based metagenomic data compression and has broad application potential for faster and cheaper data transfer, storage, and analysis.
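The figures above rest on two simple quantities: the compression ratio (original size divided by compressed size) and the fold improvement of one ratio over another. The sketch below shows how a Gzip baseline ratio could be measured with Python's standard `gzip` module. The function names are hypothetical, and the reference-based compressor itself is not reproduced here.

```python
import gzip

def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size."""
    if not compressed:
        raise ValueError("compressed data must not be empty")
    return len(raw) / len(compressed)

def gzip_ratio(raw: bytes) -> float:
    """Baseline ratio obtained with standard Gzip compression."""
    return compression_ratio(raw, gzip.compress(raw))

def fold_improvement(method_ratio: float, baseline_ratio: float) -> float:
    """How many times larger one compression ratio is than another."""
    return method_ratio / baseline_ratio

if __name__ == "__main__":
    # Highly repetitive toy input; real sequencing data compresses far less.
    toy = b"ACGT" * 10000
    print(f"gzip ratio on toy data: {gzip_ratio(toy):.1f}")
```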
Project description:The quest for genes representing genetic relationships of strains or individuals within populations and their evolutionary history is acquiring a new dimension of complexity with the advancement of next-generation sequencing (NGS) technologies. Sequencing an entire genome uncovers genetic variation in coding and non-coding regions and offers the possibility of studying Saccharomyces cerevisiae populations at the strain level. Nevertheless, the unfavorable cost-benefit ratio (the amount of detail disclosed by NGS against the time-consuming and expertise-demanding data assembly process) still precludes the application of these techniques to the routine assignment of yeast strains, making the selection of the most reliable molecular markers highly desirable. In this work we propose an original computational approach to discover genes that can be used as descriptors of population structure. We found 13 genes whose variability can be used to recapitulate the phylogeny obtained from genome-wide sequences. The approach that proved successful in yeasts can be generalized to any other population of individuals, given the availability of high-quality genomic sequences and a clear population structure to be targeted.
Project description:Transposable elements (TEs) are ubiquitous in genomes. Many of these TEs remain active and make up an important fraction of transcriptomes, with potential effects on the host genomes. The functional impact of TEs is well known for model organisms; however, in transcriptome analyses of non-model organisms, this information is ignored because TEs are difficult to identify and quantify. Here we develop ExplorATE, a pipeline for identifying and quantifying active TEs in non-model organisms that can be easily implemented within the R environment. Based on simulated data, we show that our pipeline accurately identifies and quantifies TEs, outperforming the tools commonly used in model organisms. We show the implementation of ExplorATE using real RNA-seq data from different tissues (liver, ovary, and brain) of Liolaemus parthenos, the only parthenogenetic lizard known to date in the entire clade Iguanidae (Pleurodonta). Our results show that a significant fraction of the transcriptome contains repeats; however, many of these are co-expressed with genes. Applying our pipeline to real data allowed the identification of the most abundant transposon families in each tissue. The ERV2, CR1, and SINE3 families were particularly abundant in the liver. A test data set is provided in the ExplorATE package.
Project description:HIV-1 proviral single-genome sequencing by limiting-dilution polymerase chain reaction (PCR) amplification is important for differentiating the sequence-intact from defective proviruses that persist during antiretroviral therapy (ART). Intact proviruses may rebound if ART is interrupted and are the barrier to an HIV cure. Oxford Nanopore Technologies (ONT) sequencing offers a promising, cost-effective approach to the sequencing of long amplicons such as near full-length HIV-1 proviruses, but the high diversity of HIV-1 and the ONT sequencing error render analysis of the generated data difficult. NanoHIV is a new tool that uses an iterative consensus generation approach to construct accurate, near full-length HIV-1 proviral single-genome sequences from ONT data. To validate the approach, single-genome sequences generated using NanoHIV consensus building were compared to Illumina® consensus building of the same nine single-genome near full-length amplicons and an average agreement of 99.4% was found between the two sequencing approaches.
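The agreement figure reported above is a percentage of matching positions between two consensus sequences. A minimal sketch of such a comparison follows, assuming the two sequences have already been aligned to equal length (with gap characters where needed). The function name `percent_agreement` is hypothetical, and this is not NanoHIV's own validation code.

```python
def percent_agreement(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical positions between two pre-aligned sequences.

    Assumes both inputs come from the same alignment, so they have
    equal length; gap characters are compared like any other symbol.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(aligned_a.upper(), aligned_b.upper()))
    return 100.0 * matches / len(aligned_a)

if __name__ == "__main__":
    # Toy example: one mismatch in ten positions.
    print(percent_agreement("ACGTACGTAC", "ACGTACGTAA"))
```

Whether gaps should count as mismatches or be excluded is a design choice that changes the reported figure; the abstract does not say which convention was used.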