Project description:A good physical map is essential to guide sequence assembly in de novo whole genome sequencing, especially when sequences are produced by high-throughput sequencing such as next-generation sequencing (NGS) technology. We here present a novel method, Feature sets-based Genome Mapping (FGM). With FGM, a physical map and draft whole genome sequences can be generated, anchored and integrated using the same data set of NGS sequences, independent of restriction digestion. A model of the method was created and its parameters were inspected by simulations using the Arabidopsis genome sequence. In the simulations, when a ~4.8X genome BAC library of 4,096 clones was used to sequence the whole genome, ~90% of clones were successfully connected to physical contigs, and 91.58% of genome sequences were mapped and connected to chromosomes. This method was experimentally verified using the existing physical map and genome sequence of rice. Of 4,064 clones covering 115 Mb of sequence selected from ~3 tiles of 3 chromosomes of a rice draft physical map, 3,364 clones were reconstructed into physical contigs and 98 Mb of sequence was integrated into the 3 chromosomes. The physical map-integrated draft genome sequences can provide permanent frameworks for eventually obtaining high-quality reference sequences by targeted sequencing, gap filling and combining other sequences.
Project description:Gene expression data generated from whole blood via next generation sequencing are frequently used in studies aimed at identifying mRNA-based biomarker panels with utility for diagnosis or monitoring of human disease. These investigations often employ data normalization techniques more typically used for analysis of data originating from solid tissues, which largely operate under the general assumption that specimens have similar transcriptome composition. However, this assumption may be violated when working with data generated from whole blood, which is more cellularly dynamic, leading to potential confounds. In this study, we used next generation sequencing in combination with flow cytometry to assess the influence of donor leukocyte counts on the transcriptional composition of whole blood specimens sampled from a cohort of 138 human subjects, and then examined the effect of four frequently used data normalization approaches on our ability to detect inter-specimen biological variance, using the flow cytometry data to benchmark each specimen's true cellular and molecular identity. Whole blood samples originating from donors with differing leukocyte counts exhibited dramatic differences in both genome-wide distributions of transcript abundance and gene-level expression patterns. Consequently, three of the normalization strategies we tested, namely median ratio (MRN), trimmed mean of M-values (TMM), and quantile normalization, noticeably masked the true biological structure of the data and impaired our ability to detect true inter-specimen differences in mRNA levels. The only strategy that improved our ability to detect true biological variance was simple scaling of read counts by sequencing depth, which, unlike the aforementioned approaches, makes no assumptions regarding transcriptome composition.
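As an illustration of the depth-scaling strategy named in the description above, the following is a minimal sketch and not the study's own code. The function name `scale_by_depth`, the per-million scaling factor, and the toy counts are all assumptions made for the example.

```python
def scale_by_depth(counts, per=1_000_000):
    """Scale each sample's raw read counts by that sample's total depth.

    counts: list of samples, each a list of raw read counts per gene.
    per:    scaling constant (assumed here to be one million, i.e. CPM).
    """
    scaled = []
    for sample in counts:
        depth = sum(sample)  # total reads sequenced for this sample
        if depth == 0:
            raise ValueError("sample has zero total reads")
        scaled.append([c * per / depth for c in sample])
    return scaled


# Hypothetical toy data: two samples with the same composition but
# a 20-fold difference in sequencing depth.
samples = [[10, 30, 60], [200, 600, 1200]]
cpm = scale_by_depth(samples)
```

Only each sample's own total enters the calculation, so nothing is assumed about how transcriptome composition compares between specimens.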
Project description:A challenge of next generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata along with RNA-seq datasets from other studies to understand factors that contribute to contamination. Here we report, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting bulk and single cell (scRNA-seq) data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses.
Project description:Many questions can be explored thanks to whole-genome data. The aim of this study was to overcome their main limits, software availability and database accuracy, and estimate the feasibility of red blood cell (RBC) antigen typing from whole-genome sequencing (WGS) data. We analyzed whole-genome data from 79 individuals for HLA-DRB1 and 9 RBC antigens. Whole-genome sequencing data were analyzed with software allowing phasing of variable positions to define alleles or haplotypes and validated for HLA typing from next-generation sequencing data. A dedicated database was set up with 1648 variable positions analyzed in KEL (KEL), ACKR1 (FY), SLC14A1 (JK), ACHE (YT), ART4 (DO), AQP1 (CO), CD44 (IN), SLC4A1 (DI) and ICAM4 (LW). Whole-genome sequencing typing was compared to that previously obtained by amplicon-based monoallelic sequencing and by SNaPshot analysis. Whole-genome sequencing data were also explored for other alleles. Our results showed 93% concordance for blood group polymorphisms and 91% for HLA-DRB1. Incorrect typing and unresolved results confirm that WGS should be considered reliable only with read depths strictly above 15x. Our results support that RBC antigen typing from WGS is feasible but requires greater read depth for accurate typing of SNV polymorphisms. We also showed the potential for WGS in screening donors with rare blood antigens, such as weak JK alleles. The development of WGS analysis in immunogenetics laboratories would offer personalized care in the management of RBC disorders.
Project description:We present RNA sequencing data sets and their genome sequence assembly for dengue virus that was isolated from a patient with dengue hemorrhagic fever and serially propagated in Vero cells. RNA sequencing data obtained from the first, third, and fifth passages and their corresponding whole-genome sequences are provided in this work.
Project description:Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning 'unassigned' labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
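The "simple ensemble voting" mentioned in the description above could look roughly like the sketch below. The description does not specify how ties are handled, so the tie rule here (fall back to 'unassigned') is an assumption, as are the function name `majority_vote`, the tool names, and the toy labels.

```python
from collections import Counter


def majority_vote(predictions, unassigned="unassigned"):
    """Combine per-cell labels from several classifiers by majority vote.

    predictions: list of label lists, one list per tool, all equally long.
    A tie for the most frequent label yields `unassigned` (an assumed rule).
    """
    consensus = []
    for labels in zip(*predictions):  # iterate cell by cell across tools
        ranked = Counter(labels).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append(unassigned)  # no single winning label
        else:
            consensus.append(ranked[0][0])
    return consensus


# Hypothetical predictions for three cells from three tools.
tool_a = ["T cell", "B cell", "NK cell"]
tool_b = ["T cell", "B cell", "T cell"]
tool_c = ["T cell", "monocyte", "B cell"]
consensus = majority_vote([tool_a, tool_b, tool_c])
# consensus == ["T cell", "B cell", "unassigned"]
```

Returning 'unassigned' on a tie is a conservative choice. A weighted vote based on each tool's benchmark accuracy would be another reasonable option.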
Project description:Campylobacter jejuni is a foodborne pathogen and an important contributor to gastroenteritis in humans. C. jejuni readily forms biofilms, which may play a role in the transmission of the pathogen from animals to humans. Herein, we present RNA sequencing data investigating differential gene expression in biofilm and planktonic C. jejuni. These data provide insight into pathways which may be important to biofilm formation in this organism.
Project description:Objectives:Systemic lupus erythematosus (SLE) is a heterogeneous autoimmune disease that is difficult to treat. There is currently no optimal stratification of patients with SLE, and thus, responses to available treatments are unpredictable. Here, we developed a new stratification scheme for patients with SLE, based on the computational analysis of patients' whole-blood transcriptomes. Methods:We applied machine learning approaches to RNA-sequencing (RNA-seq) data sets to stratify patients with SLE into four distinct clusters based on their gene expression profiles. A meta-analysis on three recently published whole-blood RNA-seq data sets was carried out, and an additional similar data set of 30 patients with SLE and 29 healthy donors was incorporated in this study; a total of 161 patients with SLE and 57 healthy donors were analysed. Results:Examination of SLE clusters, as opposed to unstratified SLE patients, revealed underappreciated differences in the pattern of expression of disease-related genes relative to clinical presentation. Moreover, gene signatures correlated with flare activity were successfully identified. Conclusion:Given that SLE disease heterogeneity is a key challenge hindering the design of optimal clinical trials and the adequate management of patients, our approach opens a new possible avenue addressing this limitation via a greater understanding of SLE heterogeneity in humans. Stratification of patients based on gene expression signatures may be a valuable strategy allowing the identification of separate molecular mechanisms underpinning disease in SLE. Further, this approach may have a use in understanding the variability in responsiveness to therapeutics, thereby improving the design of clinical trials and advancing personalised therapy.
Project description:Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically-driven, genes not expressing RNA at the time of measurement, or technically-driven, genes expressing RNA, but not at a sufficient level to be detected by sequencing technology. Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem.
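The per-cell proportion of zero expression values discussed in the description above can be computed as in this minimal sketch. The function name `zero_fraction_per_cell` and the toy matrix are assumptions for illustration and do not come from the study.

```python
def zero_fraction_per_cell(matrix):
    """Return, for each cell, the fraction of genes with a zero count.

    matrix: list of cells, each a non-empty list of per-gene counts.
    """
    return [sum(1 for v in cell if v == 0) / len(cell) for cell in matrix]


# Hypothetical toy matrix: two cells measured over four genes.
expression = [
    [0, 0, 5, 1],  # two of four genes report zero
    [3, 0, 2, 7],  # one of four genes reports zero
]
zero_fractions = zero_fraction_per_cell(expression)
# zero_fractions == [0.5, 0.25]
```

Comparing these fractions across batches is one simple way to check whether cell-to-cell differences in zeros track technical groupings. The calculation alone cannot distinguish biological zeros from technical ones.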
Project description:BACKGROUND: Whole Exome Sequencing (WES) is one of the most used and cost-effective next generation technologies that allows sequencing of all nuclear exons. Off-target regions may be captured if they present high sequence similarity with baits. Bioinformatics tools have been optimized to retrieve a large amount of WES off-target mitochondrial DNA (mtDNA) by exploiting the non-specificity of probes that partially overlap Nuclear mitochondrial Sequences (NumtS). The 1000 Genomes project represents one of the widest resources to extract mtDNA sequences from WES data, considering the large effort the scientific community is undertaking to reconstruct human population history using mtDNA as a marker, and the involvement of mtDNA in pathology. RESULTS: A previously published pipeline for assembling mitochondrial genomes from off-target WES reads, further improved to detect insertions and deletions (indels) and heteroplasmy, was applied to a dataset of 1242 samples from the 1000 Genomes project and yielded a nearly complete mitochondrial genome from 943 samples (76% of analyzed exomes). The robustness of our computational strategy was highlighted by the reduction, due to NumtS filtering, in the number of reads recognized as mitochondrial in the original annotation produced by the Consortium. CONCLUSIONS: To the best of our knowledge, this is likely the most extended population-scale mitochondrial genotyping in humans enriched with the estimation of heteroplasmies.