Project description: Motivation: The rapid accumulation of both sequence and phenotype data generated by high-throughput methods has increased the need to store and analyze data on distributed storage and computing systems. Efficient data management across these heterogeneous systems requires a workflow management system that simplifies analysis through automation and makes large-scale bioinformatics analyses accessible and reproducible. Results: We developed SciApps, a web-based platform for reproducible bioinformatics workflows. The platform is designed to automate the execution of modular Agave apps and to support execution of workflows on local clusters or in the cloud. Two workflows, one for association and one for annotation, are provided as exemplar scientific use cases. Availability and implementation: https://www.sciapps.org. Supplementary information: Supplementary data are available at Bioinformatics online.
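As a rough illustration of the execution model SciApps automates, the sketch below submits a single workflow step as an Agave-style job over REST. The tenant URL, token, app id, and input path are placeholders, and the job schema shown is an assumption; the Agave (Tapis) documentation defines the real API.

```python
# Minimal sketch: submitting one workflow step as an Agave-style job.
# All identifiers below are placeholders, not SciApps' actual values.
import requests

job = {
    "name": "annotation-step-1",
    "appId": "example-annotation-app-1.0",   # hypothetical modular app
    "inputs": {"query": "agave://storage/example/genes.fasta"},
    "parameters": {},
    "archive": True,  # keep outputs so a later step can consume them
}
resp = requests.post(
    "https://example-agave-tenant.org/jobs/v2",  # placeholder tenant URL
    headers={"Authorization": "Bearer <token>"},
    json=job,
)
print(resp.status_code)
```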
Project description: A key aspect of neuroscience research is the development of powerful, general-purpose data analyses that process large datasets. Unfortunately, modern data analyses have a hidden dependence upon complex computing infrastructure (e.g., software and hardware), which acts as an unaddressed deterrent to analysis users. Although existing analyses are increasingly shared as open-source software, the infrastructure and knowledge needed to deploy these analyses efficiently still pose significant barriers to use. In this work, we develop Neuroscience Cloud Analysis As a Service (NeuroCAAS): a fully automated open-source analysis platform offering automatic infrastructure reproducibility for any data analysis. We show how NeuroCAAS supports the design of simpler, more powerful data analyses and that many popular data analysis tools offered through NeuroCAAS outperform counterparts on typical infrastructure. Pairing rigorous infrastructure management with cloud resources, NeuroCAAS dramatically accelerates the dissemination and use of new data analyses for neuroscientific discovery.
Project description: Translation is a key regulatory step linking the transcriptome and the proteome. Two major methods for translatome investigation are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built TranslatomeDB (http://www.translatomedb.net/), a comprehensive database that collects and provides integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq and 10 RNC-seq datasets, together with their 1394 corresponding mRNA-seq datasets, in 13 species. The database emphasizes analysis functions in addition to dataset collection. Differential gene expression (DGE) analysis can be performed between any two datasets of the same species and type, at both the transcriptome and translatome levels. Translation indices, namely the translation ratio, elongation velocity index, and translational efficiency, can be calculated to quantitatively evaluate translation initiation efficiency and elongation velocity. All datasets were analyzed using a unified, robust, accurate, and experimentally verifiable pipeline based on the FANSe3 mapping algorithm, with edgeR for DGE analyses. TranslatomeDB also allows users to upload their own datasets and analyze them with the identical unified pipeline. We believe TranslatomeDB is a comprehensive platform and knowledgebase for translatome and proteome research, freeing biologists from the complex searching, analysis, and comparison of huge sequencing datasets without requiring local computational power.
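To make the translation indices concrete, here is a minimal sketch of one of them, translational efficiency, computed as the ratio of ribosome footprint abundance to mRNA abundance. The RPKM-based definition and the input dictionaries are illustrative assumptions; TranslatomeDB's own pipeline derives these values from FANSe3 mappings.

```python
# Minimal sketch: per-gene translational efficiency (TE) from
# ribosome-profiling and mRNA-seq counts. Gene names, counts, and the
# RPKM-based definition are assumptions for illustration only.

def rpkm(counts, gene_lengths_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return {g: counts[g] * 1e9 / (gene_lengths_bp[g] * total_mapped_reads)
            for g in counts}

def translational_efficiency(ribo_counts, rna_counts, lengths,
                             ribo_total, rna_total):
    ribo_rpkm = rpkm(ribo_counts, lengths, ribo_total)
    rna_rpkm = rpkm(rna_counts, lengths, rna_total)
    # TE > 1 suggests the transcript is translated relatively efficiently.
    return {g: ribo_rpkm[g] / rna_rpkm[g]
            for g in ribo_rpkm if rna_rpkm.get(g, 0) > 0}

lengths = {"geneA": 1500, "geneB": 900}
ribo = {"geneA": 300, "geneB": 30}
rna = {"geneA": 500, "geneB": 400}
print(translational_efficiency(ribo, rna, lengths,
                               ribo_total=1e6, rna_total=2e6))
# geneA TE = 1.2 (efficiently translated), geneB TE = 0.15
```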
Project description: Quantifying changes in DNA and RNA levels is essential in numerous molecular biology protocols. Quantitative real-time PCR (qPCR) techniques have become commonplace; however, data analysis still includes many time-consuming and cumbersome steps that can lead to mistakes and misinterpretation of data. To address these bottlenecks, we have developed an open-source Python software package that automates the processing of result spreadsheets from qPCR machines, performing calculations usually done manually. Auto-qPCR is a tool that saves time when processing qPCR data and helps ensure the reproducibility of qPCR experiment analyses. Our web-based app (https://auto-q-pcr.com/) is easy to use and requires no programming knowledge or software installation. Using Auto-qPCR, we provide examples of data treatment, display, and statistical analysis for four data processing modes within one program: (1) DNA quantification to identify genomic deletion or duplication events; (2) assessment of gene expression levels using an absolute model; and relative quantification (3) with or (4) without a reference sample. Our open-access Auto-qPCR software removes the need for manual data analysis and provides a more systematic workflow, minimizing the risk of errors. The program constitutes a new tool that can be incorporated into bioinformatics and molecular biology pipelines in clinical and research labs.
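As an example of relative quantification with a reference sample (mode 3 above), the sketch below implements the standard delta-delta Ct calculation that tools like Auto-qPCR automate; the Ct values and gene names are hypothetical, and Auto-qPCR's own spreadsheet handling differs.

```python
# Minimal sketch of the delta-delta Ct method for relative quantification
# with a reference sample. All numbers below are hypothetical examples.

def delta_delta_ct(ct_target_test, ct_housekeeping_test,
                   ct_target_ref, ct_housekeeping_ref):
    """Fold change of a target gene in a test sample relative to a
    reference sample, normalized to a housekeeping gene. Assumes ~100%
    PCR efficiency, i.e. one doubling per cycle."""
    d_ct_test = ct_target_test - ct_housekeeping_test
    d_ct_ref = ct_target_ref - ct_housekeeping_ref
    dd_ct = d_ct_test - d_ct_ref
    return 2 ** (-dd_ct)

# Example: target Ct 24.1 vs GAPDH 18.0 in treated cells, target Ct 26.0
# vs GAPDH 18.2 in control -> ~3.2-fold up-regulation.
print(delta_delta_ct(24.1, 18.0, 26.0, 18.2))
```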
Project description: We present Knowledge Engine for Genomics (KnowEnG), a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis, and expression signature analysis. The system specializes in "knowledge-guided" data mining and machine learning algorithms, in which user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge bases and encoded in a massive "Knowledge Network." KnowEnG adheres to "FAIR" principles (findable, accessible, interoperable, and reusable): its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution, and are interoperable with other computing platforms. The analysis tools are made available through multiple access modes, including a web portal with specialized visualization modules. We demonstrate the KnowEnG system's potential value in democratization of advanced tools for the modern genomics era through several case studies that use its tools to recreate and expand upon the published analysis of cancer data sets.
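To illustrate what "knowledge-guided" analysis can look like, the sketch below diffuses user-provided gene scores over a toy prior-knowledge network using a random walk with restart. The network, the scores, and the choice of this particular propagation scheme are assumptions for illustration, not KnowEnG's exact algorithms.

```python
# Illustrative sketch of knowledge-guided gene prioritization via network
# propagation (random walk with restart) over a toy prior-knowledge graph.

def random_walk_with_restart(adj, seed_scores, restart=0.5,
                             iters=100, tol=1e-9):
    """Diffuse user-provided gene scores over a prior-knowledge network.
    adj: {gene: [neighbor, ...]}; seed_scores: {gene: score}."""
    genes = list(adj)
    p = {g: seed_scores.get(g, 0.0) for g in genes}
    for _ in range(iters):
        nxt = {}
        for g in genes:
            # Each neighbor passes on its score split among its own edges.
            spread = sum(p[n] / len(adj[n]) for n in adj[g] if adj[n])
            nxt[g] = (1 - restart) * spread + restart * seed_scores.get(g, 0.0)
        converged = max(abs(nxt[g] - p[g]) for g in genes) < tol
        p = nxt
        if converged:
            break
    return sorted(p.items(), key=lambda kv: -kv[1])  # ranked gene list

toy_net = {"TP53": ["MDM2", "ATM"], "MDM2": ["TP53"],
           "ATM": ["TP53", "BRCA1"], "BRCA1": ["ATM"]}
print(random_walk_with_restart(toy_net, {"TP53": 1.0}))
```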
Project description: BACKGROUND: Bisulfite sequencing allows base-pair-resolution profiling of DNA methylation and has recently been adapted for use in single cells. Analyzing these data, including making comparisons with existing data, remains challenging due to the scale of the data and differences in preprocessing methods between published datasets. RESULTS: We present a set of preprocessing pipelines for bisulfite sequencing DNA methylation data that include a new R/Bioconductor package, scmeth, for a series of efficient QC analyses of large datasets. The pipelines go from raw data to CpG-level methylation estimates and can be run, with identical results, on a single computer, on an HPC cluster, or on Google Cloud Compute resources. These pipelines are designed to allow users to 1) ensure reproducibility of analyses, 2) achieve scalability to large whole-genome datasets with 100 GB+ of raw data per sample and to single-cell datasets with thousands of cells, 3) integrate and compare user-provided data with publicly available data, since all samples can be processed through the same pipeline, and 4) access best-practice analysis pipelines. Pipelines are provided for whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS), and hybrid selection (capture) bisulfite sequencing (HSBS). CONCLUSIONS: The workflows produce data quality metrics, visualization tracks, and aggregated output for further downstream analysis. Optional use of cloud computing resources facilitates analysis of large datasets and integration with existing methylome profiles. The workflow design principles are applicable to other genomic data types.
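For intuition about the pipelines' core output, here is a minimal sketch of a CpG-level methylation estimate: the fraction of methylated reads among reads covering each CpG, with a simple coverage filter. The input structure and threshold are illustrative assumptions; the real pipelines compute these values from aligned bisulfite reads, with QC handled by scmeth in R.

```python
# Minimal sketch of per-CpG methylation estimates (beta values) from
# methylated/unmethylated read counts. Inputs are hypothetical.

def methylation_betas(cpg_counts, min_coverage=5):
    """cpg_counts: {(chrom, pos): (methylated_reads, unmethylated_reads)}.
    Returns beta = methylated / total for CpGs with adequate coverage."""
    betas = {}
    for site, (meth, unmeth) in cpg_counts.items():
        total = meth + unmeth
        if total >= min_coverage:  # skip poorly covered sites
            betas[site] = meth / total
    return betas

print(methylation_betas({("chr1", 10469): (8, 2), ("chr1", 10471): (1, 1)}))
# -> {('chr1', 10469): 0.8}; the second site is filtered at coverage < 5
```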
Project description: Crowd-sourced challenges are achieving broad acceptance for addressing many biomedical questions and enabling tool assessment. But ensuring that the evaluated methods are reproducible and reusable is complicated by the diversity of software architectures, input and output file formats, and computing environments. To mitigate these problems, some challenges have leveraged new virtualization and compute methods, requiring participants to submit cloud-ready software packages. We review recent data challenges with innovative approaches to model reproducibility and data sharing, and outline key lessons for improving quantitative biomedical data analysis through crowd-sourced benchmarking challenges.
Project description: Personalized multi-peptide vaccines are currently being discussed intensively for tumor immunotherapy. To find epitopes (short, immunogenic peptides) suitable for eliciting an immune response, human leukocyte antigen (HLA)-presented peptides from cancer tissue samples are purified by immunoaffinity purification and analyzed by high-performance liquid chromatography coupled to mass spectrometry. Here we report a novel computational pipeline for identifying peptides from large-scale immunopeptidomics raw data sets. In the conducted experiments, we benchmarked our workflow against other existing mass spectrometry analysis software and achieved higher sensitivity. A dataset of 38 HLA immunopeptidomics raw files from peripheral blood mononuclear cells (PBMCs) of 10 healthy volunteers and 4 JY cell lines was used to assess the performance of the pipeline at each processing step. In addition, 66 isotope-labeled known HLA-presented peptides were spiked into the JY cell extracts in amounts decreasing in log10 steps from 100 fmol to 0.1 fmol.
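A spike-in dilution series like this is typically used to check a pipeline's quantitative response. The sketch below fits a log-log slope across the 100 to 0.1 fmol levels; a slope near 1 indicates a linear response over the tested range. The intensity values are hypothetical, not data from the study.

```python
# Illustrative sketch: assessing log-linear response of spiked-in
# peptides across a 100 -> 0.1 fmol dilution series.
import math

spike_fmol = [100.0, 10.0, 1.0, 0.1]        # log10 steps, as in the dataset
intensities = [8.1e7, 7.9e6, 8.4e5, 9.0e4]  # hypothetical MS peak areas

# Least-squares fit of log10(intensity) vs log10(amount).
xs = [math.log10(a) for a in spike_fmol]
ys = [math.log10(i) for i in intensities]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"log-log slope: {slope:.3f}")  # ~1.0 for this toy data: linear response
```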
Project description: Automated quantitative image analysis is essential across the life sciences. Although several software programs and algorithms have been developed for bioimage processing, using them requires advanced knowledge of image processing techniques and high-performance computing resources. Hence, we developed a cloud-based image analysis platform called IMACEL, which comprises morphological analysis and machine-learning-based image classification. The click-based user interface of IMACEL's morphological analysis platform enables researchers with limited resources to evaluate particles rapidly and quantitatively without prior knowledge of image processing. Because all image processing and machine learning algorithms run on high-performance virtual machines, users can access the same analytical environment from anywhere. A validation study of IMACEL's morphological analysis and image classification was performed. The results indicate that the platform is an accessible and potentially powerful tool for the quantitative evaluation of bioimages that will lower the barriers to life science research.
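As a rough sketch of the kind of morphological particle analysis IMACEL automates, the example below thresholds an image, labels connected particles, and reports per-particle shape metrics. The use of scikit-image on a synthetic image is an assumption for illustration; IMACEL runs its own algorithms on cloud virtual machines.

```python
# Illustrative sketch of morphological particle analysis: threshold,
# label connected components, and extract per-particle measurements.
import numpy as np
from skimage import measure

# Synthetic grayscale image with two bright "particles".
img = np.zeros((64, 64))
img[10:20, 10:22] = 1.0   # rectangular particle
img[40:52, 40:48] = 1.0   # second particle

mask = img > 0.5                  # simple global threshold
labels = measure.label(mask)      # connected-component labeling
for region in measure.regionprops(labels):
    print(f"particle {region.label}: area={region.area}, "
          f"eccentricity={region.eccentricity:.2f}")
```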
Project description: The growth in bioinformatics resources, such as tools/scripts and databases, poses a great challenge for users seeking to construct interactive and reproducible biological data analysis applications. Here we propose BioInstaller, an open-source, comprehensive, flexible R package consisting of R functions, a Shiny application, HTTP representational state transfer (REST) application programming interfaces, and a Docker image. BioInstaller can be used to collect, manage, and share various types of bioinformatics resources and to perform interactive and reproducible data analyses via the extensible Shiny application, backed by Tom's Obvious, Minimal Language (TOML) and SQLite format databases. The source code of BioInstaller is freely available at our lab website (http://bioinfo.rjh.com.cn/labs/jhuang/tools/bioinstaller), on GitHub (https://github.com/JhuangLab/BioInstaller), and on the Comprehensive R Archive Network (https://CRAN.R-project.org/package=BioInstaller). In addition, a Docker image can be downloaded from Docker Hub (https://hub.docker.com/r/bioinstaller/bioinstaller).