Project description:Single-cell RNA sequencing is essential for investigating cellular heterogeneity and highlighting cell subpopulation-specific signatures. Single-cell sequencing applications have spread from conventional RNA sequencing to epigenomics, e.g., ATAC-seq. Many related algorithms and tools have been developed, but few computational workflows provide analysis flexibility while also achieving functional reproducibility (i.e., information about the data and the tools used is saved as metadata) and computational reproducibility (i.e., an image of the computational environment used to generate the data is stored) through a user-friendly environment. rCASC is a modular workflow providing an integrated analysis environment (from count generation to cell subpopulation identification) that exploits Docker containerization to achieve both functional and computational reproducibility in data analysis. rCASC provides preprocessing tools to remove low-quality cells and/or specific biases, e.g., cell cycle effects. Subpopulation discovery can then be achieved using different clustering techniques based on different distance metrics. Cluster quality is estimated through the new metric "cell stability score" (CSS), which describes the stability of a cell in a cluster under a perturbation induced by removing a random set of cells from the cell population; CSS provides better cluster robustness information than the silhouette metric. Moreover, rCASC's tools can identify cluster-specific gene signatures. In summary, rCASC is a modular workflow with new features that can help researchers define cell subpopulations and detect subpopulation-specific markers. It uses Docker for ease of installation and for computationally reproducible analysis, and a Java GUI makes the workflow accessible to users without computational skills in R.
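The perturbation idea behind the cell stability score admits a compact sketch: repeatedly drop a random subset of cells, re-cluster the remainder, and score each cell by how often it stays together with its original cluster companions. The snippet below is an illustrative Python sketch under those assumptions, not rCASC's actual implementation (rCASC is R- and Docker-based); the majority threshold and parameter defaults are hypothetical.

```python
import random
from collections import defaultdict

def cell_stability_score(cells, cluster_fn, base_labels,
                         n_perturb=100, drop_frac=0.1, seed=0):
    """Score each cell by how often the majority of its original cluster
    companions still share its label after a random subset of cells is
    removed and the remainder re-clustered.
    cluster_fn(subset) must return {cell: label} for the given cells."""
    rng = random.Random(seed)
    companions = {c: [m for m in cells
                      if m != c and base_labels[m] == base_labels[c]]
                  for c in cells}
    hits, seen = defaultdict(int), defaultdict(int)
    for _ in range(n_perturb):
        kept = rng.sample(list(cells), int(len(cells) * (1 - drop_frac)))
        labels = cluster_fn(kept)          # re-cluster the surviving cells
        kept_set = set(kept)
        for c in kept:
            mates = [m for m in companions[c] if m in kept_set]
            if not mates:
                continue
            seen[c] += 1
            same = sum(labels[m] == labels[c] for m in mates)
            hits[c] += same / len(mates) >= 0.5   # majority stayed together
    return {c: hits[c] / seen[c] for c in seen}
```

A perfectly stable clustering yields a score of 1.0 for every cell; unstable cells drift toward 0.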
Project description:The exponential rise of metagenomic sequencing is delivering massive functional environmental genomics data. However, this also creates a procedural bottleneck for ongoing re-analysis: as reference databases grow and methods improve, analyses need to be updated for consistency, which requires access to increasingly demanding bioinformatic and computational resources. Here, we present the KAUST Metagenomic Analysis Platform (KMAP), a new integrated, open, web-based tool for the comprehensive exploration of shotgun metagenomic data. We illustrate KMAP's capacities through the re-assembly of ~27,000 public metagenomic samples captured in ~450 studies sampled across ~77 diverse habitats. A small subset of these metagenomic assemblies is used in this pilot study, grouped into 36 new habitat-specific gene catalogs, all based on full-length (complete) genes. Extensive taxonomic and gene annotations are stored in Gene Information Tables (GITs), a simple, tractable data-integration format useful for command-line analysis or database management. The KMAP pilot study enables the exploration and comparison of microbial GITs across different habitats, covering over 275 million genes. KMAP access to data and analyses is available at https://www.cbrc.kaust.edu.sa/aamg/kmap.start .
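A flat, tab-separated table with one row per full-length gene is the kind of layout that makes GITs tractable from the command line or a script. The sketch below invents a plausible GIT-like layout for illustration; the column names and values are assumptions, not KMAP's actual schema.

```python
import csv
import io

# Hypothetical GIT layout: one row per full-length gene, tab-separated.
GIT = """gene_id\thabitat\ttaxon\tko\tlength
g001\tseawater\tProchlorococcus\tK02703\t987
g002\tseawater\tPelagibacter\tK00001\t1203
g003\tsoil\tBradyrhizobium\tK02703\t1450
"""

def filter_git(text, **criteria):
    """Return the rows whose columns match all given column=value criteria."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

# All genes annotated with a given (hypothetical) KEGG ortholog:
photosystem_genes = filter_git(GIT, ko="K02703")
```

The same filtering maps directly onto `awk` or SQL once the table lives in a file or a database.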
Project description:Objective: Re-identification risk methods for biomedical data often assume a worst case, in which attackers know all identifiable features (eg, age and race) about a subject. Yet, worst-case adversarial modeling can overestimate risk and induce heavy editing of shared data. The objective of this study is to introduce a framework for assessing the risk considering the attacker's resources and capabilities. Materials and methods: We integrate 3 established risk measures (ie, prosecutor, journalist, and marketer risks) and compute re-identification probabilities for data subjects. This probability is dependent on an attacker's capabilities (eg, ability to obtain external identified resources) and the subject's decision on whether to reveal their participation in a dataset. We illustrate the framework through case studies using data from over 1 000 000 patients from Vanderbilt University Medical Center and show how re-identification risk changes when attackers are pragmatic and use 2 known resources for attack: (1) voter registration lists and (2) social media posts. Results: Our framework illustrates that the risk is substantially smaller in the pragmatic scenarios than in the worst case. Our experiments yield a median worst-case risk of 0.987 (where 0 is least risky and 1 is most risky); however, the median reduction in risk was 90.1% in the voter registration scenario and 100% in the social media posts scenario. Notably, these observations hold true for a wide range of adversarial capabilities. Conclusions: This research illustrates that re-identification risk is situationally dependent and that appropriate adversarial modeling may permit biomedical data sharing on a wider scale than is currently the case.
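The three established risk measures reduce to equivalence-class sizes over quasi-identifiers: prosecutor risk uses class sizes in the shared sample (the attacker knows the target is in the data), journalist risk uses class sizes in the wider population, and marketer risk is the expected fraction of records re-identified. Below is a minimal sketch of one common formulation, assuming records have already been reduced to quasi-identifier tuples; it is an illustration, not the study's actual code.

```python
from collections import Counter

def risks(sample, population):
    """sample/population: lists of quasi-identifier tuples, e.g. (age, race).
    Returns per-class prosecutor risk (1 / sample class size), per-class
    journalist risk (1 / population class size), and overall marketer risk
    (mean journalist risk across the sample's records)."""
    f = Counter(sample)        # equivalence-class sizes in the shared data
    F = Counter(population)    # equivalence-class sizes in the population
    prosecutor = {q: 1 / f[q] for q in f}
    journalist = {q: 1 / F[q] for q in f}
    marketer = sum(1 / F[q] for q in sample) / len(sample)
    return prosecutor, journalist, marketer
```

A pragmatic attacker effectively sees a smaller external `population` table (eg, only registered voters), which inflates class sizes and shrinks these risks relative to the worst case.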
Project description:Single cell transcriptomics has recently seen a surge in popularity, leading to the need for data analysis pipelines that are reproducible, modular, and interoperable across different systems and institutions. To meet this demand, we introduce scAN1.0, a processing pipeline for analyzing 10X single cell RNA sequencing data. scAN1.0 is built using the Nextflow DSL2 and can be run on most computational systems. The modular design of Nextflow pipelines enables easy integration and evaluation of different blocks for specific analysis steps. We demonstrate the usefulness of scAN1.0 by showing its ability to examine the impact of the mapping step during the analysis of two datasets: (i) a 10X scRNAseq of a human pituitary gonadotroph tumor dataset and (ii) a murine 10X scRNAseq acquired on CD8 T cells during an immune response.
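scAN1.0 itself is written in Nextflow DSL2, but the modular idea (each analysis step is a pluggable block, so alternative mappers can be benchmarked while everything downstream stays fixed) can be caricatured in a few lines of Python. All step names below are stand-ins, not scAN1.0's actual modules.

```python
def run_pipeline(reads, mapper, quantify, cluster):
    """Compose interchangeable analysis blocks, Nextflow-module style."""
    return cluster(quantify(mapper(reads)))

# Stand-in blocks; real ones would wrap e.g. STAR or kallisto.
map_strict  = lambda reads: [r for r in reads if "N" not in r]  # drop ambiguous
map_lenient = lambda reads: list(reads)                         # keep everything
quantify    = lambda aln: {r: aln.count(r) for r in set(aln)}
cluster     = lambda counts: sorted(counts.items())

reads = ["ACGT", "ACGT", "ANGT"]
# Swap only the mapping block to measure its downstream impact:
out = {name: run_pipeline(reads, m, quantify, cluster)
       for name, m in [("strict", map_strict), ("lenient", map_lenient)]}
```

Holding `quantify` and `cluster` fixed while varying `mapper` is exactly the kind of controlled comparison the modular design enables.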
Project description:Microfluidic cultivation devices that facilitate O2 control enable unique studies of the complex interplay between environmental O2 availability and microbial physiology at the single-cell level. In such devices, time-lapse microscopy is typically used to resolve microbial behavior with spatiotemporal single-cell resolution. Time-lapse imaging produces large image-data stacks that can be efficiently analyzed with deep-learning techniques, providing new insights into microbiology; this knowledge gain justifies the additional and often laborious microfluidic experiments. However, integrating on-chip O2 measurement and control into an already complex microfluidic cultivation, together with developing the necessary image-analysis tools, is a challenging endeavor. Here, a comprehensive experimental approach for spatiotemporal single-cell analysis of living microorganisms under controlled O2 availability is presented. To this end, a gas-permeable polydimethylsiloxane microfluidic cultivation chip and a low-cost 3D-printed mini-incubator were successfully used to control O2 availability inside microfluidic growth chambers during time-lapse microscopy. Dissolved O2 was monitored by imaging the fluorescence lifetime of the O2-sensitive dye RTDP using FLIM microscopy. The acquired image-data stacks, containing phase-contrast and fluorescence-intensity data, were analyzed using in-house-developed and open-source image-analysis tools. The dissolved O2 concentration could be dynamically controlled between 0% and 100%. The system was experimentally validated by cultivating and analyzing an E. coli strain expressing green fluorescent protein as an indirect intracellular oxygen indicator. The presented system enables innovative microbiological research on microorganisms and microbial ecology with single-cell resolution.
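Lifetime-based O2 sensing of this kind typically rests on the Stern-Volmer relation, τ0/τ = 1 + K_SV·[O2]: quenching by O2 shortens the measured lifetime τ relative to the unquenched lifetime τ0. The sketch below inverts that standard relation; the τ0 and K_SV values are placeholder calibration constants, not RTDP's published ones.

```python
def o2_from_lifetime(tau_ns, tau0_ns=600.0, ksv=0.0027):
    """Invert the Stern-Volmer relation tau0/tau = 1 + Ksv*[O2] to convert
    an O2-quenched fluorescence lifetime (FLIM readout) into a dissolved-O2
    value. tau0_ns (unquenched lifetime) and ksv are calibration constants
    whose values here are placeholders; [O2] comes out in the units implied
    by 1/ksv (e.g. % air saturation)."""
    return (tau0_ns / tau_ns - 1.0) / ksv
```

In practice both constants are obtained by calibrating the chip at known O2 levels (e.g. 0% and air-saturated) before an experiment.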
Project description:Spatial information is an essential component of a holistic view of gene expression mechanisms in tissues. The sequencing-based spatial transcriptomics approach enables spatial barcoding of the whole transcriptome of tissue sections using microarray glass slides. However, manual preparation of high-quality tissue sequencing libraries is time-consuming and subject to technical variability. Here, we present an automated adaptation of the 10x Genomics Visium library construction on the widely used Agilent Bravo Liquid Handling Platform. Compared to the manual Visium library preparation, our automated approach reduces hands-on time by over 80% and provides higher throughput and robustness. Our automated Visium library preparation protocol provides a new strategy to standardize spatially resolved transcriptomics analysis of tissues at scale.
Project description:Cells possess intrinsic markers, such as their mechanical and electrical properties, that can serve as distinguishing characteristics. Here, we present a microfluidic chip configured with two opposing optical fibers and four 3D electrodes for multiphysical parameter measurement. The chip leverages the optical fibers to capture and stretch a single cell and uses the 3D electrodes to rotate it. From the stretching deformation and the rotation spectrum, the cell's mechanical and dielectric properties can be extracted. We provide proof of concept by testing five cell types (HeLa, A549, HepaRG, MCF7 and MCF10A) and determining five biophysical parameters: shear modulus, steady-state viscosity, and relaxation time from the stretching deformation, and area-specific membrane capacitance and cytoplasm conductivity from the rotation spectra. We demonstrate the chip's potential in cancer research by observing subtle changes in the cellular properties of A549 cells undergoing transforming growth factor beta 1 (TGF-β1)-induced epithelial-mesenchymal transition (EMT). The new chip provides a microfluidic platform capable of multiparameter characterization of single cells, which can play an important role in single-cell research.
Project description:BACKGROUND:A lack of reproducibility has been repeatedly criticized in computational research. High-throughput sequencing (HTS) data analysis is a complex multi-step process. For most of the steps a range of bioinformatic tools is available, and for most tools numerous parameters must be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that, in our opinion, ensure a minimal degree of reproducible research for HTS data analysis. A number of workflow management systems are available for assisting complex multi-step data analyses; however, to the best of our knowledge, none of the currently available workflow management systems satisfies all four criteria for reproducible HTS analysis. RESULTS:Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for application to omics data but can easily be extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap. CONCLUSIONS:uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses.
Project description:A modern biomedical research project can easily contain hundreds of analysis steps, and a lack of reproducibility in these analyses has been recognized as a severe issue. While thorough documentation enables reproducibility, the number of analysis programs used can be so large that in practice reproducibility cannot be easily achieved. Literate programming is an approach to presenting computer programs to human readers: the code is rearranged to follow the logic of the program, that logic is explained in natural language, and the code executed by the computer is extracted from the literate source. As such, literate programming is an ideal formalism for systematizing analysis steps in biomedical research. We have developed the reproducible computing tool Lir (literate, reproducible computing), which allows a tool-agnostic approach to biomedical data analysis. We demonstrate the utility of Lir by applying it to a case study whose aim was to investigate the role of endosomal trafficking regulators in the progression of breast cancer. In this analysis, a variety of tools were combined to interpret the available data: a relational database, standard command-line tools, and a statistical computing environment. The analysis revealed that the lipid-transport-related genes LAPTM4B and NDRG1 are coamplified in breast cancer patients and identified genes potentially cooperating with LAPTM4B in breast cancer progression. Our case study demonstrates that, with Lir, an array of tools can be combined in the same data analysis to improve efficiency, reproducibility, and ease of understanding. Lir is open-source software available at github.com/borisvassilev/lir.
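The core literate-programming operation, extracting ("tangling") the executable code out of the prose, is easy to illustrate. The sketch below uses noweb-style `<<name>>=` / `@` chunk markers as a stand-in format; Lir's actual source syntax may differ.

```python
import re

def tangle(literate_src):
    """Toy 'tangle' step: pull out the code between '<<name>>=' and '@'
    markers, concatenated in document order, leaving the prose behind."""
    return "\n".join(re.findall(r"^<<[^>]*>>=\n(.*?)^@\s*$", literate_src,
                                flags=re.S | re.M))

DOC = """The analysis first loads the measurement.
<<load>>=
x = 2
@
Then we square it to get the final statistic.
<<analyze>>=
y = x * x
@
"""
```

Running `exec(tangle(DOC))` reproduces the computation exactly as documented, which is the point: the human-readable narrative and the executed code cannot drift apart.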
Project description:One of the first steps in understanding a protein's function is to determine its localization; however, the methods for localizing proteins in some systems have not kept pace with the developments in other fields, creating a bottleneck in the analysis of the large datasets that are generated in the post-genomic era. To address this, we developed tools for tagging proteins in trypanosomatids. We made a plasmid that, when coupled with long primer PCR, can be used to produce transgenes at their endogenous loci encoding proteins tagged at either terminus or within the protein coding sequence. This system can also be used to generate deletion mutants to investigate the function of different protein domains. We show that the length of homology required for successful integration precluded long primer PCR tagging in Leishmania mexicana. Hence, we developed plasmids and a fusion PCR approach to create gene tagging amplicons with sufficiently long homologous regions for targeted integration, suitable for use in trypanosomatids with less efficient homologous recombination than Trypanosoma brucei. Importantly, we have automated the primer design, developed universal PCR conditions and optimized the workflow to make this system reliable, efficient and scalable such that whole genome tagging is now an achievable goal.
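The automated primer design for C-terminal tagging boils down to a string operation: each long primer is a homology arm copied from the target locus fused to a constant sequence that anneals to the tagging cassette. The sketch below is a toy illustration of that idea; the arm length and annealing sequences are placeholders, not the published ones, and real designs must also check melting temperature and uniqueness.

```python
def tagging_primers(locus, stop_pos, arm=80,
                    anneal_fwd="GGTTCTGGTAGTGGTTCCGGT",
                    anneal_rev="CCAATTTGAGAGACCTGTGC"):
    """Long primers for C-terminal tagging.
    Forward primer: `arm` bases ending just before the stop codon, plus a
    constant cassette-annealing sequence. Reverse primer: reverse complement
    of the `arm` bases after the stop codon, plus its annealing sequence.
    stop_pos is the 0-based index where the stop codon starts in `locus`."""
    revcomp = lambda s: s.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    fwd = locus[stop_pos - arm:stop_pos] + anneal_fwd
    rev = revcomp(locus[stop_pos + 3:stop_pos + 3 + arm]) + anneal_rev
    return fwd, rev
```

Because the variable part is computed directly from the genome sequence, the design scales to every gene in the genome with no manual work, which is what makes whole-genome tagging feasible.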