Dataset Information

A Pipeline for High-Throughput Concentration Response Modeling of Gene Expression for Toxicogenomics.

ABSTRACT: Cell-based assays are an attractive option to measure gene expression response to exposure, but the cost of whole-transcriptome RNA sequencing has been a barrier to the use of gene expression profiling for in vitro toxicity screening. In addition, standard RNA sequencing adds variability due to variable transcript length and amplification. Targeted probe-sequencing technologies such as TempO-Seq, with transcriptomic representation that can vary from hundreds of genes to the entire transcriptome, may reduce some components of variation. Analyses of high-throughput toxicogenomics data require renewed attention to read-calling algorithms and simplified dose-response modeling for datasets with relatively few samples. Using data from induced pluripotent stem cell-derived cardiomyocytes treated with chemicals at varying concentrations, we describe here and make available a pipeline for handling expression data generated by TempO-Seq to align reads, clean and normalize raw count data, identify differentially expressed genes, and calculate transcriptomic concentration-response points of departure. The methods are extensible to other forms of concentration-response gene-expression data, and we discuss the utility of the methods for assessing variation in susceptibility and the diseased cellular state.

SUBMITTER: House JS

PROVIDER: S-EPMC5672545 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

A Pipeline for High-Throughput Concentration Response Modeling of Gene Expression for Toxicogenomics.

House John S JS Grimm Fabian A FA Jima Dereje D DD Zhou Yi-Hui YH Rusyn Ivan I Wright Fred A FA

Frontiers in genetics 20171101

Cell-based assays are an attractive option to measure gene expression response to exposure, but the cost of whole-transcriptome RNA sequencing has been a barrier to the use of gene expression profiling for <i>in vitro</i> toxicity screening. In addition, standard RNA sequencing adds variability due to variable transcript length and amplification. Targeted probe-sequencing technologies such as TempO-Seq, with transcriptomic representation that can vary from hundreds of genes to the entire transcr ...[more]

PMID: 29163636

Similar Datasets

Project description:Efficient high-throughput transcriptomics (HTT) tools promise inexpensive, rapid assessment of possible biological consequences of human and environmental exposures to tens of thousands of chemicals in commerce. HTT systems have used relatively small sets of gene expression measurements coupled with mathematical prediction methods to estimate genome-wide gene expression and are often trained and validated using pharmaceutical compounds. It is unclear whether these training sets are suitable for general toxicity testing applications and the more diverse chemical space represented by commercial chemicals and environmental contaminants. In this work, we built predictive computational models that inferred whole genome transcriptional profiles from a smaller sample of surrogate genes. The model was trained and validated using a large scale toxicogenomics database with gene expression data from exposure to heterogeneous chemicals from a wide range of classes (the Open TG-GATEs data base). The method of predictor selection was designed to allow high fidelity gene prediction from any pre-existing gene expression data set, regardless of animal species or data measurement platform. Predictive qualitative models were developed with this TG-GATES data that contained gene expression data of human primary hepatocytes with over 941 samples covering 158 compounds. A sequential forward search-based greedy algorithm, combining different fitting approaches and machine learning techniques, was used to find an optimal set of surrogate genes that predicted differential expression changes of the remaining genome. We then used pathway enrichment of up-regulated and down-regulated genes to assess the ability of a limited gene set to determine relevant patterns of tissue response. In addition, we compared prediction performance using the surrogate genes found from our greedy algorithm (referred to as the SV2000) with the landmark genes provided by existing technologies such as L1000 (Genometry) and S1500 (Tox21), finding better predictive performance for the SV2000. The ability of these predictive algorithms to predict pathway level responses is a positive step toward incorporating mode of action (MOA) analysis into the high throughput prioritization and testing of the large number of chemicals in need of safety evaluation.

Project description:BackgroundSingle nucleotide polymorphisms (SNPs) as defined here are single base sequence changes or short insertion/deletions between or within individuals of a given species. As a result of their abundance and the availability of high throughput analysis technologies SNP markers have begun to replace other traditional markers such as restriction fragment length polymorphisms (RFLPs), amplified fragment length polymorphisms (AFLPs) and simple sequence repeats (SSRs or microsatellite) markers for fine mapping and association studies in several species. For SNP discovery from chromatogram data, several bioinformatics programs have to be combined to generate an analysis pipeline. Results have to be stored in a relational database to facilitate interrogation through queries or to generate data for further analyses such as determination of linkage disequilibrium and identification of common haplotypes. Although these tasks are routinely performed by several groups, an integrated open source SNP discovery pipeline that can be easily adapted by new groups interested in SNP marker development is currently unavailable.ResultsWe developed SNP-PHAGE (SNP discovery Pipeline with additional features for identification of common haplotypes within a sequence tagged site (Haplotype Analysis) and GenBank (-dbSNP) submissions. This tool was applied for analyzing sequence traces from diverse soybean genotypes to discover over 10,000 SNPs. This package was developed on UNIX/Linux platform, written in Perl and uses a MySQL database. Scripts to generate a user-friendly web interface are also provided with common queries for preliminary data analysis. A machine learning tool developed by this group for increasing the efficiency of SNP discovery is integrated as a part of this package as an optional feature. The SNP-PHAGE package is being made available open source at http://bfgl.anri.barc.usda.gov/ML/snp-phage/.ConclusionSNP-PHAGE provides a bioinformatics solution for high throughput SNP discovery, identification of common haplotypes within an amplicon, and GenBank (dbSNP) submissions. SNP selection and visualization are aided through a user-friendly web interface. This tool is useful for analyzing sequence tagged sites (STSs) of genomic sequences, and this software can serve as a starting point for groups interested in developing SNP markers.

Dataset Information

A Pipeline for High-Throughput Concentration Response Modeling of Gene Expression for Toxicogenomics.

Publications

A Pipeline for High-Throughput Concentration Response Modeling of Gene Expression for Toxicogenomics.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets