Project description:Transcription factors read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods. Yet, while Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq datasets grow exponentially, suboptimal motif scanning is commonly used for TFBS prediction from ATAC-seq. Here, we present “maxATAC”, a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the largest collection of state-of-the-art TFBS models to date. maxATAC performance extends to primary cells and single-cell ATAC-seq, enabling state-of-the-art TFBS prediction in vivo. We demonstrate maxATAC’s capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci.
Project description:Pattern discovery algorithms are methods for discovering recurrent, non-random motifs widely used in the analysis of biological sequences. Many algorithms exist but few comparisons have been made amongst them. We systematically profile eight representative methods at multiple parameter settings across 174 diverse experimental datasets, including ten novel ChIP-on-chip datasets. We executed 16,777 pattern discovery analyses to assess prediction accuracy, CPU usage and memory consumption. For 144 datasets we developed a gold-standard using machine-learning algorithms; cross-validation was used for the remaining datasets. Performance was highly disparate, with median accuracy ranging from 32% to 96%. Importantly we were unable to replicate previously reported algorithm-rankings, emphasizing the need to use many and diverse experimental datasets. We found deterministic algorithms like Projection and Oligo/Dyad had the highest prediction accuracy. Computational efficiency was not linearly related to dataset size and becomes critical: some algorithms are intractably slow on large datasets. This work provides the first combined assessment of the CPU, memory, and prediction accuracies of pattern discovery algorithms on real experimental datasets.
2009-11-24 | GSE15370 | GEO
Project description:Performance of four modern whole genome amplification methods for copy number variant detection in single cells
Project description:Pattern discovery algorithms are methods for discovering recurrent, non-random motifs widely used in the analysis of biological sequences. Many algorithms exist but few comparisons have been made amongst them. We systematically profile eight representative methods at multiple parameter settings across 174 diverse experimental datasets, including ten novel ChIP-on-chip datasets. We executed 16,777 pattern discovery analyses to assess prediction accuracy, CPU usage and memory consumption. For 144 datasets we developed a gold-standard using machine-learning algorithms; cross-validation was used for the remaining datasets. Performance was highly disparate, with median accuracy ranging from 32% to 96%. Importantly we were unable to replicate previously reported algorithm-rankings, emphasizing the need to use many and diverse experimental datasets. We found deterministic algorithms like Projection and Oligo/Dyad had the highest prediction accuracy. Computational efficiency was not linearly related to dataset size and becomes critical: some algorithms are intractably slow on large datasets. This work provides the first combined assessment of the CPU, memory, and prediction accuracies of pattern discovery algorithms on real experimental datasets. HL60-Mnt-ChIP: ChIP-Chip with 10 biological replicates; HL60-Trrap-ChIP: ChIP-Chip with 13 biological replicates.
Project description:Synthetic lethality (SL) has shown great promise for the discovery of novel targets in cancer. CRISPR double-knockout (CDKO) technologies can screen only several hundred genes and their combinations, not the whole genome. Good SL prediction models are therefore needed to select genes and gene pairs for CDKO experiments. In this paper, we develop a novel multi-layer encoder for individual sample-specific SL prediction (MLEC-iSL). Unlike existing SL prediction models, MLEC-iSL is built to predict SL connectivity first. Because SL connectivity is scalable from existing genes in the training data to new genes in validation data, we hypothesize that MLEC-iSL has better SL prediction performance. MLEC-iSL has three encoders: a gene encoder, a graph encoder, and a transformer encoder. MLEC-iSL performs well in K562 (AUPR, 0.73; AUC, 0.72) and Jurkat (AUPR, 0.73; AUC, 0.71) cells, while no existing method exceeds 0.62 AUPR or AUC in either cell line. An MLEC-iSL-guided CDKO experiment in 22Rv1 cells yielded a 46.8% SL ratio among its selected gene pairs. Six of the top ten SL connectivity hub genes were validated in 22Rv1 cells. It reveals SL gene pairs and a dependency between the apoptosis and mitosis cell death pathways.
Project description:During the current SARS-CoV-2 pandemic, a variety of mutations have accumulated in the viral genome, and four variants of concern (VOCs) are currently considered hazardous to human society. The newly emerging VOC, the B.1.617.2/Delta variant, is closely associated with a huge COVID-19 surge in India in spring 2021. However, its virological properties remain unclear. Here, we show that the B.1.617.2/Delta variant is highly fusogenic and, notably, more pathogenic than prototypic SARS-CoV-2 in infected hamsters. The P681R mutation in the spike protein, which is highly conserved in this lineage, facilitates spike protein cleavage and enhances viral fusogenicity. Moreover, we demonstrate that the P681R-bearing virus exhibits higher pathogenicity than the parental virus. Our data suggest that the P681R mutation is a hallmark of the virological phenotype of the B.1.617.2/Delta variant and is closely associated with enhanced pathogenicity.
Project description:The functional consequences of missense variants in disease genes are difficult to predict. We assessed whether gene expression profiles could distinguish between BRCA1 or BRCA2 pathogenic truncating and missense mutation carriers and familial breast cancer cases whose disease was not attributable to BRCA1 or BRCA2 mutations (BRCAX cases). 72 cell lines from affected women in high-risk breast-ovarian families were assayed after exposure to ionising radiation, including 23 BRCA1 carriers, 22 BRCA2 carriers, and 27 BRCAX individuals. A subset of 10 BRCAX individuals carried rare BRCA1/2 sequence variants considered to be of low clinical significance (LCS). BRCA1 and BRCA2 mutation carriers had similar expression profiles, with some subclustering of missense mutation carriers. The majority of BRCAX individuals formed a distinct cluster, but BRCAX individuals with LCS variants had expression profiles similar to BRCA1/2 mutation carriers. A Gaussian Process Classifier predicted BRCA1, BRCA2 and BRCAX status with a maximum of 62% accuracy, and prediction accuracy decreased with the inclusion of BRCAX samples carrying an LCS variant and with the inclusion of pathogenic missense carriers. Similarly, prediction of mutation status with gene lists derived using Support Vector Machines was good for BRCAX samples without an LCS variant (82-94%), poor for BRCAX samples with an LCS variant (40-50%), and improved for pathogenic BRCA1/2 mutation carriers when the gene list used for prediction was appropriate to the mutation effect being tested (71-100%). This study indicates that mutation effect, and the presence of rare variants possibly associated with a low risk of cancer, must be considered in the development of array-based assays of variant pathogenicity. Keywords: cell type comparison, stress response
Project description:Gene expression in the Streptomyces turgidiscabies pathogenicity island was studied after transferring cells into the thaxtomin A-inducing medium OBB, using an Agilent 60-mer oligonucleotide array with probes designed for S. scabies and S. turgidiscabies. The gene expression study was used to confirm the results of gene prediction.
Project description:Gene expression in the Streptomyces turgidiscabies pathogenicity island was studied after transferring cells into the thaxtomin A-inducing medium OBB, using an Agilent 60-mer oligonucleotide array with probes designed for S. scabies and S. turgidiscabies. The gene expression study was used to confirm the results of gene prediction. Time-course gene expression experiment with two replicates.