Project description:The design of CRISPR gRNAs requires accurate on-target efficiency predictions, which demand high-quality gRNA activity data and efficient modeling. To advance, we here report on the generation of on-target gRNA activity data for 10,592 SpCas9 gRNAs. Integrating these with complementary published data, we train a deep learning model, CRISPRon, on 23,902 gRNAs. Compared to existing tools, CRISPRon exhibits significantly higher prediction performances on four test datasets not overlapping with training data used for the development of these tools. Furthermore, we present an interactive gRNA design webserver based on the CRISPRon standalone software, both available via https://rth.dk/resources/crispr/ . CRISPRon advances CRISPR applications by providing more accurate gRNA efficiency predictions than the existing tools.
Project description:Genome editing with CRISPR-Cas nucleases has been applied successfully to a wide range of cells and organisms. There is, however, considerable variation in the efficiency of cleavage and outcomes at different genomic targets, even within the same cell type. Some of this variability is likely due to the inherent quality of the interaction between the guide RNA and the target sequence, but some may also reflect the relative accessibility of the target. We investigated the influence of chromatin structure, particularly the presence or absence of nucleosomes, on cleavage by the Streptococcus pyogenes Cas9 protein. At multiple target sequences in two promoters in the yeast genome, we find that Cas9 cleavage is strongly inhibited when the DNA target is within a nucleosome. This inhibition is relieved when nucleosomes are depleted. Remarkably, the same is not true of zinc-finger nucleases (ZFNs), which cleave equally well at nucleosome-occupied and nucleosome-depleted sites. These results have implications for the choice of specific targets for genome editing, both in research and in clinical and other practical applications.
Project description:Cas9 is an RNA-guided DNA endonuclease that targets foreign DNA for destruction as part of a bacterial adaptive immune system mediated by clustered regularly interspaced short palindromic repeats (CRISPR). Together with single-guide RNAs, Cas9 also functions as a powerful genome engineering tool in plants and animals, and efforts are underway to increase the efficiency and specificity of DNA targeting for potential therapeutic applications. Studies of off-target effects have shown that DNA binding is far more promiscuous than DNA cleavage, yet the molecular cues that govern strand scission have not been elucidated. Here we show that the conformational state of the HNH nuclease domain directly controls DNA cleavage activity. Using intramolecular Förster resonance energy transfer experiments to detect relative orientations of the Cas9 catalytic domains when associated with on- and off-target DNA, we find that DNA cleavage efficiencies scale with the extent to which the HNH domain samples an activated conformation. We furthermore uncover a surprising mode of allosteric communication that ensures concerted firing of both Cas9 nuclease domains. Our results highlight a proofreading mechanism beyond initial protospacer adjacent motif (PAM) recognition and RNA-DNA base-pairing that serves as a final specificity checkpoint before DNA double-strand break formation.
Project description:CRISPR Cas-9 is a groundbreaking genome-editing tool that harnesses bacterial defense systems to alter DNA sequences accurately. This innovative technology holds vast promise in multiple domains like biotechnology, agriculture and medicine. However, such power does not come without its own peril, and one such issue is the potential for unintended modifications (Off-Target), which highlights the need for accurate prediction and mitigation strategies. Though previous studies have demonstrated improvement in Off-Target prediction capability with the application of deep learning, they often struggle with the precision-recall trade-off, limiting their effectiveness and do not provide proper interpretation of the complex decision-making process of their models. To address these limitations, we have thoroughly explored deep learning networks, particularly the recurrent neural network based models, leveraging their established success in handling sequence data. Furthermore, we have employed genetic algorithm for hyperparameter tuning to optimize these models' performance. The results from our experiments demonstrate significant performance improvement compared with the current state-of-the-art in Off-Target prediction, highlighting the efficacy of our approach. Furthermore, leveraging the power of the integrated gradient method, we make an effort to interpret our models resulting in a detailed analysis and understanding of the underlying factors that contribute to Off-Target predictions, in particular the presence of two sub-regions in the seed region of single guide RNA which extends the established biological hypothesis of Off-Target effects. To the best of our knowledge, our model can be considered as the first model combining high efficacy, interpretability and a desirable balance between precision and recall.
Project description:Transcriptome engineering applications in living cells with RNA-targeting CRISPR effectors depend on accurate prediction of on-target activity and off-target avoidance. Here, we design and test ~200,000 RfxCas13d guide RNAs targeting essential genes in human cells with systematically-designed mismatches, insertions and deletions (indels). We find that mismatches and indels have a position- and context-dependent impact on Cas13d activity, and mismatches that result in G:U wobble pairings are better tolerated than other single-base mismatches. Using this large-scale dataset, we train a convolutional neural network that we term TIGER (Targeted Inhibition of Gene Expression via gRNA design) to predict efficacy from guide sequence and context. TIGER outperforms existing models at predicting on- and off-target activity on our dataset and published datasets. We show that TIGER scoring combined with specific mismatches yields the first general framework to modulate transcript expression, enabling use of RNA-targeting CRISPRs to precisely control gene dosage.
Project description:MotivationCRISPR/Cas9 technology has been revolutionizing the field of gene editing in recent years. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs. Thus, computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have low correlation with functional and endogenous editing. Another difficulty arises from the fact that functional and endogenous editing efficiency is more difficult to measure, and as a result, functional and endogenous datasets are too small to train accurate machine-learning models on.ResultsWe developed DeepCRISTL, a deep-learning model to predict the on-target efficiency given a gRNA sequence. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA on-target editing efficiency, and then uses transfer learning (TL) to fine-tune the model and fit it to the functional and endogenous prediction task. We pre-trained the DeepCRISTL model on more than 150 000 gRNAs, produced through the DeepHF study as a high-throughput dataset of three Cas9 enzymes. We improved the DeepHF model by multi-task and ensemble techniques and achieved state-of-the-art results over each of the three enzymes: up to 0.89 in Spearman correlation between predicted and measured on-target efficiencies. To fine-tune model weights to predict on-target efficiency of functional or endogenous datasets, we tested several TL approaches, with gradual learning being the overall best performer, both when pre-trained on DeepHF and when pre-trained on CRISPROn, another high-throughput dataset. DeepCRISTL outperformed state-of-the-art methods on all functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by the model in each dataset. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging TL to utilize both high-throughput datasets, and smaller and more biologically relevant datasets, such as functional and endogenous datasets.Availability and implementationDeepCRISTL is available via github.com/OrensteinLab/DeepCRISTL.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:MotivationCRISPR/Cas9 technology has been revolutionizing the field of gene editing. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs and so computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have a low correlation with functional and endogenous datasets, which are too small to train accurate machine-learning models on.ResultsWe developed DeepCRISTL, a deep-learning model to predict the editing efficiency in a specific cellular context. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA editing efficiency and then fine-tunes the model on functional or endogenous data to fit a specific cellular context. We tested two state-of-the-art models trained on high-throughput datasets for editing efficiency prediction, our newly improved DeepHF and CRISPRon, combined with various transfer-learning approaches. The combination of CRISPRon and fine-tuning all model weights was the overall best performer. DeepCRISTL outperformed state-of-the-art methods in predicting editing efficiency in a specific cellular context on functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by DeepCRISTL across cellular contexts. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging transfer learning to utilize both high-throughput datasets and smaller and more biologically relevant datasets.Availability and implementationDeepCRISTL is available via https://github.com/OrensteinLab/DeepCRISTL.
Project description:The clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) system has become a successful and promising technology for gene-editing. To facilitate its effective application, various computational tools have been developed. These tools can assist researchers in the guide RNA (gRNA) design process by predicting cleavage efficiency and specificity and excluding undesirable targets. However, while many tools are available, assessment of their application scenarios and performance benchmarks are limited. Moreover, new deep learning tools have been explored lately for gRNA efficiency prediction, but have not been systematically evaluated. Here, we discuss the approaches that pertain to the on-target activity problem, focusing mainly on the features and computational methods they utilize. Furthermore, we evaluate these tools on independent datasets and give some suggestions for their usage. We conclude with some challenges and perspectives about future directions for CRISPR-Cas9 guide design.
Project description:The clustered regularly interspaced short palindromic repeat (CRISPR)-associated enzyme Cas9 is an RNA-guided nuclease that has been widely adapted for genome editing in eukaryotic cells. However, the in vivo target specificity of Cas9 is poorly understood and most studies rely on in silico predictions to define the potential off-target editing spectrum. Using chromatin immunoprecipitation followed by sequencing (ChIP-seq), we delineate the genome-wide binding panorama of catalytically inactive Cas9 directed by two different single guide (sg) RNAs targeting the Trp53 locus. Cas9:sgRNA complexes are able to load onto multiple sites with short seed regions adjacent to (5')NGG(3') protospacer adjacent motifs (PAM). Yet among 43 ChIP-seq sites harboring seed regions analyzed for mutational status, we find editing only at the intended on-target locus and one off-target site. In vitro analysis of target site recognition revealed that interactions between the 5' end of the guide and PAM-distal target sequences are necessary to efficiently engage Cas9 nucleolytic activity, providing an explanation for why off-target editing is significantly lower than expected from ChIP-seq data.
Project description:CRISPR-Cas9 system is one of the recent most used genome editing techniques. Despite having a high capacity to alter the precise target genes and genomic regions that the planned guide RNA (or sgRNA) complements, the off-target effect still exists. But there are already machine learning algorithms for people, animals, and a few plant species. In this paper, an effort has been made to create models based on three machine learning-based techniques [namely, artificial neural networks (ANN), support vector machines (SVM), and random forests (RF)] for the prediction of the CRISPR-Cas9 cleavage sites that will be cleaved by a particular sgRNA. The plant dataset was the sole source of inspiration for all of these machine learning-based algorithms. 70% of the on-target and off-target dataset of various plant species that was gathered was used to train the models. The remaining 30% of the data set was used to evaluate the model’s performance using a variety of evaluation metrics, including specificity, sensitivity, accuracy, precision, F1 score, F2 score, and AUC. Based on the aforementioned machine learning techniques, eleven models in all were developed. Comparative analysis of these produced models suggests that the model based on the random forest technique performs better. The accuracy of the Random Forest model is 96.27%, while the AUC value was found to be 99.21%. The SVM-Linear, SVM-Polynomial, SVM-Gaussian, and SVM-Sigmoid models were trained, making a total of six ANN-based models (ANN1-Logistic, ANN1-Tanh, ANN1-ReLU, ANN2-Logistic, ANN2-Tanh, and ANN-ReLU) and Support Vector Machine models (SVM-Linear, SVM-Polynomial, SVM-Gaussian However, the overall performance of Random Forest is better among all other ML techniques. ANN1-ReLU and SVM-Linear model performance were shown to be better among Artificial Neural Network and Support Vector Machine-based models, respectively.