Project description:Predicting the effects of mutations is a fundamental challenge in biotechnology and biomedicine. State-of-the-art computational methods have demonstrated the benefits of including semantically rich representations learned from protein sequences, but leave structural constraints out of reach. Here we developed Protein Mutational Effect Predictor (ProMEP), a general and multimodal deep representation learning method that simultaneously learns sequence context and structural constraints from proteins at the scale of evolution. ProMEP markedly outperforms current leading methods and enables accurate zero-shot prediction of mutational effects across a variety of deep mutational scanning experiments. The application of ProMEP to the engineering of the transposon-associated TnpB enzyme further demonstrates its capacity for high-throughput exploration of protein space. Without prior knowledge of TnpB, ProMEP accurately identifies, from millions of variants, multiple mutations that significantly improve editing efficiency.
Project description:RNA-guided endonucleases form the crux of diverse biological processes and technologies, including adaptive immunity, transposition, and genome editing. Some of these enzymes are components of insertion sequences (IS) in the IS200/IS605 and IS607 transposon families. Both IS families encode a TnpA transposase and TnpB nuclease, an RNA-guided enzyme ancestral to CRISPR-Cas12. In eukaryotes and their viruses, TnpB homologs occur as two distinct types, Fanzor1 and Fanzor2. We analyzed the evolutionary relationships between prokaryotic TnpBs and eukaryotic Fanzors, revealing that a clade of IS607 TnpBs with an unusual active site arrangement, found primarily in cyanobacteria, likely gave rise to both types of Fanzors. The widespread nature of Fanzors implies that the properties of this group of IS607 TnpBs were especially suited to adaptation and evolution in eukaryotes and their viruses. Biochemical analysis of a prokaryotic IS607 TnpB and virally encoded Fanzor1s revealed features that may have fostered co-evolution between TnpBs/Fanzors and their cognate transposases. These results provide insight into the evolutionary origins of a ubiquitous family of RNA-guided proteins that shows remarkable conservation across the three domains of life.
Project description:We report the development of a deep learning-based tool, DeepTFactor, that predicts whether a protein in question is a transcription factor (TF). DeepTFactor uses a convolutional neural network to extract features of protein sequences. We characterized the genome-wide binding sites of three TFs (i.e., YqhC, YiaU, and YahB), which were predicted by DeepTFactor.
Project description:Insertion sequences (IS) are compact and pervasive transposable elements found in bacteria, which encode only the genes necessary for their mobilization and maintenance. IS200/IS605 elements undergo ‘peel-and-paste’ transposition catalyzed by a TnpA transposase, but intriguingly, they also encode diverse TnpB-family genes that are evolutionarily related to the CRISPR-associated effectors Cas9 and Cas12. Recent studies demonstrated that TnpB-family enzymes function as RNA-guided DNA endonucleases, but the broader biological role of this activity has remained enigmatic. Here we show that IscB and TnpB are essential to prevent loss of the donor IS element and potential transposon extinction as a consequence of the TnpA transposition mechanism. We first performed phylogenetic analysis of IscB/TnpB proteins and selected a family of related IS elements from Geobacillus stearothermophilus that we predicted would be mobilized by a common TnpA homolog. After reconstituting transposition using a heterologous expression system in E. coli, we found that IS elements were readily lost from the donor site because TnpA rejoins the flanking sequences upon excision. However, these IS elements also encode non-coding RNAs that guide TnpB and IscB nucleases to precisely recognize and cleave these excision products, leading either to elimination of the excision product or to re-installation of the transposon through recombination. Indeed, under experimental conditions in which TnpA and TnpB-RNA complexes were co-expressed with a genomically integrated IS element, transposon retention was significantly increased relative to conditions expressing TnpA alone. Remarkably, both TnpA and TnpB recognize the same AT-rich transposon-adjacent motif (TAM) during transposon excision and RNA-guided DNA cleavage, respectively, revealing a striking convergence in the evolution of DNA sequence specificity between transposase and nuclease.
Collectively, our study reveals that RNA-guided DNA cleavage is a primal biochemical activity that arose to bias the selfish inheritance of transposable elements, which was later co-opted during the evolution of CRISPR-Cas adaptive immunity for antiviral defense.
Project description:This study aims to predict the activity and specificity of CRISPR/Cas9 by deep learning at genome scale across different cell lines. Here, we adopted and modified a system for evaluating on-target and off-target SpCas9 activity in a high-throughput manner, using >1,000,000 guide RNAs (gRNAs) covering ~20,000 protein-coding genes and ~10,000 non-coding genes in synthetic constructs. Using deep learning algorithms, we constructed three prediction models with the best generalization performance: Aidit_Cas9-ON, Aidit_Cas9-OFF, and Aidit_Cas9-DSB. Moreover, by systematically investigating the influence of diverse cellular environments on gRNA activity and specificity, we found that on-target activity in the H1 cell line favors distinct features compared with the other two cell lines, and that the overall distribution of repair outcomes differs markedly across the three cell lines, especially in Jurkat. Finally, we identify a key effector protein, DNTT, that strongly influences editing outcomes induced by CRISPR/Cas9. We expect that this study will greatly facilitate CRISPR-based genome editing.
Project description:Many previous studies, including next-generation sequencing (NGS)-based ones, have shown the critical roles of RNA editing in biomedicine. Direct RNA sequencing is emerging as another powerful technique to advance the understanding of RNA editing through new paradigms, especially in single-molecule and long-range characterization. An urgent gap is the accurate and robust identification of RNA editing at single-molecule and single-nucleotide resolution from direct RNA sequencing. This is challenging because the raw signals are inherently context-dependent, which requires enormous training data with considerable diversity. Here we propose two coupled measures to address this challenge: 1) an abductive deep learning strategy, implemented as the software ReDD, that fully utilizes the widely accessible NGS-based RNA editing data as indirect labels for direct RNA sequencing to achieve detection at the single-molecule level; 2) a cloud-based platform, Argo-ReDD, that serves as a central database for assembling large and diverse data from the community to continuously train the abductive deep learning model, and that also meets the community demand for a user-friendly way to perform RNA editing analyses, such as co-occurrence analysis, quantitative analysis and gene isoform-resolved analysis, based on the specific information from direct RNA sequencing.