
Dataset Information


BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT.


ABSTRACT: DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes and plays important roles in various developmental and physiological processes and in a wide range of human diseases, so accurate detection of 5mC sites is essential. Although current sequencing technologies can map 5mC sites genome-wide, these experimental methods are both costly and time-consuming. To predict 5mC sites quickly and accurately, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (Bidirectional Encoder Representations from Transformers) model, using human promoter sequences as the language corpus. BERT is a deep bidirectional language representation model based on the Transformer. Second, we fine-tuned the domain-specific BERT model on the 5mC training dataset to build the predictor. In cross-validation, our model achieves an AUROC of 0.966, which is higher than that of other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. On the independent test set, our model also achieves an AUROC of 0.966, again higher than that of the other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a web server that can be freely accessed at http://5mc-pred.zhulab.org.cn.
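For readers unfamiliar with the AUROC metric reported in the abstract, the following is a minimal sketch of how it can be computed from binary labels and predicted scores. It uses the standard pairwise-ranking definition (the probability that a randomly chosen positive site is scored above a randomly chosen negative one, with ties counted as one half). The function name `auroc` and the toy inputs are illustrative assumptions; this is not the authors' evaluation code, and the values below are not taken from the paper.

```python
def auroc(labels, scores):
    """Pairwise-ranking AUROC: P(score of positive > score of negative), ties count 0.5.

    labels: iterable of 0/1 (1 = 5mC site, 0 = non-5mC site); hypothetical encoding.
    scores: iterable of predicted scores, same length as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example only: two positives and two negatives.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
```

The quadratic loop is chosen for clarity; for large test sets a rank-based implementation gives the same value more efficiently.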

SUBMITTER: Wang S 

PROVIDER: S-EPMC10712318 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature


Publications

BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT.

Wang Shuyu, Liu Yinbo, Liu Yufeng, Zhang Yong, Zhu Xiaolei

PeerJ, 2023-12-08



Similar Datasets

| S-EPMC11066948 | biostudies-literature
| S-EPMC8599298 | biostudies-literature
| S-EPMC9241225 | biostudies-literature
| S-EPMC10981123 | biostudies-literature
| S-EPMC6251864 | biostudies-literature
| S-EPMC9580886 | biostudies-literature
| S-EPMC9793967 | biostudies-literature
| S-EPMC8776474 | biostudies-literature
| S-EPMC7067889 | biostudies-literature
| S-EPMC11310455 | biostudies-literature