Dataset Information

Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites.


ABSTRACT: We report an approach to predict the DNA specificity of tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined, with quantitative P-values defined to identify reliable predictions. Then, a framework was introduced to incorporate structural data and to train a statistical energy function that scores the pairing between a TFR and its binding site (TFBS) based on their sequences. When the predictions were benchmarked against experiments, TFBSs for 29 out of 30 TFRs were correctly predicted by either the genome sequence-based or the statistical energy-based method. Using P-values or Z-scores as indicators, we estimate that 59.6% of TFRs are covered with relatively reliable predictions by at least one of the two methods, while only 28.7% are covered by the genome sequence-based method alone. Our approach predicts a large number of new TFBSs that cannot be correctly retrieved from public databases such as FootprintDB. High-throughput experimental assays suggest that the statistical energy can reliably model the TFBSs of a significant number of TFRs. Thus, the energy function may be applied to search for new TFBSs in the respective genomes. It is possible to extend our approach to other transcription factor families with sufficient structural information.
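
The abstract uses Z-scores as a reliability indicator for energy-based predictions but does not say how they are computed. The sketch below shows one common way such a Z-score could be formed: the energy of a candidate site is compared against the energies of shuffled background sequences. Everything here is an illustrative assumption and not the authors' method: `toy_energy` is a hypothetical stand-in that only counts matches to a preferred sequence (it is not the trained statistical energy function), and `shuffled_background` and `z_score` are names invented for this example.

```python
import random
import statistics


def toy_energy(site, preferred):
    """Hypothetical stand-in energy: -1.0 for each position matching `preferred`.

    The real statistical energy function is trained on sequence and
    structural data; this toy version exists only to make the sketch run.
    """
    return -sum(1.0 for a, b in zip(site, preferred) if a == b)


def shuffled_background(site, n, seed=0):
    """Return n shuffled copies of `site` (same base composition)."""
    rng = random.Random(seed)
    bases = list(site)
    shuffled = []
    for _ in range(n):
        rng.shuffle(bases)
        shuffled.append("".join(bases))
    return shuffled


def z_score(candidate_energy, background_energies):
    """Standard Z-score of a candidate energy against a background sample."""
    mean = statistics.fmean(background_energies)
    std = statistics.stdev(background_energies)  # sample standard deviation
    if std == 0:
        raise ValueError("background energies have zero variance")
    return (candidate_energy - mean) / std


if __name__ == "__main__":
    preferred = "ACTCTATCATTGATAGAGT"  # arbitrary example sequence (assumption)
    candidate = preferred
    background = [toy_energy(s, preferred)
                  for s in shuffled_background(candidate, 200)]
    print(z_score(toy_energy(candidate, preferred), background))
```

Under this convention a strongly negative Z-score means the candidate site scores much more favourably than shuffled sequences of the same composition; any cutoff for calling a prediction reliable would have to come from the publication itself.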

SUBMITTER: Long P 

PROVIDER: S-EPMC7736823 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature

Publications

Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites.

Long Pengpeng, Zhang Lu, Huang Bin, Chen Quan, Liu Haiyan

Nucleic Acids Research, 2020-12-01, issue 22


Similar Datasets

| S-EPMC2987836 | biostudies-literature
| S-EPMC2638147 | biostudies-literature
| S-EPMC2241927 | biostudies-literature
| S-EPMC8197256 | biostudies-literature
| S-EPMC2847756 | biostudies-literature
| S-EPMC3689293 | biostudies-literature
| S-EPMC5054711 | biostudies-literature
| S-EPMC3540023 | biostudies-literature
| S-EPMC9204005 | biostudies-literature
| S-EPMC4029220 | biostudies-literature