Dataset Information

Accurate prediction of B-form/A-form DNA conformation propensity from primary sequence: A machine learning and free energy handshake.


ABSTRACT: DNA carries the genetic code of life, with different conformations associated with different biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. We have deployed a host of machine learning algorithms, including the popular state-of-the-art LightGBM (a gradient boosting model), for building prediction models. We used the nested cross-validation strategy to address the issues of "overfitting" and selection bias. This simultaneously provides an unbiased estimate of the generalization performance of a machine learning algorithm and allows us to tune the hyperparameters optimally. Furthermore, we built a secondary model based on SHAP (SHapley Additive exPlanations) that offers crucial insight into model interpretability. Our detailed model-building strategy and robust statistical validation protocols tackle the formidable challenge of working on small datasets, which is often the case in biological and medical data.
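The nested cross-validation strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a tiny k-nearest-neighbour classifier stands in for LightGBM so that the example needs only the standard library, the sequences are synthetic, the class labels come from an arbitrary GC-content rule rather than from any measured A-form/B-form propensity, and all function names are hypothetical. The inner loop picks the hyperparameter using only the outer training fold; the outer loop scores that choice on data the tuning never saw.

```python
import random
from collections import Counter

def kfold_indices(n, k, seed=0):
    """Split range(n) into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def gc_fraction(seq):
    """Fraction of G/C bases; a single toy feature (assumption, not the paper's features)."""
    return sum(base in "GC" for base in seq) / len(seq)

def knn_predict(train, x, k):
    """Majority vote among the k training points whose feature is closest to x."""
    nearest = sorted(train, key=lambda fl: abs(fl[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def split(data, held_idx):
    """Return (rest, held) where held contains the rows listed in held_idx."""
    held = set(held_idx)
    rest = [row for j, row in enumerate(data) if j not in held]
    return rest, [data[j] for j in held_idx]

def nested_cv(data, k_grid, outer_k=5, inner_k=3, seed=0):
    """Outer folds estimate generalization; inner folds tune k on outer-train only."""
    outer_scores = []
    for i, test_idx in enumerate(kfold_indices(len(data), outer_k, seed)):
        outer_train, outer_test = split(data, test_idx)
        inner_folds = kfold_indices(len(outer_train), inner_k, seed + 1 + i)

        def inner_score(k):
            scores = []
            for val_idx in inner_folds:
                tr, va = split(outer_train, val_idx)
                scores.append(accuracy(tr, va, k))
            return sum(scores) / len(scores)

        best_k = max(k_grid, key=inner_score)  # tuned without touching outer_test
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return outer_scores

# Synthetic toy data: random 12-mers labelled by an arbitrary GC threshold.
rng = random.Random(42)
seqs = ["".join(rng.choice("ACGT") for _ in range(12)) for _ in range(40)]
data = [(gc_fraction(s), int(gc_fraction(s) > 0.5)) for s in seqs]

scores = nested_cv(data, k_grid=[1, 3, 5])
print("outer-fold accuracies:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Because the outer test fold never takes part in choosing `k`, the mean of the outer scores is not inflated by the hyperparameter search, which is the selection-bias concern the abstract raises for small datasets.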

SUBMITTER: Gupta A 

PROVIDER: S-EPMC8441556 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC7782579 | biostudies-literature
| S-EPMC8325294 | biostudies-literature
| S-EPMC4100209 | biostudies-literature
| S-EPMC5465316 | biostudies-literature
| S-EPMC3631651 | biostudies-literature
| S-EPMC4638330 | biostudies-literature
| S-EPMC5419702 | biostudies-literature
| S-EPMC5561031 | biostudies-other
| S-EPMC9472276 | biostudies-literature
| S-EPMC8844416 | biostudies-literature