Project description:Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations are due to the challenge of selecting appropriate features of promoters that distinguish them from non-promoters and the generalization or predictive ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model, built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87%, and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters.
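The two descriptor families named above, n-mer frequencies and Tsallis entropy, can be illustrated with a minimal sketch. It assumes the standard definition S_q = (1 - sum(p_i^q)) / (q - 1); the function names, the choice of k = 2 and q = 2, and the toy sequence are hypothetical and are not taken from the study, which does not state how its descriptors were computed.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all overlapping k-mers in a DNA string."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def tsallis_entropy(probabilities, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1), defined for q != 1."""
    if q == 1.0:
        raise ValueError("q = 1 is the Shannon limit; use q != 1 here")
    return (1.0 - sum(p ** q for p in probabilities)) / (q - 1.0)


# Hypothetical toy sequence, used only to exercise the functions.
freqs = kmer_frequencies("ACGTACGTAC", k=2)
entropy = tsallis_entropy(freqs.values(), q=2.0)
print(freqs, entropy)
```

In a classifier such as the one described, values like these would be computed per sequence window and passed to the SVM as features; that wiring is omitted here.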
Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Samples 200-222 form a validation set. These data are used to build a machine learning classifier for estrogen receptor (ER) status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description:Effective in silico methods to predict protein corona compositions on engineered nanomaterials (ENMs) could help elucidate the biological outcomes of ENMs in biosystems without the need for conducting lengthy experiments for corona characterization. However, the physicochemical properties of ENMs, used as the descriptors in current modeling methods, are insufficient to represent the complex interactions between ENMs and proteins. Herein, we utilized the fluorescence change (FC) from fluorescamine labeling on a protein, with or without the presence of the ENM, as a novel descriptor of the ENM to build machine learning models for corona formation. FCs were significantly correlated with the abundance of the corresponding proteins in the corona on diverse classes of ENMs, including metals and metal oxides, nanocellulose, and 2D ENMs. Prediction models established by the random forest algorithm using FCs as the ENM descriptors showed better performance than those using conventional descriptors, such as ENM size and surface charge, in the prediction of corona formation. Moreover, they were able to predict protein corona formation on ENMs with very heterogeneous properties. We believe this novel descriptor can improve in silico studies of corona formation, leading to a better understanding of the protein adsorption behaviors of diverse ENMs in different biological matrices. Such information is essential for gaining a comprehensive view of how ENMs interact with biological systems in ENM safety and sustainability assessments.
Project description:Early detection of severe asthma exacerbations through home monitoring data in patients with stable mild-to-moderate chronic asthma could help to adjust medication in a timely manner. We evaluated the potential of machine learning (ML) methods compared to a clinical rule and logistic regression to predict severe exacerbations. We used daily home monitoring data from two studies in asthma patients (development: n = 165 and validation: n = 101 patients). Two ML models (XGBoost, one-class SVM) and a logistic regression model provided predictions based on peak expiratory flow and asthma symptoms. These models were compared with an asthma action plan rule. Severe exacerbations occurred in 0.2% of all daily measurements in the development (154/92,787 days) and validation cohorts (94/40,185 days). In the validation cohort, the AUC was 0.85 (0.82-0.87) for the best-performing XGBoost model and 0.88 (0.86-0.90) for logistic regression. The XGBoost model provided overly extreme risk estimates, whereas the logistic regression underestimated predicted risks. Sensitivity and specificity were better overall for XGBoost and logistic regression compared to one-class SVM and the clinical rule. We conclude that ML models did not beat logistic regression in predicting short-term severe asthma exacerbations based on home monitoring data. Clinical application remains challenging in settings with low event incidence, where high sensitivity comes with high false alarm rates.
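The link between low event incidence and false alarms can be made concrete with a short calculation. The sketch below recomputes the daily event rates from the counts reported above and then applies Bayes' rule to obtain the positive predictive value. The 90% sensitivity and 90% specificity are a hypothetical operating point chosen for illustration only; they are not results from the study.

```python
def event_rate(events, days):
    """Fraction of monitored days on which a severe exacerbation occurred."""
    return events / days


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of alarms that are true events, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Counts as reported in the abstract.
dev_rate = event_rate(154, 92_787)
val_rate = event_rate(94, 40_185)

# Hypothetical operating point (assumption, not a study result).
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.90,
                                prevalence=val_rate)

print(f"development: {dev_rate:.2%}, validation: {val_rate:.2%}, PPV: {ppv:.1%}")
```

Under these assumed values only roughly 2% of alarms would correspond to a true exacerbation, which is one way to read the abstract's point about clinical application at low incidence.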
Project description:Supramolecular hydrogels derived from nucleosides have been gaining significant attention in the biomedical field due to their unique properties and excellent biocompatibility. However, a major challenge in this field is that there is no model for predicting whether a nucleoside derivative will form a hydrogel. Here, we successfully develop a machine learning model to predict the hydrogel-forming ability of nucleoside derivatives. The optimal model, with a 71% (95% Confidence Interval, 0.69-0.73) accuracy, is established based on a dataset of 71 reported nucleoside derivatives. Twenty-four molecules are selected by external application of the optimal model, and their hydrogel-forming ability is experimentally verified. Among these, two rarely reported cation-independent nucleoside hydrogels are found. Based on their self-assembly mechanisms, the cation-independent hydrogel is found to have potential applications in rapid visual detection of Ag+ and cysteine. Here, we show that the machine learning model may provide a tool to predict nucleoside derivatives with hydrogel-forming ability.
Project description:A data-driven approach to simulate circular dichroism (CD) spectra is appealing for fast protein secondary structure determination, yet the challenge of predicting electric and magnetic transition dipole moments poses a substantial barrier to this goal. To address this problem, we designed a new machine learning (ML) protocol in which ordinary pure geometry-based descriptors are replaced with alternative embedded density descriptors, and electric and magnetic transition dipole moments are successfully predicted with an accuracy comparable to first-principles calculation. The ML model is able not only to simulate protein CD spectra nearly 4 orders of magnitude faster than conventional first-principles simulation but also to obtain CD spectra in good agreement with experiments. Finally, we predicted a series of CD spectra of the Trp-cage protein associated with continuous changes of protein configuration along its folding path, showing the potential of our ML model for supporting real-time CD spectroscopy study of protein dynamics.
Project description:Yield prediction for crops is essential information for food security. A high-throughput phenotyping platform (HTPP) generates data on the complete life cycle of a plant. However, the data are rarely used for yield prediction because of the lack of quality image analysis methods, yield data associated with HTPP, and time-series analysis methods for yield prediction. To overcome these limitations, this study employed multiple deep learning (DL) networks to extract high-quality HTPP data, establish an association between HTPP data and the yield performance of crops, and select essential time intervals using machine learning (ML). Images of Arabidopsis were taken 12 times under environmentally controlled HTPP over 23 days after sowing (DAS). First, the features from images were extracted using the DL network U-Net with an SE-ResXt101 encoder and divided into early (15-21 DAS) and late (∼21-23 DAS) pre-flowering developmental stages using the physiological characteristics of the Arabidopsis plant. Second, the late pre-flowering stage at 23 DAS can be predicted using the ML algorithm XGBoost, based only on a portion of the early pre-flowering stage (17-21 DAS). This was confirmed using an additional biological experiment (P < 0.01). Finally, the projected area (PA) was used to estimate fresh weight (FW), and the correlation coefficient between FW and predicted FW was 0.85. This was the first study that analyzed time-series data to predict the FW of related but different developmental stages and predict the PA. The results of this study were informative and enabled the understanding of the FW of Arabidopsis or yield of leafy plants and total biomass consumed in vertical farming. Moreover, this study highlighted the reduction of time-series data for examining interesting traits and future application of time-series analysis in various HTPPs.
Project description:Oxidative stress has pervasive effects on cells but how they respond transcriptionally upon the initial insult is incompletely understood. We developed a nuclear walk-on assay that semi-globally quantifies nascent transcripts in promoter-proximal paused RNA polymerase II (Pol II). Using this assay in conjunction with ChIP-Seq, in vitro transcription, and a chromatin retention assay, we show that within a minute, hydrogen peroxide causes accumulation of Pol II near promoters and enhancers that can best be explained by a rapid decrease in termination. Some of the accumulated polymerases slowly move or 'creep' downstream. This second effect is correlated with and probably results from loss of NELF association and function. Notably, both effects were independent of DNA damage and ADP-ribosylation. Our results demonstrate the unexpected speed at which a global transcriptional response can occur. The findings provide strong support for the residence time of paused Pol II elongation complexes being much shorter than estimated from previous studies.
Project description:The enormous computational requirements and unsustainable resource consumption associated with massive parameters of large language models and large vision models have given rise to challenging issues. Here, we propose an interpretable 'small model' framework characterized by only a single core-neuron, i.e. the one-core-neuron system (OCNS), to significantly reduce the number of parameters while maintaining performance comparable to the existing 'large models' in time-series forecasting. With multiple delay feedback designed in this single neuron, our OCNS is able to convert one input feature vector/state into one-dimensional time-series/sequence, which is theoretically ensured to fully represent the states of the observed dynamical system. Leveraging the spatiotemporal information transformation, the OCNS shows excellent and robust performance in forecasting tasks, in particular for short-term high-dimensional systems. The results collectively demonstrate that the proposed OCNS with a single core neuron offers insights into constructing deep learning frameworks with a small model, presenting substantial potential as a new way for achieving efficient deep learning.
Project description:The theoretical prediction of drug-decorated nanoparticles (DDNPs) has become a very important task in medical applications. In this paper, Perturbation Theory Machine Learning (PTML) models were built to predict the probability of different pairs of drugs and nanoparticles creating DDNP complexes with anti-glioblastoma activity. PTML models use the perturbations of molecular descriptors of drugs and nanoparticles as inputs in experimental conditions. The raw dataset was obtained by mixing the nanoparticle experimental data with drug assays from the ChEMBL database. Ten types of machine learning methods were tested. Only 41 features were selected for 855,129 drug-nanoparticle complexes. The best model was obtained with the Bagging classifier, an ensemble meta-estimator based on 20 decision trees, with an area under the receiver operating characteristic curve (AUROC) of 0.96 and an accuracy of 87% (test subset). This model could be useful for the virtual screening of nanoparticle-drug complexes in glioblastoma. All the calculations can be reproduced with the datasets and Python scripts, which are freely available as a GitHub repository from the authors.
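The two test-subset metrics quoted above, AUROC and accuracy, can be computed from class labels and predicted probabilities as in the following minimal sketch. It uses the pairwise-ranking definition of AUROC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The function names, the 0.5 threshold, and the toy labels and scores are hypothetical and are not from the study or its repository.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at a given score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_accuracy = accuracy(toy_labels, toy_scores)
print(toy_auroc, toy_accuracy)
```

The pairwise loop is quadratic in the number of cases, so for a dataset on the scale described here a rank-based implementation from an established library would be the practical choice.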