Project description: The RNA polymerase II core promoter is the site of convergence of the signals that lead to the initiation of transcription. Here, we perform a comparative analysis of the downstream core promoter region (DPR) in Drosophila and humans by using machine learning. These studies revealed a distinct human-specific version of the DPR and led to the use of the machine learning models for the identification of synthetic extreme DPR motifs with specificity for human transcription factors relative to Drosophila factors, and vice versa. More generally, machine learning models could be analogously used to design synthetic promoter elements with customized functional properties.
Project description: The RNA polymerase II (Pol II) core promoter is the strategic site of convergence of the signals that lead to the initiation of DNA transcription, but the downstream core promoter in humans has been difficult to understand. Here we analyse the human Pol II core promoter and use machine learning to generate predictive models for the downstream core promoter region (DPR) and the TATA box. We developed a method termed HARPE (high-throughput analysis of randomized promoter elements) to create hundreds of thousands of DPR (or TATA box) variants, each with known transcriptional strength. We then analysed the HARPE data by support vector regression (SVR) to provide comprehensive models for the sequence motifs, and found that the SVR-based approach is more effective than a consensus-based method for predicting transcriptional activity. These results show that the DPR is a functionally important core promoter element that is widely used in human promoters. Notably, there appears to be a duality between the DPR and the TATA box, as many promoters contain one or the other element. More broadly, these findings show that functional DNA motifs can be identified by machine learning analysis of a comprehensive set of sequence variants.
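A regression model such as SVR needs each sequence variant as a fixed-length numeric vector. The description above does not say how the variants were encoded, so the following is only a minimal sketch of one common choice, one-hot encoding. The function name `one_hot` and the example sequences are hypothetical and are not taken from the study.

```python
# Minimal sketch: one-hot encoding of DNA sequence variants.
# Assumption: the encoding scheme and the sequences below are illustrative
# only; the project description does not specify them.

BASES = "ACGT"


def one_hot(sequence):
    """Flatten a DNA sequence into a list of 4 * len(sequence) zeros and ones."""
    vector = []
    for base in sequence.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        # Four positions per nucleotide, in the order A, C, G, T.
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector


# Made-up variants of equal length; real variants would come from the assay.
variants = ["AGTC", "AGAC", "TTTC"]
features = [one_hot(v) for v in variants]
```

Vectors like `features`, paired with a measured transcriptional strength for each variant, are the kind of input a regression model (for example scikit-learn's `sklearn.svm.SVR`) could be fitted on.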
Project description: Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Samples 200-222 form a validation set. These data are used to build a machine learning classifier for estrogen receptor (ER) status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description: Human induced pluripotent stem cells (iPSCs) were established as an artificial counterpart to embryonic stem cells (ESCs) to avoid immune rejection, to address ethical issues in regenerative medicine, and for biological research. Comparative analyses in previous studies revealed no hot spot that distinguishes iPSCs from ESCs. Here we established a learning model using Jubatus, a machine learning platform, with a linear model for classification, to distinguish human iPSCs from ESCs based on DNA methylation profiles. We found that linear model classification is most suitable for the analysis of human iPSCs when the number of cell lines is, in practice, 10 to 100. The learning models discriminated ESCs and iPSCs with an accuracy of ≥ 85.71 % and ≥ 90.91 %, respectively. In addition, the epigenetic signature of iPSCs was identified by component analysis of the learning models. The iPSC-specific fluctuated methylation regions were abundant on chromosomes 7, 8, 12, and 22. The method can be utilized with comprehensive data and can also be widely applied to many aspects of molecular biology research.
Project description: We examined how well various supervised machine learning methods, such as decision tree, partial least squares discriminant analysis (PLSDA), support vector machine, and random forest, perform in classifying endometriosis samples from control samples when trained on both transcriptomics and methylomics data. The assessment was done from two perspectives for improving classification performance: (a) the effect of three different normalization techniques, and (b) the effect of differential analysis using the generalized linear model (GLM). We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
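Of the normalization techniques named above, quantile normalization is simple enough to illustrate directly. The following is a minimal sketch of the textbook procedure, not the pipeline used in the study. The function name `quantile_normalize` and the toy values are hypothetical, and ties are broken by feature position rather than averaged, which is a simplification.

```python
# Minimal sketch: textbook quantile normalization across samples.
# Assumption: each inner list is one sample, and all samples measure the
# same features in the same order. The toy values are made up.


def quantile_normalize(samples):
    """Give every sample the same value distribution (the mean per rank)."""
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same number of features")

    # Mean of the k-th smallest value across samples, for each rank k.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(sample[k] for sample in sorted_samples) / len(samples)
        for k in range(n)
    ]

    normalized = []
    for sample in samples:
        # Feature indices ordered from smallest to largest value.
        # Ties are broken by position (a simplification).
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


toy = [[5, 2, 3, 4], [4, 1, 3, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(toy)
```

After this step every sample contains the same set of values, so between-sample differences in overall distribution are removed while the ordering of features within each sample is kept.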