Dataset Information

Identifying gene expression-based biomarkers in online learning environments.

ABSTRACT:

Motivation

Gene expression-based classifiers are often developed from historical data by training a model on a small set of patients and a large set of features. Models trained in this way can then be applied to predict the output for new, unseen patient data. However, the accuracy of these models often starts to decrease as soon as new data are fed into the trained model. This problem, known as concept drift, complicates the task of learning efficient biomarkers from data and requires special approaches that differ from commonly used data mining techniques.

Results

Here, we propose an online ensemble learning method to continually validate and adjust gene expression-based biomarker panels over an increasing volume of data. We also propose a computational solution to the problem of feature drift, in which the gene expression signatures used to train the classifier become less relevant over time. A benchmark study was conducted to classify breast tumors into known subtypes using a large-scale transcriptomic dataset (∼3500 patients), obtained by combining two datasets: SCAN-B and TCGA-BRCA. Remarkably, the proposed strategy improves the classification performance of gold-standard biomarker panels (e.g. PAM50, OncotypeDX and Endopredict) by adding features that are clinically relevant. Moreover, test results show that newly discovered biomarker models can retain a high classification accuracy when the source generating the gene expression profiles changes.
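To make the idea of online ensemble learning under feature drift more concrete, the following is a minimal sketch on synthetic data. It is not the authors' implementation (see the repository listed under Availability and implementation for that). All names here (`OnlineLogistic`, `WeightedEnsemble`, `make_stream`, `run_demo`) and all parameter values are hypothetical, and the weighted-majority update is a generic textbook technique chosen for illustration. Each ensemble member is an online logistic regression that sees a single feature; the stream switches its informative feature halfway through, and the ensemble is evaluated in a test-then-train (prequential) fashion.

```python
import math
import random


class OnlineLogistic:
    """Logistic regression on a feature subset, updated one sample at a time."""

    def __init__(self, features, lr=0.1):
        self.features = features            # indices of the features this member sees
        self.w = [0.0] * len(features)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(w * x[j] for w, j in zip(self.w, self.features))
        z = max(-30.0, min(30.0, z))        # clamp to keep exp() numerically stable
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        err = self.predict_proba(x) - y     # gradient of the log loss w.r.t. z
        for k, j in enumerate(self.features):
            self.w[k] -= self.lr * err * x[j]
        self.b -= self.lr * err


class WeightedEnsemble:
    """Weighted-majority vote; members that err are down-weighted by beta."""

    def __init__(self, members, beta=0.8, floor=1e-3):
        self.members = members
        self.weights = [1.0] * len(members)
        self.beta = beta
        self.floor = floor                  # lets down-weighted members recover after drift

    def predict(self, x):
        votes = [m.predict_proba(x) >= 0.5 for m in self.members]
        score = sum(w for w, v in zip(self.weights, votes) if v)
        return int(score >= 0.5 * sum(self.weights))

    def update(self, x, y):
        for i, m in enumerate(self.members):
            if int(m.predict_proba(x) >= 0.5) != y:
                self.weights[i] *= self.beta
            m.update(x, y)
        total = sum(self.weights)
        self.weights = [max(w / total, self.floor) for w in self.weights]


def make_stream(n_samples, n_features, drift_at, seed=0):
    """Synthetic stream: feature 0 is informative before drift_at, feature 1 after."""
    rng = random.Random(seed)
    for t in range(n_samples):
        informative = 0 if t < drift_at else 1
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        x[informative] += 2.0 if y == 1 else -2.0
        yield x, y


def run_demo(n_samples=4000, n_features=4, drift_at=2000, window=500):
    """Prequential evaluation: predict each sample first, then learn from it."""
    ensemble = WeightedEnsemble([OnlineLogistic([j]) for j in range(n_features)])
    correct = []
    for x, y in make_stream(n_samples, n_features, drift_at):
        correct.append(int(ensemble.predict(x) == y))
        ensemble.update(x, y)
    before = sum(correct[drift_at - window:drift_at]) / window
    after = sum(correct[-window:]) / window
    return before, after


if __name__ == "__main__":
    acc_before, acc_after = run_demo()
    print(f"accuracy before drift: {acc_before:.3f}, after recovery: {acc_after:.3f}")
```

In this toy setting, the weight floor is the design choice that matters: without it, a member tied to a currently uninformative feature could be driven to a negligible weight and would be slow to regain influence once that feature becomes relevant.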

Availability and implementation

github.com/UEFBiomedicalInformaticsLab/OnlineLearningBD.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

SUBMITTER: Cattelani L

PROVIDER: S-EPMC9710669 | biostudies-literature | 2022

REPOSITORIES: biostudies-literature


Publications

Identifying gene expression-based biomarkers in online learning environments.

Cattelani L, Fortino V

Bioinformatics Advances, 2022-10-13, 1


PMID: 36699355

Similar Datasets

Project description:

Importance

Environments associated with smoking increase a smoker's craving to smoke and may provoke lapses during a quit attempt. Identifying smoking risk environments from images of a smoker's daily life provides a basis for environment-based interventions.

Objective

To apply a deep learning approach to the clinically relevant identification of smoking environments among settings that smokers encounter in daily life.

Design, setting, and participants

In this cross-sectional study, 4902 images of smoking (n = 2457) and nonsmoking (n = 2445) locations were photographed by 169 smokers from Durham, North Carolina, and Pittsburgh, Pennsylvania, areas from 2010 to 2016. These images were used to develop a probabilistic classifier to predict the location type (smoking or nonsmoking location), thus relating objects and settings in daily environments to established smoking patterns. The classifier combines a deep convolutional neural network with an interpretable logistic regression model and was trained and evaluated via nested cross-validation with participant-wise partitions (ie, out-of-sample prediction). To contextualize model performance, images taken by 25 randomly selected participants were also classified by smoking cessation experts. As secondary validation, craving levels reported by participants when viewing unfamiliar environments were compared with the model's predictions. Data analysis was performed from September 2017 to May 2018.

Main outcomes and measures

Classifier performance (accuracy and area under the receiver operating characteristic curve [AUC]), comparison with 4 smoking cessation experts, contribution of objects and settings to smoking environment status (standardized model coefficients), and correlation with participant-reported craving.

Results

Of 169 participants, 106 (62.7%) were from Durham (53 [50.0%] female; mean [SD] age, 41.4 [12.0] years) and 63 (37.3%) were from Pittsburgh (31 [51.7%] female; mean [SD] age, 35.2 [13.8] years). A total of 4902 images were available for analysis, including 3386 from Durham (mean [SD], 31.9 [1.3] images per participant) and 1516 from Pittsburgh (mean [SD], 24.1 [0.5] images per participant). Images were evenly split between the 2 classes, with 2457 smoking images (50.1%) and 2445 nonsmoking images (49.9%). The final model discriminated smoking vs nonsmoking environments with a mean (SD) AUC of 0.840 (0.024) (accuracy [SD], 76.5% [1.6%]). A model trained only with images from Durham participants effectively classified images from Pittsburgh participants (AUC, 0.757; accuracy, 69.2%), and a model trained only with images from Pittsburgh participants effectively classified images from Durham participants (AUC, 0.821; accuracy, 75.0%), suggesting good generalizability between geographic areas. Only 1 expert's performance was a statistically significant improvement compared with the classifier (α = .05). Median self-reported craving was significantly correlated with model-predicted smoking environment status (ρ = 0.894; P = .003).

Conclusions and relevance

In this study, features of daily environments predicted smoking vs nonsmoking status consistently across participants. The findings suggest that a deep learning approach can identify environments associated with smoking, can predict the probability that any image of daily life represents a smoking environment, and can potentially trigger environment-based interventions. This work demonstrates a framework for predicting how daily environments may influence target behaviors or symptoms that may have broad applications in mental and physical health.
