Dataset Information

Robust identification of molecular phenotypes using semi-supervised learning.

ABSTRACT:

Background

Modern molecular profiling techniques are yielding vast amounts of data from patient samples that could be utilized with machine learning methods to provide important biological insights and improvements in patient outcomes. Unsupervised methods have been successfully used to identify molecularly-defined disease subtypes. However, these approaches do not take advantage of potential additional clinical outcome information. Supervised methods can be implemented when training classes are apparent (e.g., responders or non-responders to treatment). However, training classes can be difficult to define when assessing relative benefit of one therapy over another using gold standard clinical endpoints, since it is often not clear how much benefit each individual patient receives.

Results

We introduce an iterative approach to binary classification tasks based on the simultaneous refinement of training class labels and classifiers towards self-consistency. As training labels are refined during the process, the method is well suited to cases where training class definitions are not obvious or noisy. Clinical data, including time-to-event endpoints, can be incorporated into the approach to enable the iterative refinement to identify molecular phenotypes associated with a particular clinical variable. Using synthetic data, we show how this approach can be used to increase the accuracy of identification of outcome-related phenotypes and their associated molecular attributes. Further, we demonstrate that the advantages of the method persist in real world genomic datasets, allowing the reliable identification of molecular phenotypes and estimation of their association with outcome that generalizes to validation datasets. We show that at convergence of the iterative refinement, there is a consistent incorporation of the molecular data into the classifier yielding the molecular phenotype and that this allows a robust identification of associated attributes and the underlying biological processes.
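As an illustration only, the idea of refining training labels and classifier together until they are self-consistent can be sketched as below. This is not the authors' implementation: the nearest-centroid classifier, the leave-one-out prediction step, the synthetic two-cluster data, and all function names (`make_demo_data`, `loo_predict`, `refine_labels`) are assumptions chosen to keep the example small. It also omits the time-to-event endpoints mentioned in the abstract.

```python
import random


def make_demo_data(n_per_class=20, seed=0):
    """Hypothetical synthetic data: two well-separated 2-D clusters."""
    rng = random.Random(seed)
    features, truth = [], []
    for label, center in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n_per_class):
            features.append([rng.gauss(c, 0.3) for c in center])
            truth.append(label)
    return features, truth


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_predict(features, labels, i):
    """Classify sample i with a nearest-centroid classifier trained
    on every other sample (a stand-in for out-of-sample prediction)."""
    groups = {0: [], 1: []}
    for j, (x, y) in enumerate(zip(features, labels)):
        if j != i:
            groups[y].append(x)
    if not groups[0]:  # degenerate case: only one class remains
        return 1
    if not groups[1]:
        return 0
    d0 = sq_dist(features[i], centroid(groups[0]))
    d1 = sq_dist(features[i], centroid(groups[1]))
    return 0 if d0 <= d1 else 1


def refine_labels(features, initial_labels, max_iter=50):
    """Replace labels with out-of-sample predictions until they stop
    changing (self-consistency) or max_iter is reached."""
    labels = list(initial_labels)
    for iteration in range(1, max_iter + 1):
        new_labels = [loo_predict(features, labels, i)
                      for i in range(len(features))]
        if new_labels == labels:
            return labels, iteration
        labels = new_labels
    return labels, max_iter


features, truth = make_demo_data()
# Start from noisy training labels: every fifth label is flipped.
noisy = [1 - y if i % 5 == 0 else y for i, y in enumerate(truth)]
refined, n_iter = refine_labels(features, noisy)
print("labels changed by refinement:",
      sum(a != b for a, b in zip(noisy, refined)))
print("iterations until self-consistent:", n_iter)
```

Predicting each sample from a classifier that never saw it is what lets a mislabeled sample be reassigned; predicting on the training set itself would tend to reproduce the noisy labels.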

Conclusions

The consistent incorporation of the structure of the molecular data into the classifier helps to minimize overfitting and facilitates not only good generalization of classification and molecular phenotypes, but also reliable identification of biologically relevant features and elucidation of underlying biological processes.

SUBMITTER: Roder H

PROVIDER: S-EPMC6540576 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature

Publications

Robust identification of molecular phenotypes using semi-supervised learning.

Heinrich Roder, Carlos Oliveira, Lelia Net, Benjamin Linstid, Maxim Tsypin, Joanna Roder

BMC Bioinformatics, 2019-05-28 (1)

PMID: 31138112

Similar Datasets

Project description:

Objectives: Early identification of infection improves outcomes, but developing models for early identification requires determining infection status with manual chart review, limiting sample size. Therefore, we aimed to compare semi-supervised and transfer learning algorithms with algorithms based solely on manual chart review for identifying infection in hospitalized patients.

Materials and methods: This multicenter retrospective study of admissions to 6 hospitals included "gold-standard" labels of infection from manual chart review and "silver-standard" labels from nonchart-reviewed patients using the Sepsis-3 infection criteria based on antibiotic and culture orders. "Gold-standard" labeled admissions were randomly allocated to training (70%) and testing (30%) datasets. Using patient characteristics, vital signs, and laboratory data from the first 24 hours of admission, we derived deep learning and non-deep learning models using transfer learning and semi-supervised methods. Performance was compared in the gold-standard test set using discrimination and calibration metrics.

Results: The study comprised 432 965 admissions, of which 2724 underwent chart review. In the test set, deep learning and non-deep learning approaches had similar discrimination (area under the receiver operating characteristic curve of 0.82). Semi-supervised and transfer learning approaches did not improve discrimination over models fit using only silver- or gold-standard data. Transfer learning had the best calibration (unreliability index P value: .997, Brier score: 0.173), followed by self-learning gradient boosted machine (P value: .67, Brier score: 0.170).

Discussion: Deep learning and non-deep learning models performed similarly for identifying infection, as did models developed using Sepsis-3 and manual chart review labels.

Conclusion: In a multicenter study of almost 3000 chart-reviewed patients, semi-supervised and transfer learning models showed similar performance for model discrimination as baseline XGBoost, while transfer learning improved calibration.
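For readers unfamiliar with the Brier score quoted in this description, it is the mean squared difference between a predicted probability and the observed binary outcome, so lower values are better and a constant prediction of 0.5 scores 0.25. The sketch below is a generic illustration with made-up numbers; the function name `brier_score` is hypothetical and nothing here reproduces the study's own computation or data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities
    and observed binary outcomes (0 or 1)."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return total / len(outcomes)


# Made-up predictions for four hypothetical admissions.
predicted = [0.9, 0.2, 0.6, 0.1]
observed = [1, 0, 1, 0]
print(round(brier_score(predicted, observed), 3))  # prints 0.055
```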

Project description:

Background: Skin cancers are the most common malignancies diagnosed worldwide. While the early detection and treatment of pre-cancerous and cancerous skin lesions can dramatically improve outcomes, factors such as a global shortage of pathologists, increased workloads, and high rates of diagnostic discordance underscore the need for techniques that improve pathology workflows. Although AI models are now being used to classify lesions from whole slide images (WSIs), diagnostic performance rarely surpasses that of expert pathologists.

Objectives: The objective of the present study was to create an AI model to detect and classify skin lesions with a higher degree of sensitivity than previously demonstrated, with potential to match and eventually surpass expert pathologists to improve clinical workflows.

Methods: We combined supervised learning (SL) with semi-supervised learning (SSL) to produce an end-to-end multi-level skin detection system that not only detects 5 main types of skin lesions with high sensitivity and specificity, but also subtypes, localizes, and provides margin status to evaluate the proximity of the lesion to non-epidermal margins. The Supervised Training Subset consisted of 2188 random WSIs collected by the PathologyWatch (PW) laboratory between 2013 and 2018, while the Weakly Supervised Subset consisted of 5161 WSIs from daily case specimens. The Validation Set consisted of 250 curated daily case WSIs obtained from the PW tissue archives and included 50 "mimickers". The Testing Set (3821 WSIs) was composed of non-curated daily case specimens collected from July 20, 2021 to August 20, 2021 from PW laboratories.

Results: The performance characteristics of our AI model (i.e., Mihm) were assessed retrospectively by running the Testing Set through the Mihm Evaluation Pipeline. Our results show that the sensitivity of Mihm in classifying melanocytic lesions, basal cell carcinoma, atypical squamous lesions, verruca vulgaris, and seborrheic keratosis was 98.91% (95% CI: 98.27%, 99.55%), 97.24% (95% CI: 96.15%, 98.33%), 95.26% (95% CI: 93.79%, 96.73%), 93.50% (95% CI: 89.14%, 97.86%), and 86.91% (95% CI: 82.13%, 91.69%), respectively. Additionally, our multi-level (i.e., patch-level, ROI-level, and WSI-level) detection algorithm includes a qualitative feature that subtypes lesions, an AI overlay in the front-end digital display that localizes diagnostic ROIs, and reports on margin status by detecting overlap between lesions and non-epidermal tissue margins.

Conclusions: Our AI model, developed in collaboration with dermatopathologists, detects 5 skin lesion types with higher sensitivity than previously published AI models, and provides end users with information such as subtyping, localization, and margin status in a front-end digital display. Our end-to-end system has the potential to improve pathology workflows by increasing diagnostic accuracy, expediting the course of patient care, and ultimately improving patient outcomes.
