Project description: We developed Solo, a semi-supervised deep learning framework for identifying doublets in scRNA-seq analysis. To validate our method, we used MULTI-seq with cholesterol-modified oligos (CMOs) to experimentally identify doublets in a solid tissue with diverse cell types, mouse kidney, and showed that Solo recapitulated the experimentally identified doublets.
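As a rough illustration of the semi-supervised idea behind doublet classifiers such as Solo, the sketch below simulates doublets by summing the counts of random cell pairs and trains a classifier to separate them from real cells. The toy data and the linear classifier are stand-ins, not the published implementation (Solo trains its classifier on a learned variational-autoencoder embedding).

```python
# Minimal sketch of semi-supervised doublet detection (toy data; not Solo's
# actual code): simulate doublets by summing counts of random cell pairs,
# then train a classifier on real vs. simulated cells.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(500, 200)).astype(float)  # cells x genes (toy)

# Simulate doublets: sum the counts of two randomly chosen cells.
idx = rng.integers(0, counts.shape[0], size=(300, 2))
doublets = counts[idx[:, 0]] + counts[idx[:, 1]]

X = np.vstack([counts, doublets])
y = np.concatenate([np.zeros(len(counts)), np.ones(len(doublets))])

# Solo classifies in a learned VAE latent space; a linear model on
# log-normalized counts stands in for that here.
clf = LogisticRegression(max_iter=1000).fit(np.log1p(X), y)
doublet_scores = clf.predict_proba(np.log1p(counts))[:, 1]  # per-cell scores
```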
Project description: In data-driven phenotyping, a core computational task is to identify medical concepts and their variations from electronic health record (EHR) sources to stratify phenotypic cohorts. Conventional analytic frameworks for phenotyping rely largely on manual knowledge engineering or on supervised learning, where clinical cases are represented by variables encompassing diagnoses, medications, and laboratory tests, among others. In such frameworks, feature engineering and data annotation remain tedious and expensive exercises, resulting in poor scalability. In addition, certain clinical conditions, such as those that are rare or acute in nature, may never accumulate sufficient data over time, which poses a challenge to establishing accurate and informative statistical models. In this paper, we use infectious diseases as the domain of study to demonstrate a hierarchical, ensemble-based learning method that addresses these issues through feature abstraction. We use a sparse annotation set to train and evaluate many phenotypes at once, an approach we call bulk learning. In this batch-phenotyping framework, disease cohort definitions can be learned within an abstract feature space established by using multiple diseases as a substrate and diagnostic codes as surrogate labels. In particular, training on surrogate labels makes it possible to evaluate the resulting models using only a sparsely annotated sample. Moreover, statistical models can be trained and evaluated, using the same sparse annotation, within a low-dimensional abstract feature space that encapsulates the shared clinical traits of the target diseases, collectively referred to as the bulk learning set.
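A minimal sketch of the surrogate-label idea, under the illustrative assumption that one base classifier is trained per disease with its diagnostic code as a noisy label, and that the stacked base-model scores form the low-dimensional abstract feature space evaluated against a sparse gold-labeled sample:

```python
# Hedged sketch of surrogate-label bulk learning (toy data; not the paper's
# pipeline): base models per disease trained on diagnostic codes as noisy
# labels, stacked scores as the abstract feature space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 300))            # toy EHR feature matrix
icd = rng.integers(0, 2, size=(2000, 10))   # surrogate labels: 10 diagnostic codes

# Stage 1: base models trained on surrogate (ICD) labels, no chart review needed.
base = [LogisticRegression(max_iter=500).fit(X, icd[:, d]) for d in range(10)]
Z = np.column_stack([m.predict_proba(X)[:, 1] for m in base])  # abstract features

# Stage 2: a small, sparsely annotated sample suffices to train/evaluate in Z.
gold_idx = rng.choice(2000, size=60, replace=False)
gold = icd[gold_idx, 0]                      # stand-in for chart-reviewed labels
meta = LogisticRegression(max_iter=500).fit(Z[gold_idx], gold)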
Project description: Objective: Electronic health record (EHR)-based phenotyping is a crucial yet challenging problem in the biomedical field. Though clinicians typically determine patient-level diagnoses via manual chart review, the sheer volume and heterogeneity of EHR data render such tasks challenging, time-consuming, and prohibitively expensive, leading to a scarcity of clinical annotations in EHRs. Weakly supervised learning algorithms have been successfully applied to various EHR phenotyping problems because of their ability to leverage information from large quantities of unlabeled samples to better inform predictions based on a far smaller number of labeled patients. However, most weakly supervised methods face the challenge of choosing the right cutoff value to generate an optimal classifier. Furthermore, since they utilize only the most informative features (i.e., main ICD and NLP counts), they may fail for episodic phenotypes that cannot be consistently detected via ICD and NLP data. In this paper, we propose a label-efficient, weakly semi-supervised deep learning algorithm for EHR phenotyping (WSS-DL) that overcomes these limitations. Materials and methods: WSS-DL classifies patient-level disease status through a series of learning stages: (1) generating silver-standard labels, (2) deriving enhanced-silver-standard labels by fitting a weakly supervised deep learning model with silver-standard labels as outcomes and high-dimensional EHR features as input, and (3) obtaining the final prediction score and classifier by fitting a supervised learning model with a minimal number of gold-standard labels as the outcome, and the enhanced-silver-standard labels plus a minimal set of the most informative EHR features as input. To assess the generalizability of WSS-DL across phenotypes and medical institutions, we apply it to classify a total of 17 diseases, including both acute and chronic conditions, using EHR data from three healthcare systems. Additionally, we determine the minimum quantity of training labels WSS-DL requires to outperform existing supervised and semi-supervised phenotyping methods. Results: The proposed method, combining the strengths of deep learning and weakly semi-supervised learning, successfully leverages the crucial phenotyping information contained in the EHR features of unlabeled samples. Indeed, the deep learning model's ability to handle high-dimensional EHR features allows it to generate strong phenotype status predictions from silver-standard labels. These predictions, in turn, provide highly effective features in the final logistic regression stage, leading to high phenotyping accuracy with notably small labeled subsets (e.g., n = 40 labeled samples). Conclusion: Our method's high performance in EHR datasets with very small numbers of labels indicates its potential value in helping clinicians diagnose rare diseases as well as conditions susceptible to misdiagnosis.
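The three stages map naturally onto a small pipeline. The sketch below follows them on toy data; the silver-label rule, network size, and feature set are assumptions for illustration, not the authors' specification.

```python
# Hedged sketch of the three WSS-DL stages as described in the abstract
# (toy data and assumed hyperparameters).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 500))               # high-dimensional EHR features
icd_count = rng.poisson(2, size=2000)          # main ICD count (informative feature)

# Stage 1: silver-standard labels from a simple surrogate rule (assumed).
silver = (icd_count >= 3).astype(int)

# Stage 2: weakly supervised deep model fit to silver labels over all features;
# its probabilities serve as enhanced-silver-standard labels.
dl = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=50).fit(X, silver)
enhanced = dl.predict_proba(X)[:, 1]

# Stage 3: supervised model on a minimal gold-labeled subset, using the
# enhanced scores plus a few informative features as input.
gold_idx = rng.choice(2000, size=40, replace=False)
gold = silver[gold_idx]                        # stand-in for chart-reviewed labels
Z = np.column_stack([enhanced, icd_count])
final = LogisticRegression().fit(Z[gold_idx], gold)
scores = final.predict_proba(Z)[:, 1]          # final phenotype scores
```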
Project description: For brain-computer interfaces (BCIs), system calibration is a lengthy but necessary process for successful operation. Co-adaptive BCIs aim to shorten training and provide positive motivation by presenting feedback already at early stages: after just 5 min of gathering calibration data, the systems are able to provide feedback and engage users in a mutual learning process. In this work, we investigate whether the retraining stage of co-adaptive BCIs can be adapted to a semi-supervised concept, in which only a small amount of labeled data is available and all additional data must be labeled by the BCI itself. The aim of the current work was to evaluate whether a semi-supervised co-adaptive BCI could successfully compete with a supervised co-adaptive BCI model. In a supporting two-class BCI study based on motor imagery tasks (190 trials per condition), we evaluated each approach online in a separate group of 10 participants, while simulating the other approach offline in each group. Our results indicate that despite the lack of true labeled data, the semi-supervised BCI did not perform significantly worse (p > 0.05) than its supervised counterpart. We believe these findings contribute to developing BCIs for long-term use, where continuous adaptation becomes imperative for maintaining meaningful BCI performance.
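A minimal sketch of the semi-supervised retraining loop, assuming (for illustration) LDA on band-power features, a common motor-imagery pipeline not necessarily identical to the study's: the classifier labels each new block of trials itself and is refit on the growing, self-labeled set.

```python
# Hedged sketch of semi-supervised co-adaptive retraining (toy features;
# assumed LDA pipeline): the BCI pseudo-labels incoming trials and retrains.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)

def band_power_features(n, cls):
    # Toy stand-in for motor-imagery band-power features of class `cls`.
    return rng.normal(loc=cls, scale=1.5, size=(n, 8))

# Small seed of truly labeled calibration trials (~5 min of data).
X_lab = np.vstack([band_power_features(20, 0), band_power_features(20, 1)])
y_lab = np.array([0] * 20 + [1] * 20)
clf = LinearDiscriminantAnalysis().fit(X_lab, y_lab)

# Co-adaptive retraining: each new block is labeled by the BCI itself.
for block in range(5):
    X_new = np.vstack([band_power_features(10, 0), band_power_features(10, 1)])
    pseudo = clf.predict(X_new)                 # BCI-generated labels
    X_lab = np.vstack([X_lab, X_new])
    y_lab = np.concatenate([y_lab, pseudo])
    clf = LinearDiscriminantAnalysis().fit(X_lab, y_lab)
```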
Project description: Objective: High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly supervised learning methods have been proposed. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when only a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimating the receiver operating characteristic (ROC) parameters of PAs (e.g., sensitivity, specificity). Materials and methods: ssROC uses a small labeled dataset to nonparametrically impute the missing labels. The imputations are then used for ROC parameter estimation, yielding more precise estimates of PA performance than classical supervised ROC analysis (supROC) based on labeled data alone. We evaluated ssROC with synthetic, semi-synthetic, and EHR data from Mass General Brigham (MGB). Results: ssROC produced ROC parameter estimates with minimal bias and significantly lower variance than supROC in the simulated and semi-synthetic data. For the 5 PAs from MGB, the estimates from ssROC are on average 30% to 60% less variable than those from supROC. Discussion: ssROC enables precise evaluation of PA performance without demanding large volumes of labeled data. It is also easily implemented in open-source R software. Conclusion: When used in conjunction with weakly supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.
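The published implementation is in R; the Python toy below conveys the idea with a logistic-regression calibration standing in for the paper's nonparametric imputation: the small labeled subset is used to impute P(Y = 1 | score) for everyone, and ROC parameters are then estimated from the imputed soft labels.

```python
# Hedged sketch of the ssROC idea (imputation method simplified; not the
# published R implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(4)
n = 5000
y = rng.binomial(1, 0.3, size=n)                 # true phenotype (mostly unobserved)
score = rng.normal(loc=y, scale=1.0)             # PA score, informative of y
labeled = rng.choice(n, size=100, replace=False) # small chart-reviewed subset

# Impute P(Y=1 | score) from the labeled subset, apply to everyone.
imp = LogisticRegression().fit(score[labeled, None], y[labeled])
y_hat = imp.predict_proba(score[:, None])[:, 1]

# ROC estimation with imputed soft labels via sample weights: each sample
# contributes as a positive with weight y_hat and a negative with 1 - y_hat.
fpr, tpr, thr = roc_curve(
    np.r_[np.ones(n), np.zeros(n)],
    np.r_[score, score],
    sample_weight=np.r_[y_hat, 1 - y_hat],
)
```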
Project description: Unsupervised clustering models have been widely used for multimetric phenotyping of complex and heterogeneous diseases such as diabetes and obstructive sleep apnea (OSA), characterizing disease more precisely than simplistic conventional diagnostic standards. However, the number of clusters and the key phenotypic features have typically been selected subjectively, reducing the reliability of the phenotyping results. Here, to minimize such subjective decisions and obtain highly confident phenotypes, we develop a multimetric phenotyping framework that combines supervised and unsupervised machine learning, clustering 2277 OSA patients into six phenotypes based on their multidimensional polysomnography (PSG) data. Importantly, these new phenotypes show statistically different comorbidity development for OSA-related cardio-neuro-metabolic diseases, unlike the conventional phenotypes based on the single-metric apnea-hypopnea index. Furthermore, the key features of highly comorbid phenotypes were identified through supervised learning rather than subjective choice. These results can also be used to automatically phenotype new patients and predict their comorbidity risks solely from their PSG data. The framework, combining unsupervised and supervised machine learning, can likewise be applied to other complex, heterogeneous diseases to phenotype patients and identify the important features of high-risk phenotypes.
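A hedged sketch of the combined unsupervised/supervised idea on toy PSG data (the paper's actual clustering algorithm and feature-ranking method are not specified here): choose the number of clusters by an objective criterion rather than by hand, then let a supervised model rank the features that separate a phenotype of interest.

```python
# Hedged sketch: objective cluster-count selection plus supervised feature
# ranking (toy data; assumed k-means/silhouette/random-forest choices).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
psg = rng.normal(size=(2277, 30))          # multidimensional PSG metrics (toy)

# Unsupervised stage: choose the number of clusters by silhouette, not by hand.
scores = {k: silhouette_score(psg, KMeans(k, n_init=10, random_state=0)
                              .fit_predict(psg)) for k in range(2, 9)}
best_k = max(scores, key=scores.get)
labels = KMeans(best_k, n_init=10, random_state=0).fit_predict(psg)

# Supervised stage: rank features that separate one phenotype from the rest;
# the same classifier can assign new patients to phenotypes from PSG alone.
target = (labels == 0).astype(int)         # e.g., a highly comorbid phenotype
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(psg, target)
key_features = np.argsort(rf.feature_importances_)[::-1][:5]
```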
Project description: Reaching the performance of fully supervised learning with unlabeled data and only one labeled sample per class would be ideal for deep learning applications. We demonstrate for the first time the potential of building one-shot semi-supervised (BOSS) learning on CIFAR-10 and SVHN, attaining test accuracies comparable to fully supervised learning. Our method combines class prototype refining, class balancing, and self-training. A good prototype choice is essential, and we propose a technique for obtaining iconic examples. In addition, we demonstrate that class-balancing methods substantially improve accuracy in semi-supervised learning, to levels that allow self-training to reach fully supervised performance. Our experiments demonstrate the value of computing and analyzing test accuracies for every class rather than only a total test accuracy. We show that our BOSS methodology can reach a total test accuracy of up to 95% on CIFAR-10 with only one labeled sample per class (compared to 94.5% for fully supervised training). Similarly, on SVHN we obtain a test accuracy of 97.8%, compared to 98.27% for fully supervised training. Rigorous empirical evaluations provide evidence that labeling large datasets is not necessary for training deep neural networks. Our code is available at https://github.com/lnsmith54/BOSS to facilitate replication.
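One ingredient named above, class balancing during self-training, can be sketched as a per-class quota on pseudo-labels; the quota rule below is illustrative, not BOSS's exact scheme.

```python
# Hedged sketch of class-balanced pseudo-label selection: take the top-q
# most confident pseudo-labels per class so no class dominates retraining.
import numpy as np

def balanced_pseudo_labels(probs: np.ndarray, quota: int):
    """probs: (n_unlabeled, n_classes) model confidences; returns indices
    and labels of the selected pseudo-labeled samples."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep_idx, keep_lab = [], []
    for c in range(probs.shape[1]):
        members = np.where(preds == c)[0]
        top = members[np.argsort(conf[members])[::-1][:quota]]  # best per class
        keep_idx.extend(top)
        keep_lab.extend([c] * len(top))
    return np.array(keep_idx), np.array(keep_lab)

# Example: 1000 unlabeled samples, 10 classes, 20 pseudo-labels per class.
rng = np.random.default_rng(6)
probs = rng.dirichlet(np.ones(10), size=1000)
idx, labels = balanced_pseudo_labels(probs, quota=20)
```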
Project description: Many biological problems are understudied due to experimental limitations and human biases. Although deep learning is promising for accelerating scientific discovery, its power is compromised when applied to problems with scarce labeled data and data distribution shifts. We develop a deep learning framework, Meta Model Agnostic Pseudo Label Learning (MMAPLE), to address these challenges by effectively exploiting out-of-distribution (OOD) unlabeled data where conventional transfer learning fails. The uniqueness of MMAPLE lies in integrating the concepts of meta-learning, transfer learning, and semi-supervised learning into a unified framework. The power of MMAPLE is demonstrated in three applications in an OOD setting where the chemicals or proteins in unseen data differ dramatically from those in the training data: predicting drug-target interactions, hidden human metabolite-enzyme interactions, and understudied interspecies microbiome metabolite-human receptor interactions. MMAPLE achieves an 11% to 242% improvement in precision-recall on multiple OOD benchmarks over various base models. Using MMAPLE, we reveal novel interspecies metabolite-protein interactions that are validated by activity assays and fill in missing links in microbiome-human interactions. MMAPLE is a general framework for exploring previously unrecognized biological domains beyond the reach of present experimental and computational techniques.
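A heavily simplified, REINFORCE-style sketch of the teacher-student meta pseudo-label idea that frameworks in this family build on (not MMAPLE's actual algorithm): the teacher labels OOD unlabeled data, the student learns from those pseudo-labels, and the teacher is rewarded when its labels improve the student's performance on labeled data.

```python
# Hedged sketch of a meta pseudo-label loop (toy linear models; assumed
# REINFORCE-style teacher update, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher, student = nn.Linear(16, 2), nn.Linear(16, 2)
opt_t = torch.optim.SGD(teacher.parameters(), lr=1e-2)
opt_s = torch.optim.SGD(student.parameters(), lr=1e-2)

x_lab, y_lab = torch.randn(64, 16), torch.randint(0, 2, (64,))
x_ood = torch.randn(256, 16) + 3.0  # unlabeled, distribution-shifted batch

for step in range(200):
    # Teacher proposes (samples) pseudo labels for the OOD batch.
    logits_t = teacher(x_ood)
    pseudo = torch.distributions.Categorical(logits=logits_t).sample()

    # Student learns from the pseudo labels.
    loss_before = F.cross_entropy(student(x_lab), y_lab).item()
    loss_s = F.cross_entropy(student(x_ood), pseudo)
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    loss_after = F.cross_entropy(student(x_lab), y_lab).item()

    # Teacher is rewarded when its pseudo labels improved the student's
    # labeled-data loss (REINFORCE-style meta update).
    reward = loss_before - loss_after
    log_prob = F.log_softmax(logits_t, dim=1).gather(1, pseudo[:, None]).mean()
    loss_t = -reward * log_prob
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
```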
Project description: Traffic sign recognition is a classification problem that challenges computer vision and machine learning algorithms. Although both computer vision and machine learning techniques have been continually improved to solve this problem, the rapid rise in the amount of unlabeled traffic-sign data has made it even more challenging. Large-scale data collation and labeling are tedious and expensive tasks, demanding considerable time, expert knowledge, and financial resources to satisfy the data hunger of deep neural networks. In addition, unbalanced data pose a further challenge for computer vision and machine learning algorithms seeking better performance. These problems raise the need for algorithms that can fully exploit a large amount of unlabeled data, use a small number of labeled samples, and remain robust to data imbalance in order to build an efficient, high-quality classifier. In this work, we propose a novel semi-supervised classification technique that is robust to small and unbalanced data. The framework integrates weakly supervised learning and self-training with self-paced learning to generate attention maps that augment the training set, and utilizes a novel pseudo-label generation and selection algorithm to generate and select pseudo-labeled samples. The method improves performance by: (1) normalizing class-wise confidence levels to prevent the model from ignoring hard-to-learn samples, thereby addressing the imbalanced-data problem; (2) jointly learning a model and optimizing the pseudo-labels generated on unlabeled data; and (3) enlarging the training set to satisfy the data hunger of deep learning models. Extensive evaluations on two public traffic sign recognition datasets demonstrate the effectiveness of the proposed technique and provide a potential solution for practical applications.
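Ingredient (1), class-wise confidence normalization, can be sketched as follows; the normalization rule is an assumption for illustration, not the paper's exact formula. Dividing each sample's confidence by its predicted class's current maximum keeps well-learned classes from crowding out hard-to-learn ones under a single selection threshold.

```python
# Hedged sketch of class-wise confidence normalization for pseudo-label
# selection (assumed normalization rule).
import numpy as np

def select_pseudo_labels(probs: np.ndarray, threshold: float = 0.9):
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    # Per-class normalizer: the highest confidence currently seen per class.
    norm = np.array([conf[preds == c].max() if np.any(preds == c) else 1.0
                     for c in range(probs.shape[1])])
    conf_normalized = conf / norm[preds]
    keep = conf_normalized >= threshold       # one threshold, class-fair
    return np.where(keep)[0], preds[keep]

rng = np.random.default_rng(7)
probs = rng.dirichlet(np.full(43, 0.5), size=2000)  # e.g., 43 sign classes
idx, labels = select_pseudo_labels(probs)
```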