Project description: The use of electronic medical record data linked to biological specimens in health care settings is expected to enable cost-effective and rapid genomic analyses. Here, we present a model that highlights potential advantages for genomic discovery and describe the operational infrastructure that facilitated multiple simultaneous discovery efforts.
Project description: In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g., the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time-consuming, and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.
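Below is a minimal sketch of the weak-supervision idea described above: ontology lookups and expert rules act as labeling functions whose votes are aggregated into training labels. The toy ontology entries, function names, and majority-vote aggregation are illustrative assumptions, not Trove's actual API.

    # Sketch of ontology-driven weak supervision in the spirit of Trove
    # (illustrative only; the toy "ontologies" below are assumptions).
    from collections import Counter

    # term -> entity label (1 = disorder, 0 = not a disorder)
    UMLS_TERMS = {"fever": 1, "cough": 1, "aspirin": 0}
    CUSTOM_RULES = {"shortness of breath": 1}

    def lf_umls(span):
        """Labeling function backed by a medical ontology; abstains with None."""
        return UMLS_TERMS.get(span.lower())

    def lf_rules(span):
        """Expert-written rule; abstains with None."""
        return CUSTOM_RULES.get(span.lower())

    def weak_label(span, lfs=(lf_umls, lf_rules)):
        """Combine labeling-function votes by majority; None if all abstain."""
        votes = [lf(span) for lf in lfs if lf(span) is not None]
        return Counter(votes).most_common(1)[0][0] if votes else None

    # Spans with non-None weak labels can train a classifier with no hand labeling.
    print(weak_label("fever"))  # -> 1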
Project description: Health-related quality of life (HRQOL) is an important variable used for prognosis, for measuring outcomes in clinical studies, and for quality improvement. We explore the use of a general-purpose natural language processing system, MetaMap, in combination with Support Vector Machines (SVM) for predicting patient responses on standardized HRQOL assessment instruments from the text of physicians' notes. We surveyed 669 patients in the Mayo Clinic diabetes registry using two instruments designed to assess functioning: EuroQoL5D and SF36/SD6. Clinical notes for these patients were represented as sets of medical concepts using MetaMap. SVM classifiers were trained using various feature selection strategies. The best concordance between the HRQOL instruments and automatic classification was achieved along the pain dimension (positive agreement 0.76, negative agreement 0.78, kappa 0.54) using MetaMap. We conclude that clinicians' notes may be used to develop a surrogate measure of patients' HRQOL status.
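The pipeline above reduces each note to a bag of MetaMap concepts and trains an SVM on them. The following sketch shows that setup with scikit-learn; the concept identifiers and labels are toy placeholders, not the study's data or exact configuration.

    # Hedged sketch: notes as bags of MetaMap concept IDs feeding a linear SVM.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Each document = space-joined concept IDs extracted from a note (toy data).
    notes_as_concepts = ["C0030193 C0011849", "C0011849 C0020538", "C0030193"]
    pain_labels = [1, 0, 1]  # e.g., impaired vs. not impaired on the pain dimension

    vectorizer = CountVectorizer(binary=True)  # presence/absence of each concept
    X = vectorizer.fit_transform(notes_as_concepts)

    clf = LinearSVC().fit(X, pain_labels)
    print(clf.predict(vectorizer.transform(["C0030193"])))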
Project description: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record. The training set consisted of 2792 clinical notes and associated lab values; Test Set 1 included 1749 clinical notes and associated lab values; Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) was used to analyze the text and transform it into informative features to be combined with relevant lab values. Experiments over a range of machine learning algorithms and features were conducted. The best-performing combination was linear-kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier (CUI) features, feature selection, and lab values. The area under the receiver operating characteristic curve (AUC) was 0.831 (σ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, σ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low), and benefited from the inclusion of laboratory data on inflammatory markers. Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.
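A hedged sketch of the best-performing configuration described above: binary CUI features with chi-squared feature selection, concatenated with lab values and fed to a linear-kernel SVM. The CUIs, lab values, and the number of selected features are illustrative assumptions, not the study's actual feature set.

    # Combining selected CUI features with lab values for a linear SVM (toy data).
    import numpy as np
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.svm import LinearSVC

    docs = ["C0003873 C0015967", "C0003873", "C0015967 C0023508"]  # CUIs per note
    labs = np.array([[55.0], [3.0], [40.0]])                       # e.g., ESR values
    y = [1, 0, 1]                                                  # high vs. low activity

    X_cui = CountVectorizer(binary=True).fit_transform(docs)
    X_sel = SelectKBest(chi2, k=2).fit_transform(X_cui, y)  # feature selection
    X = hstack([X_sel, labs])                                # text + lab features

    model = LinearSVC().fit(X, y)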
Project description: Questions are often lengthy and difficult to understand because they tend to contain peripheral information. Previous work relies on costly human-annotated data or question-title pairs. In this work, we propose a distant supervision framework that can train a question summarizer without annotation costs or question-title pairs, where sentences are automatically annotated by means of heuristic rules. The key idea is that a single-sentence question tends to have a summary-like property. We empirically show that models trained with our framework perform competitively with supervised models without requiring a costly human-annotated dataset.
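The heuristic-annotation idea can be made concrete with a short sketch: single-sentence questions yield positive, summary-like examples, while sentences in longer questions are labeled by a crude rule. The rules here are assumed simplifications, not the paper's exact heuristics.

    # Silver-labeling sentences for question summarization (assumed rules).
    import re

    def split_sentences(text):
        return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

    def silver_label_question(question):
        """Return (sentence, label) pairs: 1 = summary-like, 0 = peripheral."""
        sents = split_sentences(question)
        if len(sents) == 1:
            return [(sents[0], 1)]  # a one-sentence question is summary-like
        # crude assumption: the final question-marked sentence carries the ask
        return [(s, int(s.endswith("?") and s is sents[-1])) for s in sents]

    print(silver_label_question("I have had headaches for weeks. What could cause this?"))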
Project description: We consider the task of Medical Concept Normalization (MCN), which aims to map informal medical phrases such as "loosing weight" to formal medical concepts such as "Weight loss". Deep learning models have shown high performance across various MCN datasets containing a small number of target concepts along with an adequate number of training examples per concept. However, scaling these models to millions of medical concepts entails the creation of much larger datasets, which is cost- and effort-intensive. Recent work has shown that training MCN models using automatically labeled examples extracted from medical knowledge bases partially alleviates this problem. We extend this idea by computationally creating a distant dataset from patient discussion forums. We extract informal medical phrases and medical concepts from these forums using a synthetically trained classifier and an off-the-shelf medical entity linker, respectively. We use pretrained sentence encoding models to find the k nearest phrases corresponding to each medical concept. These mappings are used in combination with the examples obtained from medical knowledge bases to train an MCN model. Our approach outperforms the previous state of the art by 15.9% and 17.1% classification accuracy across two datasets while avoiding manual labeling.
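The k-nearest-phrase step might look like the following sketch: phrases and concept names are embedded with a pretrained sentence encoder and matched by cosine similarity. The encoder checkpoint, toy phrases, and k value are assumptions; the paper's models and data will differ.

    # Pairing forum phrases with concepts via embedding nearest neighbors.
    from sklearn.neighbors import NearestNeighbors
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

    forum_phrases = ["loosing weight", "cant sleep at night", "heart racing"]
    concepts = ["Weight loss", "Insomnia", "Tachycardia"]

    P = encoder.encode(forum_phrases, normalize_embeddings=True)
    C = encoder.encode(concepts, normalize_embeddings=True)

    # k nearest phrases per concept -> distant (concept, phrase) training pairs
    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(P)
    _, idx = nn.kneighbors(C)
    pairs = [(c, forum_phrases[i]) for c, row in zip(concepts, idx) for i in row]
    print(pairs)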
Project description: Accurately identifying distant recurrences in breast cancer from the Electronic Health Record (EHR) is important for both clinical care and secondary analysis. Although multiple applications have been developed for computational phenotyping in breast cancer, distant recurrence identification still relies heavily on manual chart review. In this study, we aim to develop a model that identifies distant recurrences in breast cancer using clinical narratives and structured data from the EHR. We applied MetaMap to extract features from clinical narratives and also retrieved structured clinical data from the EHR. Using these features, we trained a support vector machine model to identify distant recurrences in breast cancer patients. We trained the model using 1,396 double-annotated subjects and validated it using 599 double-annotated subjects. In addition, we validated the model on a set of 4,904 single-annotated subjects as a generalization test. In the held-out test and generalization test, we obtained F-measure scores of 0.78 and 0.74 and area under the curve (AUC) scores of 0.95 and 0.93, respectively. To explore the representation learning utility of deep neural networks, we designed multiple convolutional neural networks and multilayer neural networks to identify distant recurrences. Using the same test set and generalization test set, we obtained F-measure scores of 0.79 ± 0.02 and 0.74 ± 0.004 and AUC scores of 0.95 ± 0.002 and 0.95 ± 0.01, respectively. Our model can accurately and efficiently identify distant recurrences in breast cancer by combining features extracted from unstructured clinical narratives with structured clinical data.
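As an illustration of the convolutional approach mentioned above, here is a minimal PyTorch text CNN with max-pooling over time. All sizes, kernel widths, and the pooling design are assumptions rather than the authors' exact architectures.

    # Minimal text-CNN sketch for note classification (assumed architecture).
    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size=5000, emb_dim=100, n_filters=64,
                     kernel_sizes=(3, 4, 5), n_classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.convs = nn.ModuleList(
                nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
            self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

        def forward(self, token_ids):                  # (batch, seq_len)
            x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
            pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
            return self.fc(torch.cat(pooled, dim=1))   # recurrence logits

    model = TextCNN()
    logits = model(torch.randint(0, 5000, (8, 200)))   # 8 notes, 200 tokens each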
Project description: Objectives: Electronic health records (EHR) are commonly used for the identification of novel risk factors for disease, often referred to as an association study. A major challenge to EHR-based association studies is phenotyping error in EHR-derived outcomes. A manual chart review of phenotypes is necessary for unbiased evaluation of risk factor associations. However, this process is time-consuming and expensive. The objective of this paper is to develop an outcome-dependent sampling approach for designing manual chart review, where EHR-derived phenotypes can be used to guide the selection of charts to be reviewed in order to maximize statistical efficiency in the subsequent estimation of risk factor associations. Materials and Methods: After applying outcome-dependent sampling, an augmented estimator can be constructed by optimally combining the chart-reviewed phenotypes from the selected patients with the error-prone EHR-derived phenotype. We conducted simulation studies to evaluate the proposed method and applied our method to data on colon cancer recurrence in a cohort of patients treated for a primary colon cancer in the Kaiser Permanente Washington (KPW) healthcare system. Results: Simulations verify the coverage probability of the proposed method and show that, when disease prevalence is less than 30%, the proposed method has smaller variance than an existing method where the validation set for chart review is uniformly sampled. In addition, from a design perspective, the proposed method is able to achieve the same statistical power with 50% fewer charts to be validated than the uniform sampling method, thus leading to a substantial efficiency gain in chart review. These findings were also confirmed by the application of the competing methods to the KPW colon cancer data. Discussion: Our simulation studies and analysis of data from KPW demonstrate that, compared to an existing uniform sampling method, the proposed outcome-dependent method can lead to a more efficient chart review sampling design and unbiased association estimates with higher statistical efficiency. Conclusion: The proposed method not only optimally combines phenotypes from chart review with EHR-derived phenotypes but also suggests an efficient design for conducting chart review, with the goal of improving the efficiency of estimated risk factor associations using EHR data.
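A toy simulation can make the outcome-dependent design concrete: chart-review slots are allocated by EHR-derived phenotype stratum rather than uniformly, and inverse-selection-probability weights record the design. The sampling fractions and weighting below are expository assumptions, not the paper's augmented estimator.

    # Toy outcome-dependent sampling for chart review (assumed fractions).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    ehr_phenotype = rng.binomial(1, 0.10, n)   # error-prone EHR-derived outcome

    budget = 500
    # Oversample apparent cases: spend half the budget in each phenotype
    # stratum instead of uniformly sampling 500 of 10,000 charts.
    cases = np.flatnonzero(ehr_phenotype == 1)
    controls = np.flatnonzero(ehr_phenotype == 0)
    review_ids = np.concatenate([
        rng.choice(cases, budget // 2, replace=False),
        rng.choice(controls, budget // 2, replace=False),
    ])

    # Inverse-probability-of-selection weights keep downstream association
    # estimates unbiased despite the non-uniform design.
    p_sel = np.where(ehr_phenotype == 1, (budget / 2) / len(cases),
                     (budget / 2) / len(controls))
    weights = 1.0 / p_sel[review_ids]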
Project description: Background: Panicle density of cereal crops such as wheat and sorghum is one of the main components plant breeders and agronomists use to understand the yield of their crops. To phenotype panicle density effectively, researchers agree there is a significant need for computer vision-based object detection techniques. Especially in recent times, research in deep learning-based object detection has shown promising results in various agricultural studies. However, training such systems usually requires large amounts of bounding-box-labeled data. Since crops vary by both environmental and genetic conditions, acquiring large labeled image datasets for each crop is expensive and time-consuming. Thus, to catalyze the widespread use of automatic object detection for crop phenotyping, a cost-effective method to develop such automated systems is essential. Results: We propose a point-supervision-based active learning approach for panicle detection in cereal crops. In our approach, the model constantly interacts with a human annotator by iteratively querying the labels for only the most informative images, as opposed to all images in a dataset. Our query method is specifically designed for cereal crops, which usually tend to have panicles with low variance in appearance. Our method reduces labeling costs by intelligently leveraging low-cost weak labels (object centers) to pick the most informative images for which strong labels (bounding boxes) are required. We show promising results on two publicly available cereal crop datasets: Sorghum and Wheat. On Sorghum, 6 variants of our proposed method outperform the best baseline method with more than 55% savings in labeling time. Similarly, on Wheat, 3 variants of our proposed method outperform the best baseline method with more than 50% savings in labeling time. Conclusion: We proposed a cost-effective method to train reliable panicle detectors for cereal crops. A low-cost panicle detection method for cereal crops is highly beneficial to both breeders and agronomists. Plant breeders can obtain quick crop yield estimates to make important crop management decisions. Similarly, real-time visual crop analysis is valuable for researchers analyzing the crop's response to various experimental conditions.
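The query step might be sketched as follows: images whose cheap point labels (panicle centers) are least consistent with the current detector's predicted boxes are prioritized for full bounding-box annotation. The mismatch score and batch interface are assumed simplifications of the paper's crop-specific query method.

    # Point-supervised active learning query step (assumed scoring rule).
    def count_mismatch(predicted_boxes, panicle_centers):
        """Informativeness = panicle centers not covered by any predicted box."""
        def covered(cx, cy):
            return any(x1 <= cx <= x2 and y1 <= cy <= y2
                       for (x1, y1, x2, y2) in predicted_boxes)
        return sum(not covered(cx, cy) for (cx, cy) in panicle_centers)

    def query_batch(unlabeled, detector, batch_size=10):
        """Pick images where weak (point) labels flag the most detector misses."""
        scored = [(count_mismatch(detector(img), centers), img)
                  for img, centers in unlabeled]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [img for _, img in scored[:batch_size]]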