Project description:Introduction: This study investigates whether it is possible to predict a final diagnosis based on a written nephropathological description—as a surrogate for image analysis—using various NLP methods. Methods: For this work, 1107 unlabelled nephropathological reports were included. (i) First, after separating each report into its microscopic description and diagnosis section, the diagnosis sections were clustered unsupervised to less than 20 diagnostic groups using different clustering techniques. (ii) Second, different text classification methods were used to predict the diagnostic group based on the microscopic description section. Results: The best clustering results (i) could be achieved with HDBSCAN, using BoW-based feature extraction methods. Based on keywords, these clusters can be mapped to certain diagnostic groups. A transformer encoder-based approach as well as an SVM worked best regarding diagnosis prediction based on the histomorphological description (ii). Certain diagnosis groups reached F1-scores of up to 0.892 while others achieved weak classification metrics. Conclusion: While textual morphological description alone enables retrieving the correct diagnosis for some entities, it does not work sufficiently for other entities. This is in accordance with a previous image analysis study on glomerular change patterns, where some diagnoses are associated with one pattern, but for others, there exists a complex pattern combination.
Project description:Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages, where design options are restricted either in terms of manpower and financial means required to produce appropriate sentiment analysis resources, or in terms of available language tools, or both. In this paper, we present a versatile approach to addressing this issue, based on multiple interpretations of sentiment labels that encode information regarding the polarity, subjectivity, and ambiguity of a text, as well as the presence of sarcasm or a mixture of sentiments. We demonstrate its use on Serbian, a resource-limited language, via the creation of a main sentiment analysis dataset focused on movie comments, and two smaller datasets belonging to the movie and book domains. In addition to measuring the quality of the annotation process, we propose a novel metric to validate its cost-effectiveness. Finally, the practicality of our approach is further validated by training, evaluating, and determining the optimal configurations of several different kinds of machine-learning models on a range of sentiment classification tasks using the produced dataset.
Project description:Microarray probes and reads from massively parallel sequencing technologies are two most widely used genomic tags for a transcriptome study. Names and underlying technologies might differ, but expression technologies share a common objective-to obtain mRNA abundance values at the gene level, with high sensitivity and specificity. However, the initial tag annotation becomes obsolete as more insight is gained into biological references (genome, transcriptome, SNP, etc.). While novel alignment algorithms for short reads are being released every month, solutions for rapid annotation of tags are rare. We have developed a generic matching algorithm that uses genomic positions for rapid custom-annotation of tags with a time complexity O(nlogn). We demonstrate our algorithm on the custom annotation of Illumina massively parallel sequencing reads and Affymetrix microarray probes and identification of alternatively spliced regions.
Project description:BackgroundElectronic health records (EHRs) with large sample sizes and rich information offer great potential for dementia research, but current methods of phenotyping cognitive status are not scalable.ObjectiveThe aim of this study was to evaluate whether natural language processing (NLP)-powered semiautomated annotation can improve the speed and interrater reliability of chart reviews for phenotyping cognitive status.MethodsIn this diagnostic study, we developed and evaluated a semiautomated NLP-powered annotation tool (NAT) to facilitate phenotyping of cognitive status. Clinical experts adjudicated the cognitive status of 627 patients at Mass General Brigham (MGB) health care, using NAT or traditional chart reviews. Patient charts contained EHR data from two data sets: (1) records from January 1, 2017, to December 31, 2018, for 100 Medicare beneficiaries from the MGB Accountable Care Organization and (2) records from 2 years prior to COVID-19 diagnosis to the date of COVID-19 diagnosis for 527 MGB patients. All EHR data from the relevant period were extracted; diagnosis codes, medications, and laboratory test values were processed and summarized; clinical notes were processed through an NLP pipeline; and a web tool was developed to present an integrated view of all data. Cognitive status was rated as cognitively normal, cognitively impaired, or undetermined. Assessment time and interrater agreement of NAT compared to manual chart reviews for cognitive status phenotyping was evaluated.ResultsNAT adjudication provided higher interrater agreement (Cohen κ=0.89 vs κ=0.80) and significant speed up (time difference mean 1.4, SD 1.3 minutes; P<.001; ratio median 2.2, min-max 0.4-20) over manual chart reviews. There was moderate agreement with manual chart reviews (Cohen κ=0.67). In the cases that exhibited disagreement with manual chart reviews, NAT adjudication was able to produce assessments that had broader clinical consensus due to its integrated view of highlighted relevant information and semiautomated NLP features.ConclusionsNAT adjudication improves the speed and interrater reliability for phenotyping cognitive status compared to manual chart reviews. This study underscores the potential of an NLP-based clinically adjudicated method to build large-scale dementia research cohorts from EHRs.
Project description:Prior research found reliable and considerably strong effects of semantic achievement primes on subsequent performance. In order to simulate a more natural priming condition to better understand the practical relevance of semantic achievement priming effects, running texts of schoolbook excerpts with and without achievement primes were used as priming stimuli. Additionally, we manipulated the achievement context; some subjects received no feedback about their achievement and others received feedback according to a social or individual reference norm. As expected, we found a reliable (albeit small) positive behavioral priming effect of semantic achievement primes on achievement in math (Experiment 1) and language tasks (Experiment 2). Feedback moderated the behavioral priming effect less consistently than we expected. The implication that achievement primes in schoolbooks can foster performance is discussed along with general theoretical implications.
Project description:Electronic medical records (EMRs) are increasingly repurposed for activities beyond clinical care, such as to support translational research and public policy analysis. To mitigate privacy risks, healthcare organizations (HCOs) aim to remove potentially identifying patient information. A substantial quantity of EMR data is in natural language form and there are concerns that automated tools for detecting identifiers are imperfect and leak information that can be exploited by ill-intentioned data recipients. Thus, HCOs have been encouraged to invest as much effort as possible to find and detect potential identifiers, but such a strategy assumes the recipients are sufficiently incentivized and capable of exploiting leaked identifiers. In practice, such an assumption may not hold true and HCOs may overinvest in de-identification technology. The goal of this study is to design a natural language de-identification framework, rooted in game theory, which enables an HCO to optimize their investments given the expected capabilities of an adversarial recipient.We introduce a Stackelberg game to balance risk and utility in natural language de-identification. This game represents a cost-benefit model that enables an HCO with a fixed budget to minimize their investment in the de-identification process. We evaluate this model by assessing the overall payoff to the HCO and the adversary using 2100 clinical notes from Vanderbilt University Medical Center. We simulate several policy alternatives using a range of parameters, including the cost of training a de-identification model and the loss in data utility due to the removal of terms that are not identifiers. In addition, we compare policy options where, when an attacker is fined for misuse, a monetary penalty is paid to the publishing HCO as opposed to a third party (e.g., a federal regulator).Our results show that when an HCO is forced to exhaust a limited budget (set to $2000 in the study), the precision and recall of the de-identification of the HCO are 0.86 and 0.8, respectively. A game-based approach enables a more refined cost-benefit tradeoff, improving both privacy and utility for the HCO. For example, our investigation shows that it is possible for an HCO to release the data without spending all their budget on de-identification and still deter the attacker, with a precision of 0.77 and a recall of 0.61 for the de-identification. There also exist scenarios in which the model indicates an HCO should not release any data because the risk is too great. In addition, we find that the practice of paying fines back to a HCO (an artifact of suing for breach of contract), as opposed to a third party such as a federal regulator, can induce an elevated level of data sharing risk, where the HCO is incentivized to bait the attacker to elicit compensation.A game theoretic framework can be applied in leading HCO's to optimized decision making in natural language de-identification investments before sharing EMR data.
Project description:Accurately selecting relevant alleles in large sequencing experiments remains technically challenging. Bystro ( https://bystro.io/ ) is the first online, cloud-based application that makes variant annotation and filtering accessible to all researchers for terabyte-sized whole-genome experiments containing thousands of samples. Its key innovation is a general-purpose, natural-language search engine that enables users to identify and export alleles and samples of interest in milliseconds. The search engine dramatically simplifies complex filtering tasks that previously required programming experience or specialty command-line programs. Critically, Bystro's annotation and filtering capabilities are orders of magnitude faster than previous solutions, saving weeks of processing time for large experiments.
Project description:The gap between domain experts and natural language processing expertise is a barrier to extracting understanding from clinical text. We describe a prototype tool for interactive review and revision of natural language processing models of binary concepts extracted from clinical notes. We evaluated our prototype in a user study involving 9 physicians, who used our tool to build and revise models for 2 colonoscopy quality variables. We report changes in performance relative to the quantity of feedback. Using initial training sets as small as 10 documents, expert review led to final F1scores for the "appendiceal-orifice" variable between 0.78 and 0.91 (with improvements ranging from 13.26% to 29.90%). F1for "biopsy" ranged between 0.88 and 0.94 (-1.52% to 11.74% improvements). The average System Usability Scale score was 70.56. Subjective feedback also suggests possible design improvements.
Project description:Large volumes of data are continuously generated from clinical notes and diagnostic studies catalogued in electronic health records (EHRs). Echocardiography is one of the most commonly ordered diagnostic tests in cardiology. This study sought to explore the feasibility and reliability of using natural language processing (NLP) for large-scale and targeted extraction of multiple data elements from echocardiography reports. An NLP tool, EchoInfer, was developed to automatically extract data pertaining to cardiovascular structure and function from heterogeneously formatted echocardiographic data sources. EchoInfer was applied to echocardiography reports (2004 to 2013) available from 3 different on-going clinical research projects. EchoInfer analyzed 15,116 echocardiography reports from 1684 patients, and extracted 59 quantitative and 21 qualitative data elements per report. EchoInfer achieved a precision of 94.06%, a recall of 92.21%, and an F1-score of 93.12% across all 80 data elements in 50 reports. Physician review of 400 reports demonstrated that EchoInfer achieved a recall of 92-99.9% and a precision of >97% in four data elements, including three quantitative and one qualitative data element. Failure of EchoInfer to correctly identify or reject reported parameters was primarily related to non-standardized reporting of echocardiography data. EchoInfer provides a powerful and reliable NLP-based approach for the large-scale, targeted extraction of information from heterogeneous data sources. The use of EchoInfer may have implications for the clinical management and research analysis of patients undergoing echocardiographic evaluation.