Dataset Information

Identifying medical terms in patient-authored text: a crowdsourcing-based approach.

ABSTRACT: As people increasingly engage in online health-seeking behavior and contribute to health-oriented websites, the volume of medical text authored by patients and other medical novices grows rapidly. However, we lack an effective method for automatically identifying medical terms in patient-authored text (PAT). We demonstrate that crowdsourcing PAT medical term identification tasks to non-experts is a viable method for creating large, accurately-labeled PAT datasets; moreover, such datasets can be used to train classifiers that outperform existing medical term identification tools. To evaluate the viability of using non-expert crowds to label PAT, we compare expert (registered nurses) and non-expert (Amazon Mechanical Turk workers; Turkers) responses to a PAT medical term identification task. Next, we build a crowd-labeled dataset comprising 10 000 sentences from MedHelp. We train two models on this dataset and evaluate their performance, as well as that of MetaMap, Open Biomedical Annotator (OBA), and NaCTeM's TerMINE, against two gold standard datasets: one from MedHelp and the other from CureTogether. When aggregated according to a corroborative voting policy, Turker responses predict expert responses with an F1 score of 84%. A conditional random field (CRF) trained on 10 000 crowd-labeled MedHelp sentences achieves an F1 score of 78% against the CureTogether gold standard, widely outperforming OBA (47%), TerMINE (43%), and MetaMap (39%). A failure analysis of the CRF suggests that misclassified terms are likely to be either generic or rare. Our results show that combining statistical models sensitive to sentence-level context with crowd-labeled data is a scalable and effective technique for automatically identifying medical terms in PAT.
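The abstract mentions aggregating Turker responses with a corroborative voting policy and scoring them against expert labels with F1. The exact policy is not spelled out here, so the following is only a minimal sketch under one assumption: a token counts as a medical term when at least `min_votes` workers marked it. The function names, the `min_votes` threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def aggregate_votes(worker_labels, min_votes=2):
    """Keep token indices marked by at least `min_votes` workers.

    worker_labels: list of sets, one per worker, holding the token
    indices that worker marked as medical terms (assumed format).
    """
    counts = Counter(i for labels in worker_labels for i in labels)
    return {i for i, c in counts.items() if c >= min_votes}


def f1_score(predicted, gold):
    """F1 between two sets of token indices."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three workers and one expert on one sentence.
workers = [{0, 3, 4}, {3, 4}, {3, 7}]
expert = {3, 4, 5}

crowd = aggregate_votes(workers, min_votes=2)
score = f1_score(crowd, expert)
print(sorted(crowd), round(score, 2))
```

Raising `min_votes` trades recall for precision, which is the kind of trade-off a voting threshold would need to balance when crowd labels stand in for expert labels.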

SUBMITTER: MacLean DL

PROVIDER: S-EPMC3822103 | biostudies-literature | 2013 Nov-Dec

REPOSITORIES: biostudies-literature

Publications

Identifying medical terms in patient-authored text: a crowdsourcing-based approach.

MacLean DL, Heer J

Journal of the American Medical Informatics Association (JAMIA), 2013 May 5, issue 6

PMID: 23645553

Similar Datasets

Project description:
Background: Enrollment in pregnancy registries is challenging despite substantial awareness-raising activities, generally resulting in low recruitment owing to limited safety data. Understanding patient and physician awareness of and attitudes toward pregnancy registries is needed to facilitate enrollment. Crowdsourcing, in which services, ideas, or content are obtained by soliciting contributions from a large group of people using web-based platforms, has shown promise for improving patient engagement and obtaining patient insights.
Objective: This study aimed to use web-based crowdsourcing platforms to evaluate Belimumab Pregnancy Registry (BPR) awareness among patients and physicians and to identify potential barriers to pregnancy registry enrollment with the BPR as a case study.
Methods: We conducted 2 surveys using separate web-based crowdsourcing platforms: Amazon Mechanical Turk (a 14-question patient survey) and Sermo RealTime (an 11-question rheumatologist survey). Eligible patients were women, aged 18-55 years; diagnosed with systemic lupus erythematosus (SLE); and pregnant, recently pregnant (within 2 years), or planning pregnancy. Eligible rheumatologists had prescribed belimumab and treated pregnant women. Responses were descriptively analyzed.
Results: Of 151 patient respondents over a 3-month period (n=88, 58.3% aged 26-35 years; n=149, 98.7% with mild or moderate SLE; and n=148, 98% from the United States), 51% (77/151) were currently or recently pregnant. Overall, 169 rheumatologists completed the survey within 48 hours, and 59.2% (100/169) were based in the United States. Belimumab exposure was reported by 41.7% (63/151) patients, whereas 51.7% (75/145) rheumatologists had prescribed belimumab to <5 patients, 25.5% (37/145) had prescribed to 5-10 patients, and 22.8% (33/145) had prescribed to >10 patients who were pregnant or trying to conceive. Of the patients exposed to belimumab, 51% (32/63) were BPR-aware, and 45.5% (77/169) of the rheumatologists were BPR-aware. Overall, 60% (38/63) of patients reported belimumab discontinuation because of pregnancy or planned pregnancy. Among the 77 BPR-aware rheumatologists, 70 (91%) referred patients to the registry. Concerns among rheumatologists who did not prescribe belimumab during pregnancy included unknown pregnancy safety profile (119/169, 70.4%), and 61.5% (104/169) reported their patients' concerns about the unknown pregnancy safety profile. Belimumab exposure during or recently after pregnancy or while trying to conceive was reported in patients with mild (6/64, 9%), moderate (22/85, 26%), or severe (1/2, 50%) SLE. Rheumatologists more commonly recommended belimumab for moderate (84/169, 49.7%) and severe (123/169, 72.8%) SLE than for mild SLE (36/169, 21.3%) for patients who were trying to conceive, recently pregnant, or currently pregnant. Overall, 81.6% (138/169) of the rheumatologists suggested a belimumab washout period before pregnancy of 0-30 days (44/138, 31.9%), 30-60 days (64/138, 46.4%), or >60 days (30/138, 21.7%).
Conclusions: In this case, crowdsourcing efficiently obtained patient and rheumatologist input, with some patients with SLE continuing to use belimumab during or while planning a pregnancy. There was moderate awareness of the BPR among patients and physicians.

Project description:
Background: Medical terms are a major obstacle for patients to comprehend their electronic health record (EHR) notes. Clinical natural language processing (NLP) systems that link EHR terms to lay terms or definitions allow patients to easily access helpful information when reading through their EHR notes, and have been shown to improve patient EHR comprehension. However, high-quality lay language resources for EHR terms are very limited in the public domain. Because expanding and curating such a resource is a costly process, it is beneficial and even necessary to identify terms important for patient EHR comprehension first.
Objective: We aimed to develop an NLP system, called adapted distant supervision (ADS), to rank candidate terms mined from EHR corpora. We will give EHR terms ranked as high by ADS a higher priority for lay language annotation, that is, creating lay definitions for these terms.
Methods: Adapted distant supervision uses distant supervision from consumer health vocabulary and transfer learning to adapt itself to solve the problem of ranking EHR terms in the target domain. We investigated 2 state-of-the-art transfer learning algorithms (ie, feature space augmentation and supervised distant supervision) and designed 5 types of learning features, including distributed word representations learned from large EHR data for ADS. For evaluating ADS, we asked domain experts to annotate 6038 candidate terms as important or nonimportant for EHR comprehension. We then randomly divided these data into the target-domain training data (1000 examples) and the evaluation data (5038 examples). We compared ADS with 2 strong baselines, including standard supervised learning, on the evaluation data.
Results: The ADS system using feature space augmentation achieved the best average precision, 0.850, on the evaluation set when using 1000 target-domain training examples. The ADS system using supervised distant supervision achieved the best average precision, 0.819, on the evaluation set when using only 100 target-domain training examples. The 2 ADS systems both performed significantly better than the baseline systems (P<.001 for all measures and all conditions). Using a rich set of learning features contributed to ADS's performance substantially.
Conclusions: ADS can effectively rank terms mined from EHRs. Transfer learning improved ADS's performance even with a small number of target-domain training examples. EHR terms prioritized by ADS were used to expand a lay language resource that supports patient EHR comprehension. The top 10,000 EHR terms ranked by ADS are available upon request.
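The results above are reported as average precision over a ranked list of candidate terms. As a rough illustration of how that metric is commonly computed for a single ranking, here is a minimal sketch. The function name and the toy relevance list are hypothetical, and the description does not say whether the study used exactly this formulation.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list.

    ranked_relevance: 0/1 flags ordered from the highest-ranked
    candidate to the lowest (1 = annotated as important).
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical ranking: important terms at ranks 1, 3, and 4.
ap = average_precision([1, 0, 1, 1, 0])
print(round(ap, 3))
```

A ranking that places every important term ahead of every unimportant one scores 1.0, so values such as 0.850 indicate that most important terms appear near the top of the list.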

