Dataset Information

Automated detection of discourse segment and experimental types from the text of cancer pathway results sections.

ABSTRACT: Automated machine-reading biocuration systems typically use sentence-by-sentence information extraction to construct meaning representations for use by curators. This does not directly reflect the typical discourse structure used by scientists to construct an argument from the experimental data available within an article, and is therefore less likely to correspond to representations typically used in biomedical informatics systems (let alone to the mental models that scientists have). In this study, we develop Natural Language Processing methods to locate, extract, and classify the individual passages of text from articles' Results sections that refer to experimental data. In our domain of interest (molecular biology studies of cancer signal transduction pathways), individual articles may contain as many as 30 small-scale individual experiments describing a variety of findings, upon which authors base their overall research conclusions. Our system automatically classifies discourse segments in these texts into seven categories (fact, hypothesis, problem, goal, method, result, implication) with an F-score of 0.68. These segments describe the essential building blocks of scientific discourse to (i) provide context for each experiment, (ii) report experimental details and (iii) explain the data's meaning in context. We evaluate our system on text passages from articles that were curated in molecular biology databases (the Pathway Logic Datum repository, the Molecular Interaction MINT and INTACT databases) linking individual experiments in articles to the type of assay used (coprecipitation, phosphorylation, translocation, etc.). We use supervised machine learning techniques on text passages containing unambiguous references to experiments to obtain baseline F1 scores of 0.59 for MINT, 0.71 for INTACT and 0.63 for Pathway Logic. Although preliminary, these results support the notion that targeting information extraction methods to experimental results could provide accurate, automated methods for biocuration. We also suggest the need for finer-grained curation of experimental methods used when constructing molecular biology databases.
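
The F-scores quoted in the abstract combine precision and recall for each category. The following is a minimal sketch of how per-class and macro-averaged F1 could be computed for the seven discourse categories. The toy labels are hypothetical, and the abstract does not say which averaging scheme the authors used, so macro-averaging here is an assumption.

```python
from collections import Counter

# The seven discourse segment categories named in the abstract.
LABELS = ["fact", "hypothesis", "problem", "goal", "method", "result", "implication"]


def per_class_f1(gold, pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (an assumed averaging scheme)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy data, not taken from the study.
gold = ["result", "method", "result", "fact", "implication", "result"]
pred = ["result", "method", "implication", "fact", "implication", "method"]
scores = per_class_f1(gold, pred)
```

Calling `macro_f1(scores)` then gives a single summary number; categories that never occur in the toy data contribute an F1 of zero under this convention.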

SUBMITTER: Burns GA

PROVIDER: S-EPMC5006090 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature

Publications

Automated detection of discourse segment and experimental types from the text of cancer pathway results sections.

Burns GA, Dasigi P, de Waard A, Hovy EH

Database: the journal of biological databases and curation, 2016-08-31

PMID: 27580922

Similar Datasets

Project description: BACKGROUND: The police attend numerous domestic violence events each year, recording details of these events as both structured (coded) data and unstructured free-text narratives. Abuse types (including physical, psychological, emotional, and financial) conducted by persons of interest (POIs) along with any injuries sustained by victims are typically recorded in long descriptive narratives. OBJECTIVE: We aimed to determine if an automated text mining method could identify abuse types and any injuries sustained by domestic violence victims in narratives contained in a large police dataset from the New South Wales Police Force. METHODS: We used a training set of 200 recorded domestic violence events to design a knowledge-driven approach based on syntactical patterns in the text and then applied this approach to a large set of police reports. RESULTS: Testing our approach on an evaluation set of 100 domestic violence events provided precision values of 90.2% and 85.0% for abuse type and victim injuries, respectively. In a set of 492,393 domestic violence reports, we found 71.32% (351,178) of events with mentions of the abuse type(s) and more than one-third (177,117 events; 35.97%) contained victim injuries. "Emotional/verbal abuse" (33.46%; 117,488) was the most common abuse type, followed by "punching" (86,322 events; 24.58%) and "property damage" (22.27%; 78,203 events). "Bruising" was the most common form of injury sustained (51,455 events; 29.03%), with "cut/abrasion" (28.93%; 51,284 events) and "red marks/signs" (23.71%; 42,038 events) ranking second and third, respectively. CONCLUSIONS: The results suggest that text mining can automatically extract information from police-recorded domestic violence events that can support further public health research into domestic violence, such as examining the relationship of abuse types with victim injuries and of gender and abuse types with risk escalation for victims of domestic violence. Potential also exists for this extracted information to be linked to information on the mental health status.

Project description: PURPOSE: To develop and test deep learning classifiers that detect gonioscopic angle closure and primary angle closure disease (PACD) based on fully automated analysis of anterior segment OCT (AS-OCT) images. METHODS: Subjects were recruited as part of the Chinese-American Eye Study (CHES), a population-based study of Chinese Americans in Los Angeles, California, USA. Each subject underwent a complete ocular examination including gonioscopy and AS-OCT imaging in each quadrant of the anterior chamber angle (ACA). Deep learning methods were used to develop 3 competing multi-class convolutional neural network (CNN) classifiers for modified Shaffer grades 0, 1, 2, 3, and 4. Binary probabilities for closed (grades 0 and 1) and open (grades 2, 3, and 4) angles were calculated by summing over the corresponding grades. Classifier performance was evaluated by 5-fold cross-validation and on an independent test dataset. Outcome measures included area under the receiver operating characteristic curve (AUC) for detecting gonioscopic angle closure and PACD, defined as either 2 or 3 quadrants of gonioscopic angle closure per eye. RESULTS: A total of 4036 AS-OCT images with corresponding gonioscopy grades (1943 open, 2093 closed) were obtained from 791 CHES subjects. Three competing CNN classifiers were developed with a cross-validation dataset of 3396 images (1632 open, 1764 closed) from 664 subjects. The remaining 640 images (311 open, 329 closed) from 127 subjects were segregated into a test dataset. The best-performing classifier was developed by applying transfer learning to the ResNet-18 architecture. For detecting gonioscopic angle closure, this classifier achieved an AUC of 0.933 (95% confidence interval, 0.925-0.941) on the cross-validation dataset and 0.928 on the test dataset. For detecting PACD based on 2- and 3-quadrant definitions, the ResNet-18 classifier achieved AUCs of 0.964 and 0.952, respectively, on the test dataset. CONCLUSION: Deep learning classifiers effectively detect gonioscopic angle closure and PACD based on automated analysis of AS-OCT images. These methods could be used to automate clinical evaluations of the ACA and improve access to eye care in high-risk populations.
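
The step of turning five-way grade probabilities into a binary closed/open probability can be illustrated with a short sketch. The function name and the example probabilities below are hypothetical; only the grouping (grades 0 and 1 closed, grades 2 to 4 open) comes from the description above.

```python
def closed_open_probabilities(grade_probs):
    """Collapse probabilities for modified Shaffer grades 0..4 into (closed, open).

    grade_probs is assumed to be a sequence of five class probabilities,
    indexed by grade, such as the softmax output of a multi-class classifier.
    """
    if len(grade_probs) != 5:
        raise ValueError("expected one probability per grade 0..4")
    p_closed = grade_probs[0] + grade_probs[1]                # grades 0 and 1
    p_open = grade_probs[2] + grade_probs[3] + grade_probs[4]  # grades 2, 3 and 4
    return p_closed, p_open


# Hypothetical classifier output for a single image.
probs = [0.10, 0.25, 0.40, 0.20, 0.05]
p_closed, p_open = closed_open_probabilities(probs)
```

Because the five grade probabilities sum to one, the two binary probabilities do as well, so `p_closed` alone can serve as the score for an ROC analysis.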

Project description: Background: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification. Methods: We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus. Results: Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus. Conclusion: We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
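
The general idea of combining look-up tables with regular expressions can be sketched in a few lines. This is an illustration in Python, not the Perl package described above: the name list, the patterns, and the placeholder tags are all hypothetical, and a real system would need far larger dictionaries and context heuristics.

```python
import re

# Hypothetical look-up table; a real system would load large name dictionaries.
KNOWN_NAMES = {"smith", "garcia", "chen"}

# Illustrative patterns only; they cover one date format and one phone format.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
WORD_RE = re.compile(r"\b[A-Za-z]+\b")


def deidentify(text):
    """Replace dates, phone numbers, and dictionary names with placeholder tags."""
    text = DATE_RE.sub("[DATE]", text)
    text = PHONE_RE.sub("[PHONE]", text)

    def mask_name(match):
        word = match.group(0)
        # Dictionary look-up, case-insensitive.
        return "[NAME]" if word.lower() in KNOWN_NAMES else word

    return WORD_RE.sub(mask_name, text)


# Hypothetical note text, not from any real record.
note = "Pt seen by Dr. Smith on 03/14/2005; call 555-123-4567 with results."
clean = deidentify(note)
```

A sketch like this favours recall over precision in the same spirit as the reported figures: any token found in the name table is masked, even when it is used as an ordinary word.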
