Dataset Information

Automatic Extraction of Lung Cancer Staging Information From Computed Tomography Reports: Deep Learning Approach.

ABSTRACT:

Background

Lung cancer is the leading cause of cancer deaths worldwide. Clinical staging of lung cancer plays a crucial role in making treatment decisions and evaluating prognosis. However, in clinical practice, approximately one-half of the clinical stages of lung cancer patients are inconsistent with their pathological stages. As one of the most important diagnostic modalities for staging, chest computed tomography (CT) provides a wealth of information about cancer staging, but the free-text nature of the CT reports obstructs their computerization.

Objective

We aimed to automatically extract the staging-related information from CT reports to support accurate clinical staging of lung cancer.

Methods

In this study, we developed an information extraction (IE) system to extract the staging-related information from CT reports. The system consisted of three parts: named entity recognition (NER), relation classification (RC), and postprocessing (PP). We first summarized 22 questions about lung cancer staging based on the TNM staging guideline. We then implemented three state-of-the-art NER algorithms to recognize the entities of interest. Next, we designed a novel RC method using the relation sign constraint (RSC) to classify the relations between entities. Finally, a rule-based PP module was established to obtain the formatted answers using the results of NER and RC.
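
The three-stage flow described above (NER, then RC, then PP) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the lexicon and regular expression stand in for a trained NER model, the nearest-entity rule stands in for the RC model (it is not the RSC method), and the question key `max_tumor_size_cm` is hypothetical and not taken from the study's 22 questions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # hypothetical labels: "Mass" or "Size"
    start: int
    end: int


# Hypothetical lexicon and pattern standing in for a trained NER model.
LEXICON = {"mass": "Mass", "nodule": "Mass"}
SIZE_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:cm|mm)")


def recognize_entities(report: str) -> List[Entity]:
    """Stage 1 (NER stand-in): find lesion mentions and size mentions."""
    entities = []
    for m in re.finditer(r"[A-Za-z]+", report):
        label = LEXICON.get(m.group().lower())
        if label:
            entities.append(Entity(m.group(), label, m.start(), m.end()))
    for m in SIZE_RE.finditer(report):
        entities.append(Entity(m.group(), "Size", m.start(), m.end()))
    return sorted(entities, key=lambda e: e.start)


def classify_relations(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Stage 2 (RC stand-in): link each size to the nearest lesion mention."""
    masses = [e for e in entities if e.label == "Mass"]
    pairs = []
    for size in (e for e in entities if e.label == "Size"):
        if masses:
            nearest = min(masses, key=lambda m: abs(m.start - size.start))
            pairs.append((nearest, size))
    return pairs


def to_cm(size_text: str) -> float:
    """Normalize a size mention such as '8 mm' or '3.2 cm' to centimeters."""
    value = float(re.match(r"\d+(?:\.\d+)?", size_text).group())
    return value / 10 if size_text.endswith("mm") else value


def postprocess(pairs: List[Tuple[Entity, Entity]]) -> Dict[str, str]:
    """Stage 3 (rule-based PP): format the answer to one hypothetical question."""
    sizes = [to_cm(size.text) for _, size in pairs]
    answer = f"{max(sizes):.1f}" if sizes else "not mentioned"
    return {"max_tumor_size_cm": answer}


report = ("A mass measuring 3.2 cm in the right upper lobe; "
          "a nodule of 8 mm in the left lower lobe.")
answers = postprocess(classify_relations(recognize_entities(report)))
print(answers)
```

Keeping the stages as separate functions mirrors the modular design in the abstract: each stage could be swapped for a learned model without changing the others.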

Results

We evaluated the developed IE system on a clinical data set containing 392 chest CT reports collected from the Department of Thoracic Surgery II in the Peking University Cancer Hospital. The experimental results showed that the bidirectional encoder representation from transformers (BERT) model outperformed the iterated dilated convolutional neural networks-conditional random field (ID-CNN-CRF) and bidirectional long short-term memory networks-conditional random field (Bi-LSTM-CRF) for NER tasks with macro-F1 scores of 80.97% and 90.06% under the exact and inexact matching schemes, respectively. For the RC task, the proposed RSC showed better performance than the baseline methods. Further, the BERT-RSC model achieved the best performance with a macro-F1 score of 97.13% and a micro-F1 score of 98.37%. Moreover, the rule-based PP module could correctly obtain the formatted results using the extractions of NER and RC, achieving a macro-F1 score of 94.57% and a micro-F1 score of 96.74% for all the 22 questions.
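
The results above rely on two distinctions: exact versus inexact matching, and macro-F1 versus micro-F1. The sketch below shows one common way to compute them. It assumes that "inexact" means a predicted span overlaps a gold span of the same label, which may differ from the study's precise definition. The spans and labels are made-up toy data, not results from the paper.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def matches(pred: Span, gold: Span, exact: bool) -> bool:
    """Same label required; exact needs equal boundaries, inexact needs overlap."""
    if pred[2] != gold[2]:
        return False
    if exact:
        return pred[:2] == gold[:2]
    return pred[0] < gold[1] and gold[0] < pred[1]


def count(preds: List[Span], golds: List[Span], exact: bool) -> Dict[str, List[int]]:
    """Return per-label [true positives, false positives, false negatives]."""
    counts: Dict[str, List[int]] = {}
    unmatched = list(golds)
    for p in preds:
        c = counts.setdefault(p[2], [0, 0, 0])
        hit = next((g for g in unmatched if matches(p, g, exact)), None)
        if hit is not None:
            unmatched.remove(hit)  # each gold span can be matched only once
            c[0] += 1
        else:
            c[1] += 1
    for g in unmatched:
        counts.setdefault(g[2], [0, 0, 0])[2] += 1
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro(counts: Dict[str, List[int]]) -> Tuple[float, float]:
    """Macro-F1 averages per-label F1; micro-F1 pools the counts first."""
    if not counts:
        return 0.0, 0.0
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return macro, f1(tp, fp, fn)


# Toy data: one boundary error on the "Size" span, one missed "Mass".
gold = [(0, 4, "Mass"), (10, 16, "Size"), (20, 25, "Mass")]
pred = [(0, 4, "Mass"), (11, 16, "Size"), (30, 34, "Mass")]

exact_scores = macro_micro(count(pred, gold, exact=True))
inexact_scores = macro_micro(count(pred, gold, exact=False))
print("exact   (macro, micro):", exact_scores)
print("inexact (macro, micro):", inexact_scores)
```

In this toy case the boundary error counts against the exact scores but not the inexact ones, which is why inexact scores are typically higher. Macro-F1 weights every label equally, so rare labels influence it more than they influence micro-F1.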

Conclusions

We conclude that the developed IE system can effectively and accurately extract information about lung cancer staging from CT reports. Experimental results show that the extracted results have significant potential for further use in stage verification and prediction to facilitate accurate clinical staging.

SUBMITTER: Hu D

PROVIDER: S-EPMC8339987 | biostudies-literature | 2021 Jul

REPOSITORIES: biostudies-literature

Publications

Automatic Extraction of Lung Cancer Staging Information From Computed Tomography Reports: Deep Learning Approach.

Hu Danqing, Zhang Huanyao, Li Shaolei, Wang Yuhong, Wu Nan, Lu Xudong

JMIR Medical Informatics, 2021 Jul 21, issue 7

PMID: 34287213

Similar Datasets

Project description: Background: This study aimed (I) to investigate the clinical implication of computed tomography (CT) cavity volume in tuberculosis (TB) and non-tuberculous mycobacterial pulmonary disease (NTM-PD), and (II) to develop a three-dimensional (3D) nnU-Net model to automatically detect and quantify cavity volume on CT images. Methods: We retrospectively included a convenience sample of 206 TB and 186 NTM-PD patients at a tertiary referral hospital who underwent thin-section chest CT scans from 2012 through 2019. TB was microbiologically confirmed, and NTM-PD was diagnosed according to the 2007 Infectious Diseases Society of America/American Thoracic Society guideline. The reference cavities were semi-automatically segmented on CT images, and a 3D nnU-Net model was built with 298 cases (240 cases for training, 20 for tuning, and 38 for internal validation). Receiver operating characteristic curves were used to evaluate the accuracy of the CT cavity volume for two clinically relevant parameters: sputum smear positivity in TB and treatment in NTM-PD. The sensitivity and false-positive rate were calculated to assess the cavity detection of nnU-Net using radiologist-detected cavities as references, and the intraclass correlation coefficient (ICC) between the reference and the U-Net-derived cavity volumes was analyzed. Results: The mean CT cavity volumes in TB and NTM-PD patients were 11.3 and 16.4 cm³, respectively, and were significantly greater in smear-positive TB (P<0.001) and NTM-PD necessitating treatment (P=0.020). The CT cavity volume provided areas under the curve of 0.701 [95% confidence interval (CI): 0.620-0.782] for TB sputum positivity and 0.834 (95% CI: 0.773-0.894) for necessity of NTM-PD treatment. The nnU-Net provided per-patient sensitivity of 100% (19/19) and per-lesion sensitivity of 83.7% (41/49) in the validation dataset, with an average of 0.47 false-positive small cavities per patient (median volume, 0.26 cm³). The mean Dice similarity coefficient between the manually segmented cavities and the U-Net-derived cavities was 78.9. The ICCs between the reference and U-Net-derived volumes were 0.991 (95% CI: 0.983-0.995) and 0.933 (95% CI: 0.897-0.957) on a per-patient and per-lesion basis, respectively. Conclusions: CT cavity volume was associated with sputum positivity in TB and necessity of treatment in NTM-PD. The 3D nnU-Net model could automatically detect and quantify mycobacterial cavities on chest CT, helping assess TB infectivity and initiate NTM-TB treatment.

Project description: Background: Pulmonary segments are valuable because they can provide more precise localization and intricate details of lung cancer than lung lobes. With advances in precision therapy, there is an increasing demand for the identification and visualization of pulmonary segments in computed tomography (CT) images to aid in the precise treatment of lung cancer. This study aimed to integrate multiple deep-learning models to accurately segment pulmonary segments in CT images using a bronchial tree (BT)-based approach. Methods: The proposed segmentation method for pulmonary segments using the BT-based approach comprised the following five essential steps: (I) segmentation of the lung using a U-Net (R231) (public access) model; (II) segmentation of the lobes using a V-Net (self-developed) model; (III) segmentation of the airway using a combination of a differential geometric approach and a BronchiNet (public access) model; (IV) labeling of the BT branches based on anatomical position; and (V) segmentation of the pulmonary segments based on the distance of each voxel to the labeled BT branches. This five-step process was applied to 14 high-resolution breath-hold CT images and compared against manual segmentations for evaluation. Results: For the lung segmentation, the lung mask had a mean Dice similarity coefficient (DSC) of 0.98±0.03. For the lobe segmentation, the V-Net model had a mean DSC of 0.94±0.06. For the airway segmentation, the average total length of the segmented airway trees per image scan was 1,902.8±502.1 mm, and the average number of the maximum airway tree generations was 8.5±1.3. For the segmentation of the pulmonary segments, the proposed method had a DSC of 0.73±0.11 and a mean surface distance of 6.1±2.9 mm. Conclusions: This study demonstrated the feasibility of combining multiple deep-learning models for the auxiliary segmentation of pulmonary segments on CT images using a BT-based approach. The results highlighted the potential of the BT-based method for the semi-automatic segmentation of the pulmonary segment.

Project description: Background: Computed tomography (CT) chest scans have become commonly used in clinical diagnosis. Image quality assessment (IQA) for CT images plays an important role in CT examination. IQA is still a manual and subjective process, and even experienced radiologists make mistakes due to human limitations (fatigue, perceptual biases, and cognitive biases). Poor consensus among radiologists introduces further bias. Excellent IQA methods can reliably give an objective evaluation result and also reduce the workload of radiologists. This study proposes a deep learning (DL)-based automatic IQA method to assess whether the image quality of the respiratory phase on CT chest images is optimal, so that the CT chest images can be used in the assessment of the patient's physical condition. Methods: This retrospective study analysed 212 patients' chest CT images, with 188 patients allocated to a training set (150 patients), a validation set (18 patients), and a test set (20 patients). The remaining 24 patients were used for the observer study. Data augmentation methods were applied to address the problem of insufficient data. The DL-based IQA method combines image selection, tracheal carina segmentation, and bronchial beam detection. To automatically select the CT image containing the tracheal carina, an image selection model was employed. Afterward, the area-based approach and score-based approach were proposed and used to further optimize the tracheal carina segmentation and bronchial beam detection results, respectively. Finally, the score for the image quality of the patient's respiratory phase images given by the DL-based automatic IQA method was compared with the mean opinion score (MOS) given in the observer study, in which four blinded experienced radiologists took part. Results: The DL-based automatic IQA method achieved good performance in assessing the image quality of the respiratory phase images. For the CT sequence of the same patient, the DL-based IQA method had an accuracy of 92% in the assessment score, while the radiologists had an accuracy of 88%. The Kappa value of the assessment score between the DL-based IQA method and radiologists was 0.75, with a sensitivity of 85%, specificity of 91%, positive predictive value (PPV) of 92%, negative predictive value (NPV) of 93%, and accuracy of 88%. Conclusions: This study develops and validates a DL-based automatic IQA method for the respiratory phase on CT chest images. The performance of this method surpassed that of the experienced radiologists on the independent test set used in this study. In clinical practice, it could reduce the workload of radiologists and minimize errors caused by human limitations.

Project description: Background: Electronic health records store large amounts of patient clinical data. Despite efforts to structure patient data, clinical notes containing rich patient information remain stored as free text, greatly limiting its exploitation. This includes family history, which is highly relevant for applications such as diagnosis and prognosis. Objective: This study aims to develop automatic strategies for annotating family history information in clinical notes, focusing not only on the extraction of relevant entities such as family members and disease mentions but also on the extraction of relations between the identified entities. Methods: This study extends a previous contribution for the 2019 track on family history extraction from national natural language processing clinical challenges by improving a previously developed rule-based engine, using deep learning (DL) approaches for the extraction of entities from clinical notes, and combining both approaches in a hybrid end-to-end system capable of successfully extracting family member and observation entities and the relations between those entities. Furthermore, this study analyzes the impact of factors such as the use of external resources and different types of embeddings in the performance of DL models. Results: The approaches developed were evaluated in a first task regarding entity extraction and in a second task concerning relation extraction. The proposed DL approach improved observation extraction, obtaining F1 scores of 0.8688 and 0.7907 in the training and test sets, respectively. However, DL approaches have limitations in the extraction of family members. The rule-based engine was adjusted to have higher generalizing capability and achieved family member extraction F1 scores of 0.8823 and 0.8092 in the training and test sets, respectively. The resulting hybrid system obtained F1 scores of 0.8743 and 0.7979 in the training and test sets, respectively. For the second task, the original evaluator was adjusted to perform a more exact evaluation than the original one, and the hybrid system obtained F1 scores of 0.6480 and 0.5082 in the training and test sets, respectively. Conclusions: We evaluated the impact of several factors on the performance of DL models, and we present an end-to-end system for extracting family history information from clinical notes, which can help in the structuring and reuse of this type of information. The final hybrid solution is provided in a publicly available code repository.

Project description: Background: There is progress to be made in building artificially intelligent systems to detect abnormalities that are not only accurate but can handle the true breadth of findings that radiologists encounter in body (chest, abdomen, and pelvis) computed tomography (CT). Currently, the major bottleneck for developing multi-disease classifiers is a lack of manually annotated data. The purpose of this work was to develop high throughput multi-label annotators for body CT reports that can be applied across a variety of abnormalities, organs, and disease states thereby mitigating the need for human annotation. Methods: We used a dictionary approach to develop rule-based algorithms (RBA) for extraction of disease labels from radiology text reports. We targeted three organ systems (lungs/pleura, liver/gallbladder, kidneys/ureters) with four diseases per system based on their prevalence in our dataset. To expand the algorithms beyond pre-defined keywords, attention-guided recurrent neural networks (RNN) were trained using the RBA-extracted labels to classify reports as being positive for one or more diseases or normal for each organ system. Alternative effects on disease classification performance were evaluated using random initialization or pre-trained embedding as well as different sizes of training datasets. The RBA was tested on a subset of 2158 manually labeled reports and performance was reported as accuracy and F-score. The RNN was tested against a test set of 48,758 reports labeled by RBA and performance was reported as area under the receiver operating characteristic curve (AUC), with 95% CIs calculated using the DeLong method. Results: Manual validation of the RBA confirmed 91-99% accuracy across the 15 different labels. Our models extracted disease labels from 261,229 radiology reports of 112,501 unique subjects. Pre-trained models outperformed random initialization across all diseases. As the training dataset size was reduced, performance was robust except for a few diseases with a relatively small number of cases. Pre-trained classification AUCs reached > 0.95 for all four disease outcomes and normality across all three organ systems. Conclusions: Our label-extracting pipeline was able to encompass a variety of cases and diseases in body CT reports by generalizing beyond strict rules with exceptional accuracy. The method described can be easily adapted to enable automated labeling of hospital-scale medical data sets for training image-based disease classifiers.
