Dataset Information

Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer.

ABSTRACT:

Importance

Application of deep learning algorithms to whole-slide pathology images can potentially improve diagnostic accuracy and efficiency.

Objective

Assess the performance of automated deep learning algorithms at detecting metastases in hematoxylin and eosin-stained tissue sections of lymph nodes of women with breast cancer and compare it with pathologists' diagnoses in a diagnostic setting.

Design, setting, and participants

Researcher challenge competition (CAMELYON16) to develop automated solutions for detecting lymph node metastases (November 2015-November 2016). A training data set of whole-slide images from 2 centers in the Netherlands with (n = 110) and without (n = 160) nodal metastases verified by immunohistochemical staining were provided to challenge participants to build algorithms. Algorithm performance was evaluated in an independent test set of 129 whole-slide images (49 with and 80 without metastases). The same test set of corresponding glass slides was also evaluated by a panel of 11 pathologists with time constraint (WTC) from the Netherlands to ascertain likelihood of nodal metastases for each slide in a flexible 2-hour session, simulating routine pathology workflow, and by 1 pathologist without time constraint (WOTC).

Exposures

Deep learning algorithms submitted as part of a challenge competition or pathologist interpretation.

Main outcomes and measures

The presence of specific metastatic foci and the absence vs presence of lymph node metastasis in a slide or image using receiver operating characteristic curve analysis. The 11 pathologists participating in the simulation exercise rated their diagnostic confidence as definitely normal, probably normal, equivocal, probably tumor, or definitely tumor.

Results

The area under the receiver operating characteristic curve (AUC) for the algorithms ranged from 0.556 to 0.994. The top-performing algorithm achieved a lesion-level, true-positive fraction comparable with that of the pathologist WOTC (72.4% [95% CI, 64.3%-80.4%]) at a mean of 0.0125 false-positives per normal whole-slide image. For the whole-slide image classification task, the best algorithm (AUC, 0.994 [95% CI, 0.983-0.999]) performed significantly better than the pathologists WTC in a diagnostic simulation (mean AUC, 0.810 [range, 0.738-0.884]; P < .001). The top 5 algorithms had a mean AUC that was comparable with the pathologist interpreting the slides in the absence of time constraints (mean AUC, 0.960 [range, 0.923-0.994] for the top 5 algorithms vs 0.966 [95% CI, 0.927-0.998] for the pathologist WOTC).

Conclusions and relevance

In the setting of a challenge competition, some deep learning algorithms achieved better diagnostic performance than a panel of 11 pathologists participating in a simulation exercise designed to mimic routine pathology workflow; algorithm performance was comparable with an expert pathologist interpreting whole-slide images without time constraints. Whether this approach has clinical utility will require evaluation in a clinical setting.

SUBMITTER: Ehteshami Bejnordi B

PROVIDER: S-EPMC5820737 | biostudies-literature |

REPOSITORIES: biostudies-literature

ACCESS DATA

Similar Datasets

Project description:Lymph node involvement increases the risk of breast cancer recurrence. An accurate non-invasive assessment of nodal involvement is valuable in cancer staging, surgical risk, and cost savings. Radiomics has been proposed to pre-operatively predict sentinel lymph node (SLN) status; however, radiomic models are known to be sensitive to acquisition parameters. The purpose of this study was to develop a prediction model for preoperative prediction of SLN metastasis using deep learning-based (DLB) features and compare its predictive performance to state-of-the-art radiomics. Specifically, this study aimed to compare the generalizability of radiomics vs DLB features in an independent test set with dissimilar resolution. Dynamic contrast-enhancement images from 198 patients (67 positive SLNs) were used in this study. Of these subjects, 163 had an in-plane resolution of 0.7 × 0.7 mm2, which were randomly divided into a training set (approximately 67%) and a validation set (approximately 33%). The remaining 35 subjects with a different in-plane resolution (0.78 × 0.78 mm2) were treated as independent testing set for generalizability. Two methods were employed: (1) conventional radiomics (CR), and (2) DLB features which replaced hand-curated features with pre-trained VGG-16 features. The threshold determined using the training set was applied to the independent validation and testing dataset. Same feature reduction, feature selection, model creation procedures were used for both approaches. In the validation set (same resolution as training), the DLB model outperformed the CR model (accuracy 83% vs 80%). Furthermore, in the independent testing set of the dissimilar resolution, the DLB model performed markedly better than the CR model (accuracy 77% vs 71%). The predictive performance of the DLB model outperformed the CR model for this task. More interestingly, these improvements were seen particularly in the independent testing set of dissimilar resolution. This could indicate that DLB features can ultimately result in a more generalizable model.