Dataset Information

Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms.

ABSTRACT:

Importance

Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives.

Objective

To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms.

Design, setting, and participants

In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016.

Main outcomes and measurements

Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve and algorithm specificity compared with radiologists' specificity with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessment was developed and evaluated.
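The comparison described above fixes sensitivity at the radiologists' level and reads off the algorithm's specificity at the matching score threshold. A minimal sketch of that operating-point calculation follows. The function name, the toy scores, and the convention that an examination is called positive when its score is at or above the threshold are assumptions for illustration, not details taken from the challenge code.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Find the highest threshold whose sensitivity reaches the target.

    scores: per-examination algorithm scores (higher = more suspicious).
    labels: 1 for cancer within the follow-up window, 0 otherwise.
    Returns (threshold, sensitivity, specificity).
    Assumption: an examination is positive when score >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    # Walk thresholds from strictest to most lenient; sensitivity can only rise.
    for threshold in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < threshold for s in negatives) / len(negatives)
            return threshold, sensitivity, specificity

    # Not reachable for target_sensitivity <= 1: the lowest score flags every case.
    raise ValueError("target sensitivity cannot be reached")


# Toy data (hypothetical): eight examinations, four with cancer.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
result = specificity_at_sensitivity(toy_scores, toy_labels, 0.75)
```

With real data, the target would be the radiologists' observed sensitivity (85.9% or 83.9% in this study), and the returned specificity is the value compared against the radiologists' own.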

Results

Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists' sensitivity, lower than community-practice radiologists' specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity.

Conclusions and relevance

While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.

SUBMITTER: Schaffter T

PROVIDER: S-EPMC7052735 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Importance: A computer algorithm that performs at or above the level of radiologists in mammography screening assessment could improve the effectiveness of breast cancer screening. Objective: To perform an external evaluation of 3 commercially available artificial intelligence (AI) computer-aided detection algorithms as independent mammography readers and to assess the screening performance when combined with radiologists. Design, setting, and participants: This retrospective case-control study was based on a double-reader population-based mammography screening cohort of women screened at an academic hospital in Stockholm, Sweden, from 2008 to 2015. The study included 8805 women aged 40 to 74 years who underwent mammography screening and who did not have implants or prior breast cancer. The study sample included 739 women who were diagnosed as having breast cancer (positive) and a random sample of 8066 healthy controls (negative for breast cancer). Main outcomes and measures: Positive follow-up findings were determined by pathology-verified diagnosis at screening or within 12 months thereafter. Negative follow-up findings were determined by a 2-year cancer-free follow-up. Three AI computer-aided detection algorithms (AI-1, AI-2, and AI-3), sourced from different vendors, yielded a continuous score for the suspicion of cancer in each mammography examination. For a decision of normal or abnormal, the cut point was defined by the mean specificity of the first-reader radiologists (96.6%). Results: The median age of study participants was 60 years (interquartile range, 50-66 years) for 739 women who received a diagnosis of breast cancer and 54 years (interquartile range, 47-63 years) for 8066 healthy controls. The cases positive for cancer comprised 618 (84%) screen detected and 121 (16%) clinically detected within 12 months of the screening examination.
The area under the receiver operating curve for cancer detection was 0.956 (95% CI, 0.948-0.965) for AI-1, 0.922 (95% CI, 0.910-0.934) for AI-2, and 0.920 (95% CI, 0.909-0.931) for AI-3. At the specificity of the radiologists, the sensitivities were 81.9% for AI-1, 67.0% for AI-2, 67.4% for AI-3, 77.4% for first-reader radiologist, and 80.1% for second-reader radiologist. Combining AI-1 with first-reader radiologists achieved 88.6% sensitivity at 93.0% specificity (abnormal defined by either of the 2 making an abnormal assessment). No other examined combination of AI algorithms and radiologists surpassed this sensitivity level. Conclusions and relevance: To our knowledge, this study is the first independent evaluation of several AI computer-aided detection algorithms for screening mammography. The results of this study indicated that a commercially available AI computer-aided detection algorithm can assess screening mammograms with a sufficient diagnostic performance to be further evaluated as an independent reader in prospective clinical trials. Combining the first readers with the best algorithm identified more cases positive for cancer than combining the first readers with second readers.
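The combination rule in this description calls an examination abnormal when either the algorithm or the reader flags it. The sketch below illustrates why such an "either" rule tends to raise sensitivity while lowering specificity. All data are hypothetical toy values and the function names are illustrative, not taken from the study.

```python
def combine_either(ai_flags, reader_flags):
    """Abnormal if the AI or the reader (or both) flagged the examination."""
    return [a or r for a, r in zip(ai_flags, reader_flags)]


def sensitivity_specificity(flags, labels):
    """Return (sensitivity, specificity) for binary flags against 0/1 labels."""
    flagged_pos = sum(f for f, y in zip(flags, labels) if y == 1)
    cleared_neg = sum(not f for f, y in zip(flags, labels) if y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return flagged_pos / n_pos, cleared_neg / n_neg


# Hypothetical toy data: three cancers followed by three healthy controls.
labels = [1, 1, 1, 0, 0, 0]
ai_flags = [True, False, False, True, False, False]
reader_flags = [False, True, False, False, False, True]

combined = combine_either(ai_flags, reader_flags)
ai_alone = sensitivity_specificity(ai_flags, labels)
both = sensitivity_specificity(combined, labels)
```

In this toy case the combined rule detects two of three cancers instead of one, but clears only one of three controls instead of two. That is the same direction of trade-off as the reported 88.6% sensitivity at 93.0% specificity, compared with the 96.6% specificity cut point used for single readers.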

Project description: Importance: Expert-level artificial intelligence (AI) algorithms for prostate biopsy grading have recently been developed. However, the potential impact of integrating such algorithms into pathologist workflows remains largely unexplored. Objective: To evaluate an expert-level AI-based assistive tool when used by pathologists for the grading of prostate biopsies. Design, setting, and participants: This diagnostic study used a fully crossed multiple-reader, multiple-case design to evaluate an AI-based assistive tool for prostate biopsy grading. Retrospective grading of prostate core needle biopsies from 2 independent medical laboratories in the US was performed between October 2019 and January 2020. A total of 20 general pathologists reviewed 240 prostate core needle biopsies from 240 patients. Each pathologist was randomized to 1 of 2 study cohorts. The 2 cohorts reviewed every case in the opposite modality (with AI assistance vs without AI assistance) to each other, with the modality switching after every 10 cases. After a minimum 4-week washout period for each batch, the pathologists reviewed the cases for a second time using the opposite modality. The pathologist-provided grade group for each biopsy was compared with the majority opinion of urologic pathology subspecialists. Exposure: An AI-based assistive tool for Gleason grading of prostate biopsies. Main outcomes and measures: Agreement between pathologists and subspecialists with and without the use of an AI-based assistive tool for the grading of all prostate biopsies and Gleason grade group 1 biopsies. Results: Biopsies from 240 patients (median age, 67 years; range, 39-91 years) with a median prostate-specific antigen level of 6.5 ng/mL (range, 0.6-97.0 ng/mL) were included in the analyses.
Artificial intelligence-assisted review by pathologists was associated with a 5.6% increase (95% CI, 3.2%-7.9%; P < .001) in agreement with subspecialists (from 69.7% for unassisted reviews to 75.3% for assisted reviews) across all biopsies and a 6.2% increase (95% CI, 2.7%-9.8%; P = .001) in agreement with subspecialists (from 72.3% for unassisted reviews to 78.5% for assisted reviews) for grade group 1 biopsies. A secondary analysis indicated that AI assistance was also associated with improvements in tumor detection, mean review time, mean self-reported confidence, and interpathologist agreement. Conclusions and relevance: In this study, the use of an AI-based assistive tool for the review of prostate biopsies was associated with improvements in the quality, efficiency, and consistency of cancer detection and grading.

Project description: Objectives: To present a framework to develop and implement a fast-track artificial intelligence (AI) curriculum into an existing radiology residency program, with the potential to prepare a new generation of AI conscious radiologists. Methods: The AI-curriculum framework comprises five sequential steps: (1) forming a team of AI experts, (2) assessing the residents' knowledge level and needs, (3) defining learning objectives, (4) matching these objectives with effective teaching strategies, and finally (5) implementing and evaluating the pilot. Following these steps, a multidisciplinary team of AI engineers, radiologists, and radiology residents designed a 3-day program, including didactic lectures, hands-on laboratory sessions, and group discussions with experts to enhance AI understanding. Pre- and post-curriculum surveys were conducted to assess participants' expectations and progress and were analyzed using a Wilcoxon rank-sum test. Results: There was 100% response rate to the pre- and post-curriculum survey (17 and 12 respondents, respectively). Participants' confidence in their knowledge and understanding of AI in radiology significantly increased after completing the program (pre-curriculum means 3.25 ± 1.48 (SD), post-curriculum means 6.5 ± 0.90 (SD), p-value = 0.002). A total of 75% confirmed that the course addressed topics that were applicable to their work in radiology. Lectures on the fundamentals of AI and group discussions with experts were deemed most useful. Conclusion: Designing an AI curriculum for radiology residents and implementing it into a radiology residency program is feasible using the framework presented.
The 3-day AI curriculum effectively increased participants' perception of knowledge and skills about AI in radiology and can serve as a starting point for further customization. Critical relevance statement: The framework provides guidance for developing and implementing an AI curriculum in radiology residency programs, educating residents on the application of AI in radiology and ultimately contributing to future high-quality, safe, and effective patient care. Key points: • AI education is necessary to prepare a new generation of AI-conscious radiologists. • The AI curriculum increased participants' perception of AI knowledge and skills in radiology. • This five-step framework can assist integrating AI education into radiology residency programs.

Project description: Breast ultrasound provides a first-line evaluation for breast masses, but the majority of the world lacks access to any form of diagnostic imaging. In this pilot study, we assessed the combination of artificial intelligence (Samsung S-Detect for Breast) with volume sweep imaging (VSI) ultrasound scans to evaluate the possibility of inexpensive, fully automated breast ultrasound acquisition and preliminary interpretation without an experienced sonographer or radiologist. This study was conducted using examinations from a curated data set from a previously published clinical study of breast VSI. Examinations in this data set were obtained by medical students without prior ultrasound experience who performed VSI using a portable Butterfly iQ ultrasound probe. Standard of care ultrasound exams were performed concurrently by an experienced sonographer using a high-end ultrasound machine. Expert-selected VSI images and standard of care images were input into S-Detect, which output mass features and a classification of "possibly benign" or "possibly malignant." The S-Detect VSI report was then compared with 1) the standard of care ultrasound report by an expert radiologist, 2) the standard of care ultrasound S-Detect report, 3) the VSI report by an expert radiologist, and 4) the pathological diagnosis. There were 115 masses analyzed by S-Detect from the curated data set. There was substantial agreement of the S-Detect interpretation of VSI among cancers, cysts, fibroadenomas, and lipomas to the expert standard of care ultrasound report (Cohen's κ = 0.73 (0.57-0.9 95% CI), p<0.0001), the standard of care ultrasound S-Detect interpretation (Cohen's κ = 0.79 (0.65-0.94 95% CI), p<0.0001), the expert VSI ultrasound report (Cohen's κ = 0.73 (0.57-0.9 95% CI), p<0.0001), and the pathological diagnosis (Cohen's κ = 0.80 (0.64-0.95 95% CI), p<0.0001).
All pathologically proven cancers (n = 20) were designated as "possibly malignant" by S-Detect with a sensitivity of 100% and specificity of 86%. Integration of artificial intelligence and VSI could allow both acquisition and interpretation of ultrasound images without a sonographer and radiologist. This approach holds potential for increasing access to ultrasound imaging and therefore improving outcomes related to breast cancer in low- and middle-income countries.
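Agreement in this description is reported as Cohen's κ, which corrects the observed agreement between two raters for the agreement expected by chance. A minimal sketch of the unweighted statistic follows. The category labels and ratings are hypothetical toy values, and the confidence intervals reported above are not reproduced here.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal category rates.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy ratings for four masses.
ai_report = ["cancer", "cancer", "cyst", "cyst"]
expert_report = ["cancer", "cyst", "cyst", "cyst"]
kappa = cohens_kappa(ai_report, expert_report)
```

Here the raters agree on three of four masses (0.75), chance agreement is 0.5, and κ is 0.5.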

Project description: Background: Thyroid cancer is the most common endocrine cancer in the world. Accurately distinguishing between benign and malignant thyroid nodules is particularly important for the early diagnosis and treatment of thyroid cancer. This study aimed to investigate the best possible optimization strategies for an already-trained artificial intelligence (AI)-based automated diagnostic system for thyroid nodule screening and, in addition, to scrutinize the clinically relevant limitations using stratified analysis to better standardize the application in clinical workflows. Methods: We retrospectively reviewed a total of 1,092 ultrasound images associated with 397 thyroid nodules collected from 287 patients between April 2019 and January 2021, applying postoperative pathology as the gold standard. We applied different statistical approaches, including averages, maximums, and percentiles, to estimate per-nodule-based malignancy scores from the malignancy scores per image predicted by AI-SONIC Thyroid v. 5.3.0.2 (Demetics Medical Technology Ltd., Hangzhou, China) system, and we assessed its diagnostic efficacy on nodules of different sizes or tumor types with per-nodule analysis using performance metrics. Results: Of the 397 thyroid nodules, 272 thyroid nodules were overrepresented by malignant nodules according to the results of the surgical pathological examinations. Taking the median of the malignancy scores per image to estimate the nodule-based score with a cutoff value of 0.56 optimized for the means of sensitivity and specificity, the AI-based automated detection system demonstrated slightly lower sensitivity, significantly higher specificity (almost independent of nodule size), and similar accuracy to that of the senior radiologist. Both the AI system and the senior radiologist demonstrated higher sensitivity in diagnosing smaller nodules (≤25 mm) and comparable diagnostic performances for larger nodules.
The mean diagnostic time per nodule of the AI system was 0.146 s, which was in sharp contrast to the 2.8 to 4.5 min of the radiologists. Conclusions: Using our optimization strategy to achieve nodule-based diagnosis, the AI-SONIC Thyroid automated diagnostic system demonstrated an overall diagnostic accuracy equivalent to that of senior radiologists. Thus, it is expected that it can be used as a reliable auxiliary diagnostic method by radiologists for the screening and preoperative evaluation of malignant thyroid nodules.
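The per-nodule strategy in this description aggregates the per-image malignancy scores of each nodule (the median is the option reported) and compares the result with a cutoff of 0.56. The sketch below shows one way that aggregation could look. The nodule identifiers and scores are hypothetical, and treating a score at or above the cutoff as malignant is an assumption, since the description does not state how ties are handled.

```python
from statistics import median

CUTOFF = 0.56  # cutoff value quoted in the description


def nodule_scores(image_scores_by_nodule, aggregate=median):
    """Collapse each nodule's per-image scores into a single score."""
    return {
        nodule: aggregate(scores)
        for nodule, scores in image_scores_by_nodule.items()
    }


def classify_nodules(image_scores_by_nodule, cutoff=CUTOFF, aggregate=median):
    """Label a nodule malignant when its aggregated score reaches the cutoff.

    Assumption: score >= cutoff counts as malignant.
    """
    per_nodule = nodule_scores(image_scores_by_nodule, aggregate)
    return {nodule: score >= cutoff for nodule, score in per_nodule.items()}


# Hypothetical per-image scores for two nodules.
toy = {
    "nodule_a": [0.20, 0.70, 0.90],
    "nodule_b": [0.10, 0.50],
}
medians = nodule_scores(toy)
calls = classify_nodules(toy)
```

Passing `aggregate=max` or a mean function instead of `median` mirrors the other aggregation approaches the description mentions (averages, maximums).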

Project description: Background: Corneal topography is a clinically validated examination method for keratoconus. However, there is no clear guideline regarding patient selection for corneal topography. We developed and validated a novel artificial intelligence (AI) model to identify patients who would benefit from corneal topography based on basic ophthalmologic examinations, including a survey of visual impairment, best-corrected visual acuity (BCVA) measurement, intraocular pressure (IOP) measurement, and autokeratometry. Methods: A total of five AI models (three individual models with fully connected neural network including the XGBoost, and the TabNet models, and two ensemble models with hard and soft voting methods) were trained and validated. We used three datasets collected from the records of 2,613 patients' basic ophthalmologic examinations from two institutions to train and validate the AI models. We trained the AI models using a dataset from a third medical institution to determine whether corneal topography was needed to detect keratoconus. Finally, prospective intra-validation dataset (internal test dataset) and extra-validation dataset from a different medical institution (external test dataset) were used to assess the performance of the AI models. Results: The ensemble model with soft voting method outperformed all other AI models in sensitivity when predicting which patients needed corneal topography (90.5% in internal test dataset and 96.4% in external test dataset). In the error analysis, most of the predicting error occurred within the range of the subclinical keratoconus and the suspicious D-score in the Belin-Ambrósio enhanced ectasia display.
In the feature importance analysis, out of 18 features, IOP was the highest ranked feature when comparing the average value of the relative attributions of three individual AI models, followed by the difference in the value of mean corneal power. Conclusion: An AI model using the results of basic ophthalmologic examination has the potential to recommend corneal topography for keratoconus. In this AI algorithm, IOP and the difference between the two eyes, which may be undervalued clinical information, were important factors in the success of the AI model, and may be worth further reviewing in research and clinical practice for keratoconus screening.
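This description contrasts ensembles built with hard and soft voting. As a generic illustration of the difference (not the authors' implementation), the sketch below assumes three models that each output a probability for "needs corneal topography" and a decision threshold of 0.5. Both the probabilities and the threshold are hypothetical.

```python
def hard_vote(probabilities, threshold=0.5):
    """Majority vote: each model casts a yes/no vote at the threshold."""
    votes = [p >= threshold for p in probabilities]
    return sum(votes) > len(votes) / 2


def soft_vote(probabilities, threshold=0.5):
    """Average the model probabilities, then apply the threshold once."""
    return sum(probabilities) / len(probabilities) >= threshold


# Hypothetical outputs of three individual models for two patients.
patient_1 = [0.9, 0.4, 0.4]  # one confident model, two mildly negative
patient_2 = [0.6, 0.6, 0.1]  # two mildly positive, one confident negative

decisions = {
    "patient_1": (hard_vote(patient_1), soft_vote(patient_1)),
    "patient_2": (hard_vote(patient_2), soft_vote(patient_2)),
}
```

The two rules disagree on both toy patients: soft voting lets one confident model outweigh two lukewarm ones, while hard voting counts every model equally. This is one plausible reason the two ensembles can differ in sensitivity.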

Project description: Importance: Contemporary approaches to artificial intelligence (AI) based on deep learning have generated interest in the application of AI to breast cancer screening (BCS). The US Food and Drug Administration (FDA) has approved several next-generation AI products indicated for BCS in recent years; however, questions regarding their accuracy, appropriate use, and clinical utility remain. Objectives: To describe the current FDA regulatory process for AI products, summarize the evidence used to support FDA clearance and approval of AI products indicated for BCS, consider the advantages and limitations of current regulatory approaches, and suggest ways to improve the current system. Evidence review: Premarket notifications and other publicly available documents used for FDA clearance and approval of AI products indicated for BCS from January 1, 2017, to December 31, 2021. Findings: Nine AI products indicated for BCS for identification of suggestive lesions and mammogram triage were included. Most of the products had been cleared through the 510(k) pathway, and all clearances were based on previously collected retrospective data; 6 products used multicenter designs; 7 products used enriched data; and 4 lacked details on whether products were externally validated. Test performance measures, including sensitivity, specificity, and area under the curve, were the main outcomes reported. Most of the devices used tissue biopsy as the criterion standard for BCS accuracy evaluation. Other clinical outcome measures, including cancer stage at diagnosis and interval cancer detection, were not reported for any of the devices. Conclusions and relevance: The findings of this review suggest important gaps in reporting of data sources, data set type, validation approach, and clinical utility assessment.
As AI-assisted reading becomes more widespread in BCS and other radiologic examinations, strengthened FDA evidentiary regulatory standards, development of postmarketing surveillance, a focus on clinically meaningful outcomes, and stakeholder engagement will be critical for ensuring the safety and efficacy of these products.