Dataset Information

Using statistical and machine learning to help institutions detect suspicious access to electronic health records.

ABSTRACT:

Objective

To determine whether statistical and machine-learning methods, when applied to electronic health record (EHR) access data, could help identify suspicious (ie, potentially inappropriate) access to EHRs.

Methods

From EHR access logs and other organizational data collected over a 2-month period, the authors extracted 26 features likely to be useful in detecting suspicious accesses. Selected events were marked as either suspicious or appropriate by privacy officers, and served as the gold standard set for model evaluation. The authors trained logistic regression (LR) and support vector machine (SVM) models on 10-fold cross-validation sets of 1291 labeled events. The authors evaluated the sensitivity of final models on an external set of 58 events that were identified as truly inappropriate and investigated independently from this study using standard operating procedures.
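The 10-fold cross-validation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `k_fold_indices`, the shuffling seed, and the interleaved fold assignment are assumptions, and the abstract does not say whether folds were stratified by label.

```python
import random

def k_fold_indices(n_events, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: the abstract only states that 10-fold
    cross-validation was used on 1291 labeled events.
    """
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 1291 labeled events split into 10 folds: one fold of 130 and nine of 129.
splits = list(k_fold_indices(1291, k=10))
```

Each event appears in exactly one test fold, so every labeled event is scored once by a model that did not see it during training.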

Results

The area under the receiver operating characteristic curve of the models on the whole data set of 1291 events was 0.91 for LR, and 0.95 for SVM. The sensitivity of the baseline model on this set was 0.8. When the final models were evaluated on the set of 58 investigated events, all of which were determined as truly inappropriate, the sensitivity was 0 for the baseline method, 0.76 for LR, and 0.79 for SVM.
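The two metrics reported above can be computed as in the following sketch. The labels, scores, and the 0.5 threshold are hypothetical toy values chosen for illustration; they are not data from the study, and the function names are assumptions.

```python
def sensitivity(y_true, y_pred):
    """Fraction of truly suspicious events (label 1) that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive scores higher than a random negative (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = suspicious, 0 = appropriate.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.1, 0.2, 0.7]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

toy_sensitivity = sensitivity(labels, preds)  # 3 of 4 positives flagged
toy_auc = roc_auc(labels, scores)             # 15 of 16 pairs ranked correctly
```

Because the ROC curve needs both positive and negative examples, an evaluation set that contains only inappropriate accesses, such as the 58 investigated events, supports a sensitivity estimate but not an AUC.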

Limitations

The LR and SVM models may not generalize because of interinstitutional differences in organizational structures, applications, and workflows. Nevertheless, our approach for constructing the models using statistical and machine-learning techniques can be generalized. An important limitation is the relatively small sample used for the training set due to the effort required for its construction.

Conclusion

The results suggest that statistical and machine-learning methods can play an important role in helping privacy officers detect suspicious accesses to EHRs.

SUBMITTER: Boxwala AA

PROVIDER: S-EPMC3128412 | biostudies-literature | 2011 Jul-Aug

REPOSITORIES: biostudies-literature

Publications

Using statistical and machine learning to help institutions detect suspicious access to electronic health records.

Boxwala AA, Kim J, Grillo JM, Ohno-Machado L

Journal of the American Medical Informatics Association (JAMIA), 2011-07-01, issue 4

PMID: 21672912

Similar Datasets

Project description: Semi-quantitative scoring schemes like the Consortium to Establish a Registry for Alzheimer's Disease (CERAD) are the most commonly used method in Alzheimer's disease (AD) neuropathology practice. Computational approaches based on machine learning have recently generated quantitative scores for whole slide images (WSIs) that are highly correlated with human-derived semi-quantitative scores, such as those of CERAD, for Alzheimer's disease pathology. However, the robustness of such models has yet to be tested in different cohorts. To validate previously published machine learning algorithms using convolutional neural networks (CNNs) and determine if pathological heterogeneity may alter algorithm-derived measures, 40 cases from the Goizueta Emory Alzheimer's Disease Center brain bank displaying an array of pathological diagnoses (including AD with and without Lewy body disease (LBD), and/or TDP-43-positive inclusions) and levels of Aβ pathologies were evaluated. Furthermore, to provide deeper phenotyping, amyloid burden in gray matter vs whole tissue was compared, and quantitative CNN scores for both correlated significantly with CERAD-like scores. Quantitative scores also show clear stratification based on AD pathologies with or without additional diagnoses (including LBD and TDP-43 inclusions) vs cases with no significant neurodegeneration (control cases) as well as NIA Reagan scoring criteria. Specifically, the concomitant diagnosis group of AD + TDP-43 showed a significantly greater CNN score for cored plaques than the AD group. Finally, we report that whole tissue computational scores correlate better with CERAD-like categories than computational scores from the field of view with the densest pathology, which is the standard of practice in neuropathological assessment per CERAD guidelines. Together these findings validate and expand CNN models to be robust to cohort variations and provide additional proof-of-concept for future studies to incorporate machine learning algorithms into neuropathological practice.

Project description:
Introduction: Predictive models have been used to aid early diagnosis of PCOS, though existing models are based on small sample sizes and limited to fertility clinic populations. We built a predictive model using machine learning algorithms based on an outpatient population at risk for PCOS to predict risk and facilitate earlier diagnosis, particularly among those who meet diagnostic criteria but have not received a diagnosis.
Methods: This is a retrospective cohort study from a safety-net hospital's electronic health records (EHR) from 2003-2016. The study population included 30,601 women aged 18-45 years without concurrent endocrinopathy who had any visit to Boston Medical Center for primary care, obstetrics and gynecology, endocrinology, family medicine, or general internal medicine. Four prediction outcomes were assessed for PCOS. The first outcome was PCOS ICD-9 diagnosis, with additional model outcomes of algorithm-defined PCOS. The latter was based on Rotterdam criteria and merging laboratory values, radiographic imaging, and ICD data from the EHR to define irregular menstruation, hyperandrogenism, and polycystic ovarian morphology on ultrasound.
Results: We developed predictive models using four machine learning methods: logistic regression, support vector machine, gradient boosted trees, and random forests. Hormone values (follicle-stimulating hormone, luteinizing hormone, estradiol, and sex hormone binding globulin) were combined to create a multilayer perceptron score using a neural network classifier. Prediction of PCOS prior to clinical diagnosis in an out-of-sample test set of patients achieved average AUCs of 85%, 81%, 80%, and 82% in Models I, II, III, and IV, respectively. Significant positive predictors of PCOS diagnosis across models included hormone levels and obesity; negative predictors included gravidity and positive bHCG.
Conclusion: Machine learning algorithms were used to predict PCOS based on a large at-risk population. This approach may guide early detection of PCOS within EHR-interfaced populations to facilitate counseling and interventions that may reduce long-term health consequences. Our model illustrates the potential benefits of an artificial intelligence-enabled provider assistance tool that can be integrated into the EHR to reduce delays in diagnosis. However, model validation in other hospital-based populations is necessary.

Project description:
Background: Electronic health records provide the opportunity to use machine learning techniques to identify undiagnosed individuals who are likely to have a given disease and who could then benefit from more medical screening and case finding, reducing the number needed to screen with convenience and healthcare cost savings. Ensemble machine learning models combining multiple prediction estimates into one are often said to provide better predictive performances than non-ensemble models. Yet, to our knowledge, no literature review summarises the use and performances of different types of ensemble machine learning models in the context of medical pre-screening.
Method: We aimed to conduct a scoping review of the literature reporting the derivation of ensemble machine learning models for screening of electronic health records. We searched EMBASE and MEDLINE databases across all years applying a formal search strategy using terms related to medical screening, electronic health records and machine learning. Data were collected, analysed, and reported in accordance with the PRISMA scoping review guideline.
Results: A total of 3355 articles were retrieved, of which 145 articles met our inclusion criteria and were included in this study. Ensemble machine learning models were increasingly employed across several medical specialties and often outperformed non-ensemble approaches. Ensemble machine learning models with complex combination strategies and heterogeneous classifiers often outperformed other types of ensemble machine learning models but were also less used. The methodologies, processing steps, and data sources of ensemble machine learning models were often not clearly described.
Conclusions: Our work highlights the importance of deriving and comparing the performances of different types of ensemble machine learning models when screening electronic health records and underscores the need for more comprehensive reporting of machine learning methodologies employed in clinical research.
