Dataset Information

Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India.

ABSTRACT: BACKGROUND: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally used quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the increasing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability of coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in the data collection that could compromise quality. OBJECTIVE: This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics. METHODS: In the Kilkari impact evaluation's end-line survey amongst postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning-based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time-stamps, "don't know" rates, and skip rates. We will also obtain labeled data from self-filled surveys, and build models using k-fold cross-validation on a training data set using both supervised and unsupervised learning algorithms. Based on these models, results will be fed back to the field through various feedback loops. RESULTS: Data collection began in late October 2019 and will span through March 2020. We expect to submit quality assurance results by August 2020. CONCLUSIONS: Machine learning is underutilized as a tool to improve survey data quality in low resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as in the routine collection of health information by health workers. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/17619.
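The feature-creation and cross-validation steps named in the METHODS can be illustrated with a minimal sketch. This is not the study's code: the answer codes (`"DK"` for "don't know", `None` for a skipped question), the timestamp format, the modified z-score rule for outliers, and all function names are assumptions chosen for illustration only.

```python
from datetime import datetime
from statistics import median

DONT_KNOW = "DK"  # hypothetical code for a "don't know" answer
SKIPPED = None    # hypothetical marker for a skipped question


def interview_features(start, end, answers):
    """Return duration in minutes, "don't know" rate, and skip rate for one interview."""
    fmt = "%Y-%m-%d %H:%M"  # assumed timestamp format
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    n = len(answers)
    return {
        "minutes": elapsed.total_seconds() / 60,
        "dk_rate": sum(a == DONT_KNOW for a in answers) / n,
        "skip_rate": sum(a is SKIPPED for a in answers) / n,
    }


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on median and MAD) exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size


# Illustrative values only, not study data.
features = interview_features("2019-11-01 09:00", "2019-11-01 09:45", ["yes", "DK", None, "no"])
flags = flag_outliers([40, 42, 45, 43, 41, 5])  # durations in minutes
folds = list(k_fold_indices(5, 2))
```

In this sketch an unusually short interview is flagged because its duration lies far from the median; in practice the same rule could be applied to the "don't know" and skip rates per enumerator.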

SUBMITTER: Shah N

PROVIDER: S-EPMC7439143 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature

Publications

Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India.

Neha Shah, Diwakar Mohan, Jean Juste Harisson Bashingwa, Osama Ummer, Arpita Chakraborty, Amnesty E LeFevre

JMIR Research Protocols, 2020-08-05, issue 8

PMID: 32755886

Similar Datasets

Project description: Background: Machine learning is a promising tool in the area of suicide prevention due to its ability to combine the effects of multiple risk factors and complex interactions. The power of machine learning has led to an influx of studies on suicide prediction, as well as a few recent reviews. Our study distinguished between data sources and reported the most important predictors of suicide outcomes identified in the literature. Objective: Our study aimed to identify studies that applied machine learning techniques to administrative and survey data, summarize performance metrics reported in those studies, and enumerate the important risk factors of suicidal thoughts and behaviors identified. Methods: A systematic literature search of PubMed, Medline, Embase, PsycINFO, Web of Science, Cumulative Index to Nursing and Allied Health Literature (CINAHL), and Allied and Complementary Medicine Database (AMED) to identify all studies that have used machine learning to predict suicidal thoughts and behaviors using administrative and survey data was performed. The search was conducted for articles published between January 1, 2019 and May 11, 2022. In addition, all articles identified in three recently published systematic reviews (the last of which included studies up until January 1, 2019) were retained if they met our inclusion criteria. The predictive power of machine learning methods in predicting suicidal thoughts and behaviors was explored using box plots to summarize the distribution of the area under the receiver operating characteristic curve (AUC) values by machine learning method and suicide outcome (i.e., suicidal thoughts, suicide attempt, and death by suicide). Mean AUCs with 95% confidence intervals (CIs) were computed for each suicide outcome by study design, data source, total sample size, sample size of cases, and machine learning methods employed. The most important risk factors were listed. Results: The search strategy identified 2,200 unique records, of which 104 articles met the inclusion criteria. Machine learning algorithms achieved good prediction of suicidal thoughts and behaviors (i.e., an AUC between 0.80 and 0.89); however, their predictive power appears to differ across suicide outcomes. The boosting algorithms achieved good prediction of suicidal thoughts, death by suicide, and all suicide outcomes combined, while neural network algorithms achieved good prediction of suicide attempts. The risk factors for suicidal thoughts and behaviors differed depending on the data source and the population under study. Conclusion: The predictive utility of machine learning for suicidal thoughts and behaviors largely depends on the approach used. The findings of the current review should prove helpful in preparing future machine learning models using administrative and survey data. Systematic review registration: https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42022333454, identifier CRD42022333454.
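As a rough illustration of the summary statistics mentioned in this description, the sketch below computes an AUC from labels and scores and then a mean AUC with a 95% confidence interval across several studies. The review does not state how its intervals were derived, so the normal approximation used here is an assumption, and the function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def auc(labels, scores):
    """AUC as the share of positive/negative pairs where the positive scores higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_with_ci(values, z=1.96):
    """Mean with a normal-approximation 95% CI (assumed method, not taken from the review)."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


# Illustrative values only, not results from the review.
single_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
summary = mean_with_ci([0.80, 0.85, 0.90])
```

Grouping the per-study AUCs by outcome or by method before calling `mean_with_ci` would mirror the stratified summaries the description refers to.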

Project description:Intrinsically disordered proteins (IDPs) and proteins with intrinsically disordered regions (IDRs) play important roles in many aspects of normal cell physiology, such as signal transduction and transcription, as well as pathological states, including Alzheimer's, Parkinson's, and Huntington's disease. Unlike their globular counterparts that are defined by a few structures and free energy minima, IDP/IDR comprise a large ensemble of rapidly interconverting structures and a corresponding free energy landscape characterized by multiple minima. This aspect has precluded the use of structural biological techniques, such as X-ray crystallography and nuclear magnetic resonance (NMR) for resolving their structures. Instead, low-resolution techniques, such as small-angle X-ray or neutron scattering (SAXS/SANS), have become a mainstay in characterizing coarse features of the ensemble of structures. These are typically complemented with NMR data if possible or computational techniques, such as atomistic molecular dynamics, to further resolve the underlying ensemble of structures. However, over the past 10-15 years, it has become evident that the classical, pairwise-additive force fields that have enjoyed a high degree of success for globular proteins have been somewhat limited in modeling IDP/IDR structures that agree with experiment. There has thus been a significant effort to rehabilitate these models to obtain better agreement with experiment, typically done by optimizing parameters in a piecewise fashion. In this work, we take a different approach by optimizing a set of force field parameters simultaneously, using machine learning to adapt force field parameters to experimental SAXS scattering profiles. 
We demonstrate our approach in modeling three biologically relevant IDP ensembles based on experimental SAXS profiles and show that our optimization approach significantly improves force field parameters, which generate ensembles in better agreement with experiment.

Project description: Background: Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective: To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials: An all-cause admissions dataset was collected over six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same as those used to determine the administrative cost of the child's treatment. Methods: Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and training set size. Results: The increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however, the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, respectively). The NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms. Conclusions: We recommend the use of Naive Bayes (NB) with a Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.
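The class imbalance correction mentioned above can be pictured with a simplified SMOTE-style sketch: each synthetic minority row is placed on the line segment between a minority row and one of its nearest minority neighbours. This is a teaching sketch under stated assumptions, not the study's pipeline or a library implementation; the function name, the parameters `k` and `seed`, and the sample rows are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows and near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to the base row.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Illustrative feature rows for the minority (readmitted) class, not study data.
readmitted = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0)]
extra_rows = smote_like(readmitted, n_new=5)
```

Oversampling of this kind would normally be applied to the training folds only, so that the test folds keep the original class balance.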

OmicsDI is part of the ELIXIR infrastructure