Dataset Information

RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.

ABSTRACT: Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10^-9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
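As a rough illustration of the metrics named in the abstract, the sketch below computes accuracy, cross-entropy loss, and per-class precision and recall for a multiclass classifier. It is not the authors' code: the function names, the three-class setup, and the toy probabilities are all hypothetical, chosen only to show how each metric is defined.

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted class matches the true class.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, probs, eps=1e-12):
    # Mean negative log-probability assigned to the true class.
    return -sum(math.log(max(p[t], eps)) for t, p in zip(y_true, probs)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    # One-vs-rest precision and recall for a single class label.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data: 4 samples, 3 classes (not from the study).
y_true = [0, 1, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
]
# Predicted class = index of the highest probability.
y_pred = [max(range(len(p)), key=p.__getitem__) for p in probs]

acc = accuracy(y_true, y_pred)
loss = cross_entropy(y_true, probs)
prec_1, rec_1 = precision_recall(y_true, y_pred, label=1)
```

Cross-entropy penalizes confident wrong predictions more heavily than accuracy does, which is one reason the two metrics are often reported together.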

SUBMITTER: Kim JS

PROVIDER: S-EPMC5940243 | biostudies-literature | 2018 Apr

REPOSITORIES: biostudies-literature

Publications

RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.

Kim Ji-Sung, Gao Xin, Rzhetsky Andrey

PLoS Computational Biology, 2018-04-26, issue 4

PMID: 29698408

Similar Datasets

Project description: Background: Monitoring progress toward population health equity goals requires developing robust disparity indicators. However, surveillance data gaps that result in undercounting racial and ethnic minority groups might influence the observed disparity measures. Objective: This study aimed to assess the impact of missing race and ethnicity data in surveillance systems on disparity measures. Methods: We explored variations in missing race and ethnicity information in reported annual chlamydia and gonorrhea diagnoses in the United States from 2007 to 2018 by state, year, reported sex, and infection. For diagnoses with incomplete demographic information in 2018, we estimated disparity measures (relative rate ratio and rate difference) with 5 imputation scenarios compared with the base case (no adjustments). The 5 scenarios used the racial and ethnic distribution of chlamydia or gonorrhea diagnoses in the same state, chlamydia or gonorrhea diagnoses in neighboring states, chlamydia or gonorrhea diagnoses within the geographic region, HIV diagnoses, and syphilis diagnoses. Results: In 2018, a total of 31.93% (560,551/1,755,510) of chlamydia and 22.11% (128,790/582,475) of gonorrhea diagnoses had missing race and ethnicity information. Missingness differed by infection type but not by reported sex. Missing race and ethnicity information varied widely across states and times (range across state-years: from 0.0% to 96.2%). The rate ratio remained similar in the imputation scenarios, although the rate difference differed nationally and in some states. Conclusions: We found that missing race and ethnicity information affects measured disparities, which is important to consider when interpreting disparity metrics. Addressing missing information in surveillance systems requires system-level solutions, such as collecting more complete laboratory data, improving the linkage of data systems, and designing more efficient data collection procedures. As a short-term solution, local public health agencies can adapt these imputation scenarios to their aggregate data to adjust surveillance data for use in population indicators of health equity.
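The disparity measures in this description can be sketched in a few lines. The example below is an illustrative assumption, not the study's code: it reproduces the two missingness percentages reported above, then uses hypothetical groups "A" and "B" with made-up counts to show how redistributing cases with missing race and ethnicity in proportion to a reference distribution can leave the rate ratio unchanged while shifting the rate difference.

```python
def rate_per_100k(cases, population):
    # Diagnoses per 100,000 population.
    return 100_000 * cases / population

def redistribute_missing(known_cases, missing, reference_counts):
    # Allocate `missing` cases across groups in proportion to reference_counts.
    total = sum(reference_counts.values())
    return {
        group: known_cases[group] + missing * reference_counts[group] / total
        for group in known_cases
    }

def rate_ratio(rate_group, rate_reference):
    return rate_group / rate_reference

def rate_difference(rate_group, rate_reference):
    return rate_group - rate_reference

# Missingness shares as reported in the description (2018 diagnoses).
chlamydia_missing_pct = round(100 * 560_551 / 1_755_510, 2)
gonorrhea_missing_pct = round(100 * 128_790 / 582_475, 2)

# Hypothetical toy numbers (not from the study): two groups, equal populations.
known = {"A": 300, "B": 100}
population = {"A": 100_000, "B": 100_000}

# Base case: no adjustment for the 200 cases with missing race and ethnicity.
base = {g: rate_per_100k(known[g], population[g]) for g in known}

# Scenario: redistribute the missing cases using the distribution of known cases.
adjusted_cases = redistribute_missing(known, missing=200, reference_counts=known)
adjusted = {g: rate_per_100k(adjusted_cases[g], population[g]) for g in known}

base_rr = rate_ratio(base["A"], base["B"])
base_rd = rate_difference(base["A"], base["B"])
adj_rr = rate_ratio(adjusted["A"], adjusted["B"])
adj_rd = rate_difference(adjusted["A"], adjusted["B"])
```

In this toy case the rate ratio stays at 3.0 while the rate difference grows from 200 to 300 per 100,000, because proportional redistribution scales both rates by the same factor. A reference distribution that differs from the known cases would change the ratio as well.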

Project description: Background: Deep learning algorithms derived in homogeneous populations may be poorly generalizable and have the potential to reflect, perpetuate, and even exacerbate racial/ethnic disparities in health and health care. In this study, we aimed to (1) assess whether the performance of a deep learning algorithm designed to detect low left ventricular ejection fraction using the 12-lead ECG varies by race/ethnicity and to (2) determine whether its performance is determined by the derivation population or by racial variation in the ECG. Methods: We performed a retrospective cohort analysis that included 97 829 patients with paired ECGs and echocardiograms. We tested the model performance by race/ethnicity for a convolutional neural network designed to identify patients with a left ventricular ejection fraction ≤35% from the 12-lead ECG. Results: The convolutional neural network that was previously derived in a homogeneous population (derivation cohort, n=44 959; 96.2% non-Hispanic white) demonstrated consistent performance to detect low left ventricular ejection fraction across a range of racial/ethnic subgroups in a separate testing cohort (n=52 870): non-Hispanic white (n=44 524; area under the curve [AUC], 0.931), Asian (n=557; AUC, 0.961), black/African American (n=651; AUC, 0.937), Hispanic/Latino (n=331; AUC, 0.937), and American Indian/Native Alaskan (n=223; AUC, 0.938). In secondary analyses, a separate neural network was able to discern racial subgroup category (black/African American [AUC, 0.84], and white, non-Hispanic [AUC, 0.76] in a 5-class classifier), and a network trained only in non-Hispanic whites from the original derivation cohort performed similarly well across a range of racial/ethnic subgroups in the testing cohort with an AUC of at least 0.930 in all racial/ethnic subgroups. Conclusions: Our study demonstrates that while ECG characteristics vary by race, this did not impact the ability of a convolutional neural network to predict low left ventricular ejection fraction from the ECG. We recommend reporting of performance among diverse ethnic, racial, age, and sex groups for all new artificial intelligence tools to ensure responsible use of artificial intelligence in medicine.

Project description: Background: Higher socioeconomic status (SES) indicators such as educational attainment and income reduce the risk of chronic lung diseases (CLDs) such as Chronic Obstructive Pulmonary Disease (COPD), emphysema, chronic bronchitis, and asthma. Marginalization-related Diminished Returns (MDRs) refer to smaller health benefits of high SES for marginalized populations such as racial and ethnic minorities compared to socially privileged groups such as non-Hispanic Whites. It is still unknown, however, if MDRs also apply to the effects of education and income on CLDs. Purpose: Using a nationally representative sample, the current study explored racial and ethnic variation in the associations between educational attainment and income and CLDs among American adults. Methods: In this study, we analyzed data (n = 25,659) from a nationally representative survey of American adults in 2013 and 2014. Wave one of the Population Assessment of Tobacco and Health (PATH)-Adult study was used. The independent variables were educational attainment (less than high school = 1, high school graduate = 2, and college graduate = 3) and income (living out of poverty = 1, living in poverty = 0). The dependent variable was any CLDs (i.e., COPD, emphysema, chronic bronchitis, and asthma). Age, gender, employment, and region were the covariates. Race and ethnicity were the moderators. Logistic regressions were fitted to analyze the data. Results: Individuals with higher educational attainment and those with higher income (who lived out of poverty) had lower odds of CLDs. Race and ethnicity showed statistically significant interactions with educational attainment and income, suggesting that the protective effects of high education and income on reducing odds of CLDs were smaller for Blacks and Hispanics than for non-Hispanic Whites. Conclusions: Education and income reduce the risk of CLDs more among Whites than among Hispanics and Blacks. This implies a disproportionately high risk of CLDs in Hispanics and Blacks with high SES. Future research should test if high levels of environmental risk factors contribute to the high risk of CLDs in high income and highly educated Black and Hispanic Americans. Policy makers should not reduce health inequalities to SES gaps because disparities persist across SES levels, with high SES Blacks and Hispanics remaining at risk of health problems.
