Dataset Information

Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).

ABSTRACT:

Objective

To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses.

Materials and methods

Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the two data sets was evaluated statistically and qualitatively.
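One way to picture the stratified comparison described above is the following minimal sketch. It is not the authors' method: the record layout (zip code, test date, positive flag), the function names, and the use of a mean absolute difference as the similarity summary are all assumptions made for illustration, and the records are toy values.

```python
from collections import Counter
from datetime import date


def monthly_counts(records):
    """Count positive tests per (zip_code, 'YYYY-MM') stratum."""
    counts = Counter()
    for zip_code, test_date, positive in records:
        if positive:
            counts[(zip_code, test_date.strftime("%Y-%m"))] += 1
    return counts


def compare_counts(original, synthetic):
    """Pair the stratum counts of both data sets and summarize their gap.

    Returns a dict of stratum -> (original count, synthetic count) and the
    mean absolute difference across strata (an assumed summary metric).
    """
    orig_counts = monthly_counts(original)
    synth_counts = monthly_counts(synthetic)
    strata = sorted(set(orig_counts) | set(synth_counts))
    # Counter returns 0 for strata that are missing from one data set.
    paired = {s: (orig_counts[s], synth_counts[s]) for s in strata}
    if not paired:
        return paired, 0.0
    mad = sum(abs(o - s) for o, s in paired.values()) / len(paired)
    return paired, mad


# Toy records (hypothetical): (zip code, test date, positive result)
original = [
    ("27514", date(2020, 4, 3), True),
    ("27514", date(2020, 4, 9), True),
    ("27514", date(2020, 5, 1), False),
    ("10001", date(2020, 4, 20), True),
]
synthetic = [
    ("27514", date(2020, 4, 5), True),
    ("27514", date(2020, 5, 2), True),
    ("10001", date(2020, 4, 18), True),
]

paired, mad = compare_counts(original, synthetic)
for stratum, (n_orig, n_synth) in paired.items():
    print(stratum, n_orig, n_synth)
print("mean absolute difference:", round(mad, 3))
```

Keying the counts on the union of strata keeps zip code and month combinations that appear in only one data set visible in the comparison, which matters when labels are suppressed in the synthetic derivative.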

Results

In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n=171) and for all unsuppressed zip codes (n=5,819), respectively. With small sample sizes, synthetic data utility was notably lower.
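The effect of label suppression on the number of unique zip codes can be illustrated with a small sketch. The rule below (replace the label of any zip code with fewer than `min_tests` tests) is a simplifying assumption and not the procedure of the synthetic data generator used in the study; the threshold, the `SUPPRESSED` placeholder, and the function names are hypothetical.

```python
from collections import Counter

SUPPRESSED = "SUPPRESSED"  # hypothetical placeholder label


def suppress_sparse_zip_codes(zip_codes, min_tests):
    """Replace labels of zip codes with fewer than min_tests tests.

    The test rows are kept; only the geographic label is masked.
    """
    totals = Counter(zip_codes)
    return [z if totals[z] >= min_tests else SUPPRESSED for z in zip_codes]


def unique_zip_reduction(original, masked):
    """Fractional reduction in the number of unique, unsuppressed zip codes."""
    before = len(set(original))
    after = len(set(masked) - {SUPPRESSED})
    return 1 - after / before


# Toy data (hypothetical): one zip code label per test
tests = ["27514"] * 5 + ["10001"] * 4 + ["59001"] * 2 + ["99501"]
masked = suppress_sparse_zip_codes(tests, min_tests=3)
reduction = unique_zip_reduction(tests, masked)
print(f"suppressed rows: {masked.count(SUPPRESSED)}, "
      f"unique zip code reduction: {reduction:.0%}")
```

In this toy case two of four zip codes lose their labels while only three of twelve tests are affected, the same pattern the results describe: a large drop in unique zip codes with most of the data left labeled.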

Discussion

Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived data sets. Analyses of sparsely tested populations were less similar and had more data suppression.

Conclusion

In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.

SUBMITTER: Thomas JA

PROVIDER: S-EPMC8282114 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description:

Importance

SARS-CoV-2.

Objective

To determine the characteristics, changes over time, outcomes, and severity risk factors of SARS-CoV-2 affected children within the National COVID Cohort Collaborative (N3C).

Design

Prospective cohort study of patient encounters with end dates before May 27th, 2021.

Setting

45 N3C institutions.

Participants

Children <19-years-old at initial SARS-CoV-2 testing.

Main outcomes and measures

Case incidence and severity over time, demographic and comorbidity severity risk factors, vital sign and laboratory trajectories, clinical outcomes, and acute COVID-19 vs MIS-C contrasts for children infected with SARS-CoV-2.

Results

728,047 children in the N3C were tested for SARS-CoV-2; of these, 91,865 (12.6%) were positive. Among the 5,213 (6%) hospitalized children, 685 (13%) met criteria for severe disease: mechanical ventilation (7%), vasopressor/inotropic support (7%), ECMO (0.6%), or death/discharge to hospice (1.1%). Male gender, African American race, older age, and several pediatric complex chronic condition (PCCC) subcategories were associated with higher clinical severity (p ≤ 0.05). Vital signs (all p≤0.002) and many laboratory tests from the first day of hospitalization were predictive of peak disease severity. Children with severe (vs moderate) disease were more likely to receive antimicrobials (71% vs 32%, p<0.001) and immunomodulatory medications (53% vs 16%, p<0.001). Compared to those with acute COVID-19, children with MIS-C were more likely to be male, Black/African American, 1-to-12-years-old, and less likely to have asthma, diabetes, or a PCCC (p < 0.04). MIS-C cases demonstrated a more inflammatory laboratory profile and more severe clinical phenotype with higher rates of invasive ventilation (12% vs 6%) and need for vasoactive-inotropic support (31% vs 6%) compared to acute COVID-19 cases, respectively (p<0.03).

Conclusions

In the largest U.S. SARS-CoV-2-positive pediatric cohort to date, we observed differences in demographics, pre-existing comorbidities, and initial vital sign and laboratory test values between severity subgroups. Taken together, these results suggest that early identification of children likely to progress to severe disease could be achieved using readily available data elements from the day of admission. Further work is needed to translate this knowledge into improved outcomes.

Project description:

Rationale

Nontuberculous mycobacteria (NTM) are ubiquitous environmental microorganisms. Infection is thought to result primarily from exposure to soil and/or water sources. NTM disease prevalence varies greatly by geographic region, but the geospatial factors influencing this variation remain unclear.

Objectives

To identify sociodemographic and environmental ecological risk factors associated with NTM infection and disease in Colorado.

Methods

We conducted an ecological study, combining data from patients with a diagnosis of NTM disease from National Jewish Health's electronic medical record database and ZIP code-level sociodemographic and environmental exposure data obtained from the U.S. Geological Survey, the U.S. Department of Agriculture, and the U.S. Census Bureau. We used spatial scan methods to identify high-risk clusters of NTM disease in Colorado. Ecological risk factors for disease were assessed using Bayesian generalized linear models assuming Poisson-distributed discrete responses (case counts by ZIP code) with the log link function.

Results

We identified two statistically significant high-risk clusters of disease. The primary cluster included ZIP codes in urban regions of Denver and Aurora, as well as regions south of Denver, on the east side of the Continental Divide. The secondary cluster was located on the west side of the Continental Divide in rural and mountainous regions. After adjustment for sociodemographic, drive time, and soil variables, we identified three watershed areas with relative risks of 12.2, 4.6, and 4.2 for slowly growing NTM infections compared with the mean disease risk for all watersheds in Colorado. This study population carries with it inherent limitations that may introduce bias. The lack of complete capture of NTM cases in Colorado may be related to factors such as disease severity, education and income levels, and insurance status.

Conclusions

Our findings provide evidence that water derived from particular watersheds may be an important source of NTM exposure in Colorado. The watershed with the greatest risk of NTM disease contains the Dillon Reservoir. This reservoir is also the main water supply for major cities located in the two watersheds with the second and third highest disease risk in the state, suggesting an important possible source of infection.