Dataset Information

Early detection of COVID-19 in the UK using self-reported symptoms: a large-scale, prospective, epidemiological surveillance study.

ABSTRACT:

Background

Self-reported symptoms during the COVID-19 pandemic have been used to train artificial intelligence models to identify possible infection foci. To date, these models have only considered the culmination or peak of symptoms, which is not suitable for the early detection of infection. We aimed to estimate the probability of an individual being infected with SARS-CoV-2 on the basis of early self-reported symptoms to enable timely self-isolation and urgent testing.

Methods

In this large-scale, prospective, epidemiological surveillance study, we used prospective, observational, longitudinal, self-reported data from participants in the UK on 19 symptoms over 3 days after symptom onset and COVID-19 PCR test results extracted from the COVID-19 Symptom Study mobile phone app. We divided the study population into a training set (those who reported symptoms between April 29, 2020, and Oct 15, 2020) and a test set (those who reported symptoms between Oct 16, 2020, and Nov 30, 2020), and used three models to analyse the self-reported symptoms: the UK's National Health Service (NHS) algorithm, logistic regression, and the hierarchical Gaussian process model we designed to account for several important variables (eg, specific COVID-19 symptoms, comorbidities, and clinical information). Model performance to predict COVID-19 positivity was compared in terms of sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) in the test set. For the hierarchical Gaussian process model, we also evaluated the relevance of symptoms in the early detection of COVID-19 in population subgroups stratified according to occupation, sex, age, and body-mass index.
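
The evaluation design described above (a date-based split, then sensitivity, specificity, and AUC on the held-out period) can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`reported`, `pcr_positive`, `score`), the 0.5 decision threshold, and the toy values are assumptions made purely for illustration, and the sketch does not enforce the April 29 and Nov 30 outer bounds given in the abstract.

```python
from datetime import date

# Cut-off taken from the abstract: the training period ends on Oct 15, 2020.
TRAIN_END = date(2020, 10, 15)

def temporal_split(records):
    """Split records into (train, test) by the date symptoms were reported."""
    train = [r for r in records if r["reported"] <= TRAIN_END]
    test = [r for r in records if r["reported"] > TRAIN_END]
    return train, test

def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of the ROC area).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic records, invented for this example only.
records = [
    {"reported": date(2020, 6, 1), "pcr_positive": 1, "score": 0.8},
    {"reported": date(2020, 9, 30), "pcr_positive": 0, "score": 0.2},
    {"reported": date(2020, 10, 20), "pcr_positive": 1, "score": 0.9},
    {"reported": date(2020, 11, 2), "pcr_positive": 1, "score": 0.4},
    {"reported": date(2020, 11, 10), "pcr_positive": 0, "score": 0.6},
    {"reported": date(2020, 11, 25), "pcr_positive": 0, "score": 0.1},
]

train, test = temporal_split(records)
y_test = [r["pcr_positive"] for r in test]
s_test = [r["score"] for r in test]
print(sensitivity_specificity(y_test, s_test), auc(y_test, s_test))
```

Splitting by date rather than at random keeps later reports out of training, which is what allows the test period to act as a check on how well a model carries over to a new phase of the pandemic.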

Findings

The training set comprised 182 991 participants and the test set comprised 15 049 participants. When trained on 3 days of self-reported symptoms, the hierarchical Gaussian process model had a higher prediction AUC (0·80 [95% CI 0·80-0·81]) than did the logistic regression model (0·74 [0·74-0·75]) and the NHS algorithm (0·67 [0·67-0·67]). AUCs for all models increased with the number of days of self-reported symptoms, but were still high for the hierarchical Gaussian process model at day 1 (0·73 [95% CI 0·73-0·74]) and day 2 (0·79 [0·78-0·79]). At day 3, the hierarchical Gaussian process model also had a significantly higher sensitivity, but a non-significantly lower specificity, than did the two other models. The hierarchical Gaussian process model also identified different sets of relevant features to detect COVID-19 between younger and older subgroups, and between health-care workers and non-health-care workers. When used during different pandemic periods, the model was robust to changes in populations.

Interpretation

Early detection of SARS-CoV-2 infection is feasible with our model. Such early detection is crucial to contain the spread of COVID-19 and efficiently allocate medical resources.

Funding

ZOE, the UK Government Department of Health and Social Care, the Wellcome Trust, the UK Engineering and Physical Sciences Research Council, the UK National Institute for Health Research, the UK Medical Research Council, the British Heart Foundation, the Alzheimer's Society, the Chronic Disease Research Foundation, and the Massachusetts Consortium on Pathogen Readiness.

SUBMITTER: Canas LS

PROVIDER: S-EPMC8321433 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Seasonal influenza surveillance is usually carried out by sentinel general practitioners (GPs) who compile weekly reports based on the number of influenza-like illness (ILI) clinical cases observed among visited patients. This traditional practice for surveillance generally presents several issues, such as a delay of one week or more in releasing reports, population biases in health-seeking behaviour, and the lack of a common definition of an ILI case. On the other hand, the availability of novel data streams has recently led to the emergence of non-traditional approaches for disease surveillance that can alleviate these issues. In Europe, a participatory web-based surveillance system called Influenzanet represents a powerful tool for monitoring seasonal influenza epidemics thanks to the aid of self-selected volunteers from the general population who monitor and report their health status through Internet-based surveys, thus allowing a real-time estimate of the level of influenza circulating in the population. In this work, we propose an unsupervised probabilistic framework that combines time series analysis of self-reported symptoms collected by the Influenzanet platforms and performs an algorithmic detection of groups of symptoms, called syndromes. The aim of this study is to show that participatory web-based surveillance systems are capable of detecting the temporal trends of influenza-like illness even without relying on a specific case definition. The methodology was applied to data collected by Influenzanet platforms over the course of six influenza seasons, from 2011-2012 to 2016-2017, with an average of 34,000 participants per season. 
Results show that our framework is capable of selecting temporal trends of syndromes that closely follow the ILI incidence rates reported by the traditional surveillance systems in the various countries (Pearson correlations ranging from 0.69 for Italy to 0.88 for the Netherlands, with the sole exception of Ireland with a correlation of 0.38). The proposed framework was able to forecast quite accurately the ILI trend of the forthcoming influenza season (2016-2017) based only on the available information of the previous years (2011-2016). Furthermore, to broaden the scope of our approach, we applied it both in a forecasting fashion to predict the ILI trend of the 2016-2017 influenza season (Pearson correlations ranging from 0.60 for Ireland and the UK to 0.85 for the Netherlands) and also to detect gastrointestinal syndrome in France (Pearson correlation of 0.66). The final result is a near-real-time flexible surveillance framework not constrained by any specific case definition and capable of capturing the heterogeneity in symptom circulation during influenza epidemics in the various European countries.
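
The agreement figures quoted in this description are Pearson correlations between a syndrome time series and the ILI incidence reported by traditional surveillance. As a rough illustration of that comparison only (it is not the study's pipeline), the sketch below computes the coefficient for two weekly series; the series names and values are invented for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)

# Hypothetical weekly values: a detected syndrome trend and sentinel ILI incidence.
syndrome_trend = [1.0, 2.5, 4.0, 3.0, 1.5]
sentinel_ili = [10.0, 24.0, 41.0, 33.0, 14.0]

print(round(pearson(syndrome_trend, sentinel_ili), 3))
```

Because the coefficient is unaffected by linear rescaling, the two series do not need to share units, which suits a comparison between participatory reports and sentinel incidence rates.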

Project description: Background: As many countries seek to slow the spread of COVID-19 without reimposing national restrictions, it has become important to track the disease at a local level to identify areas in need of targeted intervention. Methods: In this prospective, observational study, we did modelling using longitudinal, self-reported data from users of the COVID Symptom Study app in England between March 24, and Sept 29, 2020. Beginning on April 28, in England, the Department of Health and Social Care allocated RT-PCR tests for COVID-19 to app users who logged themselves as healthy at least once in 9 days and then reported any symptom. We calculated incidence of COVID-19 using the invited swab (RT-PCR) tests reported in the app, and we estimated prevalence using a symptom-based method (using logistic regression) and a method based on both symptoms and swab test results. We used incidence rates to estimate the effective reproduction number, R(t), modelling the system as a Poisson process and using Markov Chain Monte-Carlo. We used three datasets to validate our models: the Office for National Statistics (ONS) Community Infection Survey, the Real-time Assessment of Community Transmission (REACT-1) study, and UK Government testing data. We used geographically granular estimates to highlight regions with rapidly increasing case numbers, or hotspots. Findings: From March 24 to Sept 29, 2020, a total of 2 873 726 users living in England signed up to use the app, of whom 2 842 732 (98·9%) provided valid age information and daily assessments. These users provided a total of 120 192 306 daily reports of their symptoms, and recorded the results of 169 682 invited swab tests. On a national level, our estimates of incidence and prevalence showed a similar sensitivity to changes to those reported in the ONS and REACT-1 studies. 
On Sept 28, 2020, we estimated an incidence of 15 841 (95% CI 14 023-17 885) daily cases, a prevalence of 0·53% (0·45-0·60), and R(t) of 1·17 (1·15-1·19) in England. On a geographically granular level, on Sept 28, 2020, we detected 15 (75%) of the 20 regions with highest incidence according to government test data. Interpretation: Our method could help to detect rapid case increases in regions where government testing provision is lower. Self-reported data from mobile applications can provide an agile resource to inform policy makers during a quickly moving pandemic, serving as a complementary resource to more traditional instruments for disease surveillance. Funding: Zoe Global, UK Government Department of Health and Social Care, Wellcome Trust, UK Engineering and Physical Sciences Research Council, UK National Institute for Health Research, UK Medical Research Council and British Heart Foundation, Alzheimer's Society, Chronic Disease Research Foundation.
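
The hotspot result in this description (15 of the 20 highest-incidence regions detected, i.e. 75%) is a top-k agreement between two regional rankings. The description does not state the exact detection rule, so the sketch below assumes the simplest reading: a region counts as detected if it appears in the top k of both the app-based estimates and the reference data. The function names, region labels, and values are hypothetical.

```python
def top_k(incidence_by_region, k):
    """Return the set of the k regions with the highest incidence."""
    if k > len(incidence_by_region):
        raise ValueError("k cannot exceed the number of regions")
    ranked = sorted(incidence_by_region, key=incidence_by_region.get, reverse=True)
    return set(ranked[:k])

def hotspot_agreement(app_estimates, reference, k):
    """Fraction of the reference top-k regions that also rank top-k in the app data."""
    detected = top_k(reference, k) & top_k(app_estimates, k)
    return len(detected) / k

# Invented regional incidence values, used only to exercise the functions.
app_estimates = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 0.5}
reference = {"A": 9.0, "C": 8.0, "B": 2.0, "D": 1.0}

print(hotspot_agreement(app_estimates, reference, k=2))
```

Under this reading, the reported figure corresponds to an agreement of 15 / 20 = 0.75 with k = 20.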

Project description: Background: UK Biobank is a large prospective cohort study containing accelerometer-based physical activity data with strong validity collected from 100,000 participants approximately 5 years after baseline. In contrast, the main cohort has multiple self-reported physical behaviours from > 500,000 participants with longer follow-up time, offering several epidemiological advantages. However, questionnaire methods typically suffer from greater measurement error, and at present there is no tested method for combining these diverse self-reported data to more comprehensively assess the overall dose of physical activity. This study aimed to use the accelerometry sub-cohort to calibrate the self-reported behavioural variables to produce a harmonised estimate of physical activity energy expenditure, and subsequently examine its reliability, validity, and associations with disease outcomes. Methods: We calibrated 14 self-reported behavioural variables from the UK Biobank main cohort using the wrist accelerometry sub-cohort (n = 93,425), and used published equations to estimate physical activity energy expenditure (PAEE_SR). For comparison, we estimated physical activity based on the scoring criteria of the International Physical Activity Questionnaire, and by summing variables for occupational and leisure-time physical activity with no calibration. Test-retest reliability was assessed using data from the UK Biobank repeat assessment (n = 18,905) collected a mean of 4.3 years after baseline. Validity was assessed in an independent validation study (n = 98) with estimates based on doubly labelled water (PAEE_DLW). In the main UK Biobank cohort (n = 374,352), Cox regression was used to estimate associations between PAEE_SR and fatal and non-fatal outcomes including all-cause, cardiovascular diseases, respiratory diseases, and cancers. Results: PAEE_SR explained 27% variance in gold-standard PAEE_DLW estimates, with no mean bias. 
However, error was strongly correlated with PAEE_DLW (r = -.98; p < 0.001), and PAEE_SR had narrower range than the criterion. Test-retest reliability (Λ = .67) and relative validity (Spearman = .52) of PAEE_SR outperformed two common approaches for processing self-report data with no calibration. Predictive validity was demonstrated by associations with morbidity and mortality, e.g. 14% (95% CI: 11-17%) lower mortality for individuals meeting lower physical activity guidelines. Conclusions: The PAEE_SR variable has good reliability and validity for ranking individuals, with no mean bias but correlated error at individual-level. PAEE_SR outperformed uncalibrated estimates and showed stronger inverse associations with disease outcomes.

Project description: Background: US military engagements have consistently raised concern over the array of health outcomes experienced by service members postdeployment. Exploratory factor analysis has been used in studies of 1991 Gulf War-related illnesses, and may increase understanding of symptoms and health outcomes associated with current military conflicts in Iraq and Afghanistan. The objective of this study was to use exploratory factor analysis to describe the correlations among numerous physical and psychological symptoms in terms of a smaller number of unobserved variables or factors. Methods: The Millennium Cohort Study collects extensive self-reported health data from a large, population-based military cohort, providing a unique opportunity to investigate the interrelationships of numerous physical and psychological symptoms among US military personnel. This study used data from the Millennium Cohort Study. Exploratory factor analysis was used to examine the covariance structure of symptoms reported by approximately 50,000 cohort members during 2004-2006. Analyses incorporated 89 symptoms, including responses to several validated instruments embedded in the questionnaire. Techniques accommodated the categorical and sometimes incomplete nature of the survey data. Results: A 14-factor model accounted for 60 percent of the total variance in symptoms data and included factors related to several physical, psychological, and behavioral constructs. A notable finding was that many factors appeared to load in accordance with symptom co-location within the survey instrument, highlighting the difficulty in disassociating the effects of question content, location, and response format on factor structure. Conclusions: This study demonstrates the potential strengths and weaknesses of exploratory factor analysis to heighten understanding of the complex associations among symptoms. 
Further research is needed to investigate the relationship between factor analytic results and survey structure, as well as to assess the relationship between factor scores and key exposure variables.