Dataset Information

Predicting neighborhoods' socioeconomic attributes using restaurant data.

ABSTRACT: Accessing high-resolution, timely socioeconomic data such as data on population, employment, and enterprise activity at the neighborhood level is critical for social scientists and policy makers to design and implement location-based policies. However, in many developing countries or cities, reliable local-scale socioeconomic data remain scarce. Here, we show that an easily accessible and timely updated location attribute (restaurants) can be used to accurately predict a range of socioeconomic attributes of urban neighborhoods. We merge restaurant data from an online platform with 3 microdatasets for 9 Chinese cities. Using features extracted from restaurants, we train machine-learning models to estimate daytime and nighttime population, number of firms, and consumption level at various spatial resolutions. The trained model can explain 90 to 95% of the variation of those attributes across neighborhoods in the test dataset. We analyze the tradeoff between accuracy, spatial resolution, and number of training samples, as well as the heterogeneity of the predicted results across different spatial locations, demographics, and firm industries. Finally, we demonstrate the cross-city generality of this method by training the model in one city and then applying it directly to other cities. The transferability of this restaurant model can help bridge data gaps between cities, allowing all cities to enjoy big data and algorithm dividends.

SUBMITTER: Dong L

PROVIDER: S-EPMC6681720 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

Publications

Predicting neighborhoods' socioeconomic attributes using restaurant data.

Dong L, Ratti C, Zheng S

Proceedings of the National Academy of Sciences of the United States of America, 2019-07-15, issue 31

PMID: 31308232

Similar Datasets

Project description: Species establishment within a community depends on their interactions with the local environment and resident community. Such environmental and biotic filtering is frequently inferred from functional trait and phylogenetic patterns within communities; these patterns may also predict which additional species can establish. However, differentiating between environmental and biotic filtering can be challenging, which may complicate establishment predictions. Creating a habitat-specific species pool by identifying which absent species within the region can establish in the focal habitat allows us to isolate biotic filtering by modeling dissimilarity between the observed and biotically excluded species able to pass environmental filters. Similarly, modeling the dissimilarity between the habitat-specific species pool and the environmentally excluded species within the region can isolate local environmental filters. Combined, these models identify potentially successful phenotypes and why certain phenotypes were unsuccessful. Here, we present a framework that uses the functional dissimilarity among these groups in logistic models to predict establishment of additional species. This approach can use multivariate trait distances and phylogenetic information, but is most powerful when using individual traits and their interactions. It also requires an appropriate distance-based dissimilarity measure, yet the two most commonly used indices, nearest neighbor (one species) and mean pairwise (all species) distances, may inaccurately predict establishment. By iteratively increasing the number of species used to measure dissimilarity, a functional neighborhood can be chosen that maximizes the detection of underlying trait patterns. We tested this framework using two seed addition experiments in calcareous grasslands. Although the functional neighborhood size that best fits the community's trait structure depended on the type of filtering considered, selecting these functional neighborhood sizes allowed our framework to predict up to 50% of the variation in actual establishment from seed. These results indicate that the proposed framework may be a powerful tool for studying and predicting species establishment.

Project description: Over recent decades, machine learning, an integral subfield of artificial intelligence, has revolutionized diverse sectors, enabling data-driven decisions with minimal human intervention. In particular, the field of educational assessment emerges as a promising area for machine learning applications, where students can be classified and diagnosed using their performance data. The objectives of Diagnostic Classification Models (DCMs), which provide a suite of methods for diagnosing students' cognitive states in relation to the mastery of necessary cognitive attributes for solving problems in a test, can be effectively addressed through machine learning techniques. However, the challenge lies in the latent nature of cognitive status, which makes it difficult to obtain labels for the training dataset. Consequently, the application of machine learning methods to DCMs often assumes smaller training sets with labels derived either from theoretical considerations or human experts. In this study, the authors propose a supervised diagnostic classification model with data augmentation (SDCM-DA). This method is designed to utilize the augmented data using a data generation model constructed by leveraging the probability of correct responses for each attribute mastery pattern derived from the expert-labeled dataset. To explore the benefits of data augmentation, a simulation study is carried out, contrasting it with classification methods that rely solely on the expert-labeled dataset for training. The findings reveal that utilizing data augmentation with the estimated probabilities of correct responses substantially enhances classification accuracy. This holds true even when the augmentation originates from a small labeled sample with occasional labeling errors, and when the tests contain lower-quality items that may inaccurately measure students' true cognitive status. Moreover, the study demonstrates that leveraging augmented data for learning can enable the successful classification of students, thereby eliminating the necessity for specifying an underlying response model.

Project description: Background: Ideally, health services and interventions to improve dental health should be tailored to local target populations. But this is not the standard. Little is known about risk clusters in dental health care and their evaluation based on small-scale, spatial data, particularly among under-represented groups in health surveys. Our study aims to investigate the incidence rates of major oral diseases among privately insured and self-paying individuals in Germany, explore the spatial clustering of these diseases, and evaluate the influence of social determinants on oral disease risk clusters using advanced data analysis techniques, i.e. machine learning. Methods: A retrospective cohort study was performed to calculate the age- and sex-standardized incidence rate of oral diseases in a study population of privately insured and self-pay patients in Germany who received dental treatment between 2016 and 2021. This was based on anonymized claims data from BFS health finance, Bertelsmann, Dortmund, Germany. The disease history of individuals was recorded and aggregated at the ZIP code 5 level (n = 8871). Results: Statistically significant, spatially compact clusters and relative risks (RR) of incidence rates were identified. By linking disease and socioeconomic databases on the ZIP-5 level, local risk models for each disease were estimated based on spatial-neighborhood variables using different machine learning models. We found that dental diseases were spatially clustered among privately insured and self-payer patients in Germany. Incidence rates within clusters were significantly elevated compared to incidence rates outside clusters. The relative risks (RR) for a new dental disease in primary risk clusters were min = 1.3 (irreversible pulpitis; 95%-CI = 1.3-1.3) and max = 2.7 (periodontitis; 95%-CI = 2.6-2.8), depending on the disease. Despite some similarity in the importance of variables from machine learning models across different clusters, each cluster is unique and must be treated as such when addressing oral public health threats. Conclusions: Our study analyzed the incidence of major oral diseases in Germany and employed spatial methods to identify and characterize high-risk clusters for targeted interventions. We found that private claims data, combined with a network-based, data-driven approach, can effectively pinpoint areas and factors relevant to oral healthcare, including socioeconomic determinants like income and occupational status. The methodology presented here enables the identification of disease clusters of greatest demand, which would allow implementing more targeted approaches and improve access to quality care where they can have the most impact.

Project description: Introduction: Hungary has a single payer health insurance system offering free healthcare for acute cerebrovascular disorders. Within the capital, Budapest, however, there are considerable microregional socioeconomic differences. We hypothesized that socioeconomic deprivation reflects in less favorable stroke characteristics despite universal access to care. Methods: From the database of the National Health Insurance Fund, we identified 4779 patients hospitalized between 2002 and 2007 for acute cerebrovascular disease (hereafter ACV, i.e. ischemic stroke, intracerebral hemorrhage, or transient ischemia), among residents of the poorest (District 8, n = 2618) and the wealthiest (District 12, n = 2161) neighborhoods of Budapest. Follow-up was until March 2013. Results: Mean age at onset of ACV was 70±12 and 74±12 years for District 8 and 12 (p<0.01). Age-standardized incidence was higher in District 8 than in District 12 (680/100,000/year versus 518/100,000/year for ACV and 486/100,000/year versus 259/100,000/year for ischemic stroke). Age-standardized mortality of ACV overall and of ischemic stroke specifically was 157/100,000/year versus 100/100,000/year and 122/100,000/year versus 75/100,000/year for District 8 and 12. Long-term case fatality (at 1, 5, and 10 years) for ACV and for ischemic stroke was higher in younger District 8 residents (41-70 years of age at the index event) compared to District 12 residents of the same age. This gap between the districts increased with the length of follow-up. Of the risk diseases, the prevalence of hypertension and diabetes was higher in District 8 than in District 12 (75% versus 66%, p<0.001; and 26% versus 16%, p<0.001). Discussion: Despite universal healthcare coverage, the disadvantaged district has higher ACV incidence and mortality than the wealthier neighborhood. This difference affects primarily the younger age groups. Long-term follow-up data suggest that inequity in institutional rehabilitation and home-care should be investigated and improved in disadvantaged neighborhoods.

Project description: Importance: Identifying early childhood behavioral problems associated with economic success/failure is essential for the development of targeted interventions that enhance economic prosperity through improved educational attainment and social integration. Objective: To test the association between kindergarten teacher-rated assessments of inattention, hyperactivity, opposition, aggression, and prosociality in boys with their employment earnings at age 35 to 36 years as measured by government tax return data. Design, setting, and participants: A 30-year prospective follow-up study analyzing low socioeconomic neighborhoods in Montreal, Quebec, Canada. Boys aged 5 to 6 years attending kindergarten in low socioeconomic neighborhoods were recruited. Teacher-rated behavioral assessments were obtained for 1040 boys. Data were collected from April 1984 to December 2015. Analysis began January 2017. Main outcomes and measures: Mixed-effects linear regression models were used to examine the association between teacher ratings of inattention, hyperactivity, opposition, aggression, and prosociality at age 6 years and individual earnings obtained from government tax returns at age 35 to 36 years. The IQ of the child and family adversity were adjusted for in the analysis. Results: Complete data were available for 920 study participants (mean age at follow-up was 36.3 years). Mean (SD) personal earnings at follow-up were $28 865.53 ($24 103.45) (range, $0-$142 267.84). A 1-unit increase in inattention (mean [SD], 2.66 [2.34]; range, 0-8) at age 6 years was associated with decrease in earnings at age 35 to 36 years of $1295.13 (95% CI, -$2051.65 to -$538.62), while a unit increase in prosociality (mean [SD], 8.0 [4.96]; range, 0-20) was associated with an increase in earnings of $406.15 (95% CI, $172.54-$639.77). Hyperactivity, opposition, and aggression were not significantly associated with earnings. Child IQ was associated with higher earnings and family adversity with lower earnings in all models. A 1-SD reduction in inattention at age 6 years was associated with a theoretical increase in annual earnings of $3040.41, a similar magnitude to an equivalent increase in IQ. Conclusions and relevance: Teacher ratings of inattention and prosociality in kindergarten boys from low socioeconomic neighborhoods are associated with earnings in adulthood after adjustment for hyperactivity, aggression, and opposition, which were not associated with earnings. Interventions beginning in kindergarten that target boys' inattention and enhance prosociality could positively impact workforce integration and earnings.
