Dataset Information

Comparison of large networks with sub-sampling strategies.

ABSTRACT: Networks are routinely used to represent large data sets, making the comparison of networks a tantalizing research question in many areas. Techniques for such analysis vary from simply comparing network summary statistics to sophisticated but computationally expensive alignment-based approaches. Most existing methods either do not generalize well to different types of networks or do not provide a quantitative similarity score between networks. In contrast, alignment-free topology based network similarity scores empower us to analyse large sets of networks containing different types and sizes of data. Netdis is such a score that defines network similarity through the counts of small sub-graphs in the local neighbourhood of all nodes. Here, we introduce a sub-sampling procedure based on neighbourhoods which links naturally with the framework of network comparisons through local neighbourhood comparisons. Our theoretical arguments justify basing the Netdis statistic on a sample of similar-sized neighbourhoods. Our tests on empirical and synthetic datasets indicate that often only 10% of the neighbourhoods of a network suffice for optimal performance, leading to a drastic reduction in computational requirements. The sampling procedure is applicable even when only a small sample of the network is known, and thus provides a novel tool for network comparison of very large and potentially incomplete datasets.
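The neighbourhood sub-sampling idea in the abstract can be illustrated with a minimal sketch. This is not the Netdis statistic itself: it only samples a fraction of nodes, extracts each sampled node's local neighbourhood, and counts one kind of small sub-graph (triangles) in it. The function names, the adjacency-dictionary representation, the two-step radius, and the uniform node sampling are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import deque
from itertools import combinations


def ego_nodes(adj, center, radius=2):
    """Return the set of nodes within `radius` steps of `center` (BFS)."""
    seen = {center}
    queue = deque([(center, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue  # do not expand beyond the chosen radius
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return seen


def count_triangles(adj, nodes):
    """Count triangles in the sub-graph induced by `nodes`.

    Each triangle is counted once, at its smallest node label,
    so node labels must be mutually comparable (e.g. integers).
    """
    total = 0
    for u in nodes:
        larger = [v for v in adj[u] if v in nodes and v > u]
        for v, w in combinations(larger, 2):
            if w in adj[v]:
                total += 1
    return total


def sampled_neighbourhood_counts(adj, fraction=0.1, radius=2, seed=0):
    """Triangle counts for the neighbourhoods of a random sample of nodes.

    `fraction` is the share of nodes whose neighbourhoods are examined
    (hypothetical default of 10%, echoing the figure in the abstract).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    k = max(1, round(fraction * len(nodes)))
    centers = rng.sample(nodes, k)
    return {c: count_triangles(adj, ego_nodes(adj, c, radius)) for c in centers}
```

Here `adj` maps each node to the set of its neighbours. A comparison score in the spirit of the abstract would combine counts of several small sub-graph types across neighbourhoods and compare them between networks; that step is deliberately left out of this sketch.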

SUBMITTER: Ali W

PROVIDER: S-EPMC4933923 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description:

Introduction: In the United States, tens of thousands of inspections of tobacco retailers are conducted each year. Various sampling choices can reduce travel costs, emphasize enforcement in areas with greater noncompliance, and allow for comparability between states and over time. We sought to develop a model sampling strategy for state tobacco retailer inspections.

Methods: Using a 2014 list of 10,161 North Carolina tobacco retailers, we compared results from simple random sampling; stratified sampling clustered at the ZIP code level; and stratified sampling clustered at the census tract level. We simulated repeated sampling and compared the approaches on precision, coverage, and retailer dispersion.

Results: While maintaining an adequate design effect and statistical precision appropriate for a public health enforcement program, both the stratified, clustered ZIP-based and tract-based approaches were feasible. Both ZIP and tract strategies yielded improvements over simple random sampling. The relative changes, respectively, were: average distance between retailers (reduced 5.0% and 1.9%), percent Black residents in sampled neighborhoods (increased 17.2% and 32.6%), percent Hispanic residents in sampled neighborhoods (reduced 2.2% and increased 18.3%), percentage of sampled retailers located near schools (increased 61.3% and 37.5%), and poverty rate in sampled neighborhoods (increased 14.0% and 38.2%).

Conclusions: States can make retailer inspections more efficient and targeted with stratified, clustered sampling. Use of statistically appropriate sampling strategies like these should be considered by states, researchers, and the Food and Drug Administration to improve program impact and allow for comparisons over time and across states.

Implications: The authors present a model tobacco retailer sampling strategy for promoting compliance and reducing costs that could be used by US states and the Food and Drug Administration (FDA). The design is feasible to implement in North Carolina. Use of the sampling design would help document the impact of FDA's compliance and enforcement program, save money, and emphasize inspections in areas where they are needed most. FDA should consider requiring probability-based sampling in its inspection contracts with states and private contractors.
