Dataset Information

UNEXPECTED PROPERTIES OF BANDWIDTH CHOICE WHEN SMOOTHING DISCRETE DATA FOR CONSTRUCTING A FUNCTIONAL DATA CLASSIFIER.

ABSTRACT: The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.

SUBMITTER: Carroll RJ

PROVIDER: S-EPMC4191932 | biostudies-literature | 2013 Dec

REPOSITORIES: biostudies-literature

Publications

UNEXPECTED PROPERTIES OF BANDWIDTH CHOICE WHEN SMOOTHING DISCRETE DATA FOR CONSTRUCTING A FUNCTIONAL DATA CLASSIFIER.

Raymond J. Carroll, Aurore Delaigle, Peter Hall

Annals of Statistics, 2013-12-01, issue 6

PMID: 25309640

Similar Datasets

Project description: Background: Digital technological development in the last 20 years has led to significant growth in digital collection, use, and sharing of health data. To maintain public trust in the digital society and to enable acceptable policy-making in the future, it is important to investigate people's preferences for sharing digital health data. Objective: The aim of this study is to elicit the preferences of the public in different Northern European countries (the United Kingdom, Norway, Iceland, and Sweden) for sharing health information in different contexts. Methods: Respondents in this discrete choice experiment completed several choice tasks, in which they were asked if data sharing in the described hypothetical situation was acceptable to them. Latent class logistic regression models were used to determine attribute-level estimates and heterogeneity in preferences. We calculated the relative importance of the attributes and the predicted acceptability for different contexts in which the data were shared from the estimates. Results: In the final analysis, we used 37.83% (1967/5199) of the questionnaires. All attributes influenced the respondents' willingness to share health information (P<.001). The most important attribute was whether the respondents were informed about their data being shared. The possibility of opting out from sharing data was preferred over the opportunity to consent (opt-in). Four classes were identified in the latent class model, and the average probabilities of belonging were 27% for class 1, 32% for class 2, 23% for class 3, and 18% for class 4. The uptake probability varied between 14% and 85%, depending on the least to most preferred combination of levels. Conclusions: Respondents from different countries have different preferences for sharing their health data regarding the value of a review process and the reason for their new use. Offering respondents information about the use of their data and the possibility to opt out is the most preferred governance mechanism.

Project description: BACKGROUND: The use of shotgun metagenomics to analyse low-complexity microbial communities in foods has the potential to be of considerable fundamental and applied value. However, there is currently no consensus with respect to choice of species classification tool, platform, or sequencing depth. Here, we benchmarked the performances of three high-throughput short-read sequencing platforms, the Illumina MiSeq, NextSeq 500, and Ion Proton, for shotgun metagenomics of food microbiota. Briefly, we sequenced six kefir DNA samples and a mock community DNA sample, the latter constructed by evenly mixing genomic DNA from 13 food-related bacterial species. A variety of bioinformatic tools were used to analyse the data generated, and the effects of sequencing depth on these analyses were tested by randomly subsampling reads. RESULTS: Compositional analysis results were consistent between the platforms at divergent sequencing depths. However, we observed pronounced differences in the predictions from species classification tools. Indeed, PERMANOVA indicated that there were no significant differences between the compositional results generated by the different sequencers (p = 0.693, R2 = 0.011), but there was a significant difference between the results predicted by the species classifiers (p = 0.01, R2 = 0.127). The relative abundances predicted by the classifiers, apart from MetaPhlAn2, were apparently biased by reference genome sizes. Additionally, we observed varying false-positive rates among the classifiers. MetaPhlAn2 had the lowest false-positive rate, whereas SLIMM had the greatest false-positive rate. Strain-level analysis results were also similar across platforms. Each platform correctly identified the strains present in the mock community, but accuracy was improved slightly with greater sequencing depth. Notably, PanPhlAn detected the dominant strains in each kefir sample above 500,000 reads per sample. Again, the outputs from functional profiling analysis using SUPER-FOCUS were generally accordant between the platforms at different sequencing depths. Finally, and expectedly, metagenome assembly completeness was significantly lower on the MiSeq than on either the NextSeq (p = 0.03) or the Proton (p = 0.011), and it improved with increased sequencing depth. CONCLUSIONS: Our results demonstrate a remarkable similarity in the results generated by the three sequencing platforms at different sequencing depths, and, in fact, the choice of bioinformatics methodology had a more evident impact on results than the choice of sequencer did.

Project description: Multilevel functional data is collected in many biomedical studies. For example, in a study of the effect of Nimodipine on patients with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hour treatment cycles. Within each treatment cycle, subjects' vital signs were reported every 10 minutes. This data has a natural multilevel structure with treatment cycles nested within subjects and measurements nested within cycles. Most literature on nonparametric analysis of such multilevel functional data focuses on conditional approaches using functional mixed effects models. However, parameters obtained from the conditional models do not have direct interpretations as population average effects. When population effects are of interest, we may employ marginal regression models. In this work, we propose marginal approaches to fit multilevel functional data through penalized spline generalized estimating equation (penalized spline GEE). The procedure is effective for modeling multilevel correlated generalized outcomes as well as continuous outcomes without suffering from numerical difficulties. We provide a variance estimator robust to misspecification of correlation structure. We investigate the large sample properties of the penalized spline GEE estimator with multilevel continuous data and show that the asymptotics falls into two categories. In the small knots scenario, the estimated mean function is asymptotically efficient when the true correlation function is used and the asymptotic bias does not depend on the working correlation matrix. In the large knots scenario, both the asymptotic bias and variance depend on the working correlation. We propose a new method to select the smoothing parameter for penalized spline GEE based on an estimate of the asymptotic mean squared error (MSE). We conduct extensive simulation studies to examine the properties of the proposed estimator under different correlation structures and the sensitivity of the variance estimation to the choice of smoothing parameter. Finally, we apply the methods to the SAH study to evaluate a recent debate on discontinuing the use of Nimodipine in the clinical community.
