Dataset Information

Smoothing splines approximation using Hilbert curve basis selection.

ABSTRACT: Smoothing splines are used pervasively in nonparametric regression. However, their computational burden is significant when the sample size n is large. When the number of predictors d ≥ 2, the computational cost of smoothing splines is of the order O(n^3) using the standard approach. Many methods have been developed to approximate smoothing spline estimators using q basis functions instead of n, resulting in a computational cost of the order O(nq^2). These are called basis selection methods. Despite their algorithmic benefits, most basis selection methods require the assumption that the sample is uniformly distributed on a hypercube, and their performance may deteriorate when this assumption is not met. To overcome this obstacle, we develop an efficient algorithm that is adaptive to the unknown probability density function of the predictors. Theoretically, we show that the proposed estimator has the same convergence rate as the full-basis estimator when q is roughly of the order O[n^{2d/{(pr+1)(d+2)}}], where p ∈ [1, 2] and r ≈ 4 are constants that depend on the type of spline. Numerical studies on various synthetic datasets demonstrate the superior performance of the proposed estimator in comparison with mainstream competitors.
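
To make the orders of magnitude in the abstract concrete, the following is a minimal sketch that evaluates the stated rate for q and compares the two cost orders, O(n^3) and O(nq^2). It is not the authors' algorithm. The function names, the leading constant c (which the O-notation hides), and the particular values p = 2 and r = 4 are assumptions chosen only for illustration.

```python
import math


def basis_size(n, d, p=2.0, r=4.0, c=1.0):
    """Hypothetical helper: q = ceil(c * n^(2d / ((p*r + 1) * (d + 2)))).

    p, r, and c are illustrative assumptions; the abstract only states
    p in [1, 2], r approximately 4, and gives the rate up to a constant.
    """
    exponent = 2 * d / ((p * r + 1) * (d + 2))
    return max(1, math.ceil(c * n ** exponent))


def cost_ratio(n, q):
    """Ratio n^3 / (n * q^2): full-basis cost order over basis-selection
    cost order, with all constants ignored."""
    return n ** 3 / (n * q ** 2)


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        q = basis_size(n, d=2)
        print(f"n={n}, q={q}, n^3/(n q^2)={cost_ratio(n, q):.3g}")
```

Because the rate is asymptotic and stated only up to a constant, the q values this produces are not recommended basis sizes; the sketch only shows how slowly q needs to grow with n under the stated rate.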

SUBMITTER: Meng C

PROVIDER: S-EPMC9674117 | biostudies-literature | 2022

REPOSITORIES: biostudies-literature


Publications

Smoothing splines approximation using Hilbert curve basis selection.

Meng C, Yu J, Chen Y, Zhong W, Ma P

Journal of Computational and Graphical Statistics: a joint publication of the American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. 2022-01-12, issue 3


PMID: 36407675

Similar Datasets

Project description: Background: Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts the downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data. Therefore, it is necessary to estimate the error rates in a more reliable way without assuming linearity. We proposed an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows. Results: We performed simulation studies using a frequency-based approach to generate the read and shadow counts directly, which can mimic the real sequence counts data structure. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimations than the shadow linear regression approach for all the scenarios tested. We also applied the proposed approach to assess the error rates for the sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples. Conclusions: The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimations of error rates for next-generation, short-read sequencing data.

