Dataset Information

A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations.

ABSTRACT:

Background

With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely collected. Finding relationships between response variables of interest and the variables in such data sets is an important problem, akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking.

Results

The major contribution of this paper is a unified methodology which allows many common statistical response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor, for example (but not limited to) multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse, well-fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download.

Conclusion

The method described in this paper enables existing work on response models for the case of fewer variables than observations to be carried over to the case of many more variables than observations. It is a powerful approach to finding parsimonious models for such data sets. The method is capable of handling problems with millions of variables and a large variety of response types within a single framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It also works for data types which these methods were not designed to handle. The method usually produces very sparse models, which makes biological interpretation simpler and more focused.

SUBMITTER: Kiiveri HT

PROVIDER: S-EPMC2390543 | biostudies-literature | 2008 Apr

REPOSITORIES: biostudies-literature

Publications

A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations.

Kiiveri, Harri T.

BMC Bioinformatics, 2008 Apr 15

PMID: 18410693

Similar Datasets

Project description: Vascular leiomyosarcomas are a rare subtype of leiomyosarcomas that most commonly affect the inferior vena cava and account for 5% of all leiomyosarcomas. These tumors are aggressive malignant tumors for which adjuvant modalities have not shown increased efficacy compared with surgery. To evaluate the outcomes of patients with vascular leiomyosarcoma and the association between vascular leiomyosarcomas and immunohistochemical molecular markers, to determine their potential prognostic and therapeutic utility. Retrospective medical record review of a cohort of 77 patients who presented to the University of Texas MD Anderson Cancer Center in Houston during the period from January 1993 to April 2012. Data were analyzed during the period from November 2012 to May 2015. All of the patients received a confirmed diagnosis of vascular leiomyosarcoma. Immunohistochemical studies for biomarkers were performed on a tissue microarray that included 26 primary specimens of vascular leiomyosarcoma. Demographic and clinical factors were evaluated to assess clinical course, patterns of recurrence, and survival outcomes for patients with primary vascular leiomyosarcoma. A univariate Cox proportional hazards model was used to correlate disease-specific survival and time to recurrence with potential prognostic indicators. Sixty-three patients with localized disease who underwent surgical resection formed the study population, and their data were used for subsequent outcomes analysis. The median age at diagnosis was 58 years (range, 22-78 years). The majority of patients were female (41 patients [65%]) and white (51 patients [81%]). The 5-year disease-specific survival rate after tumor resection was 65%. The median time to local recurrence was 43 months, the median time to distant recurrence was 25 months, and the median time to concurrent local and distant recurrences was 15 months (P = .04). Strong expressions of cytoplasmic β-catenin (hazard ratio, 5.33 [95% CI, 0.97-29.30]; P = .06) and insulinlike growth factor 1 receptor (hazard ratio, 2.74 [95% CI, 1.14-6.56]; P = .02) were associated with inferior disease-specific survival. Vascular leiomyosarcomas are aggressive malignant tumors, with high recurrence rates. Expressions of β-catenin and insulinlike growth factor 1 receptor were associated with poor disease-specific survival. Prospective studies should evaluate the clinical and therapeutic utility of these molecular markers.
