Dataset Information

Cross-validation under separate sampling: strong bias and how to correct it.

ABSTRACT:

Motivation

It is commonly assumed in pattern recognition that cross-validation error estimation is 'almost unbiased' as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.

Results

We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an 'almost unbiased' theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
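The dependence of the bias on the gap between sampling ratios and true population probabilities can be illustrated with a small piece of arithmetic. The sketch below is not the authors' estimator. It assumes the true prior of class 0 (`prior0`) is known, and it shows only the re-weighting of class-specific held-out error rates, not any change to how folds are drawn. All function names and counts are hypothetical.

```python
def pooled_cv_error(errors0, n0, errors1, n1):
    """Classical pooled estimate: all held-out errors over all held-out points.

    This implicitly weights the class error rates by the sampling ratio
    n0 / (n0 + n1), which under separate sampling is fixed by design.
    """
    return (errors0 + errors1) / (n0 + n1)


def prior_weighted_cv_error(errors0, n0, errors1, n1, prior0):
    """Weight class-specific held-out error rates by an assumed known prior."""
    if not 0.0 <= prior0 <= 1.0:
        raise ValueError("prior0 must lie in [0, 1]")
    eps0 = errors0 / n0  # held-out error rate on class 0
    eps1 = errors1 / n1  # held-out error rate on class 1
    return prior0 * eps0 + (1.0 - prior0) * eps1


# Hypothetical counts: 50 points sampled from each class (sampling ratio 0.5),
# with 5 and 20 held-out errors, and an assumed true class-0 prior of 0.9.
pooled = pooled_cv_error(5, 50, 20, 50)
weighted = prior_weighted_cv_error(5, 50, 20, 50, prior0=0.9)
print(f"pooled: {pooled:.3f}, prior-weighted: {weighted:.3f}")
```

With these made-up numbers the pooled estimate is 0.25 and the prior-weighted estimate is 0.13. The two coincide only when the sampling ratio equals the prior.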

Availability and implementation

The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.

SUBMITTER: Braga-Neto UM

PROVIDER: S-EPMC4296143 | biostudies-literature | 2014 Dec

REPOSITORIES: biostudies-literature

Publications

Cross-validation under separate sampling: strong bias and how to correct it.

Braga-Neto UM, Zollanvari A, Dougherty ER

Bioinformatics (Oxford, England), 2014-08-13, issue 23

PMID: 25123902

Similar Datasets

Project description: The rumen is a complex ecosystem that plays a critical role in our efforts to improve feed efficiency of cattle and reduce their environmental impacts. Sequencing of the 16S rRNA gene provides a powerful tool to survey the bacterial and some archaeal communities. Oral stomach tubing of a cow is a rapid, cost-effective alternative to rumen cannulation for acquiring rumen samples. In this study, we determined how sampling method (oral stomach tubing vs cannulated grab sample), as well as rumen fraction type (liquid vs solid), bias the bacterial and archaeal communities observed. Liquid samples were further divided into liquid strained through cheesecloth and unstrained. Fecal samples were also collected to determine how these differed from the rumen sample types. The abundance of major archaeal communities was not different at the family level in samples acquired via rumen cannula or stomach tube. In contrast to the stable archaeal communities across sample type, the bacterial order WCHB1-41 (phylum Kiritimatiellaeota) and the family Prevotellaceae were enriched in both liquid strained and unstrained samples as compared to grab samples. However, these liquid samples had significantly lower abundance of Lachnospiraceae compared with grab samples. Solid samples strained of rumen liquid most closely resembled the grab samples containing both rumen liquid and solid particles obtained directly from the rumen cannula; therefore, inclusion of particulate matter is important for an accurate representation of the rumen bacteria. Stomach tube samples were the most variable and were most representative of the liquid phase. In comparison with a grab sample, stomach tube samples had significantly lower abundance of Lachnospiraceae, Fibrobacter and Treponema. Fecal samples did not reflect the community composition of the rumen, as fecal samples had significantly higher relative abundance of Ruminococcaceae and significantly lower relative abundance of Lachnospiraceae compared with grab samples.

Project description: Objectives: To determine whether studying aetiological pathways of depression, in particular the well-established determinant of childhood trauma, only in a specialised mental healthcare setting can yield biased estimates of the aetiological association, given that the majority of individuals are treated in primary care settings. Design and setting: Two databanks were used in this study. The Canadian Community Health Survey (CCHS) on Mental Health and Well-Being 2012 is a national survey about mental health of adult Canadians. It measured common mental disorders and utilisation of services. The Signature mental health biobank includes adults from the Island of Montreal recruited at the emergency department of a major university mental health centre. After consent, participants filled standardised psychosocial questionnaires, gave blood samples, and their clinical diagnosis was recorded. We compared the cohort of depressed individuals from CCHS and Signature in contact with specialised services with those in contact with primary care or not in treatment. Participants: There were 860 participants with depression in the CCHS and 207 participants with depression in the Signature Bank. Primary and secondary outcomes: The Childhood Experiences of Violence Questionnaire was used to measure childhood trauma in both settings. Childhood trauma is associated with depression as with other common mental and physical disorders. Results: Depression was more strongly associated with childhood abuse among CCHS participants who reported having been hospitalised for psychiatric treatment or having seen a psychiatrist, and among Signature participants, than among individuals with depression who were treated in primary care settings or did not seek mental healthcare in the preceding year. Conclusions: Berkson's bias limits the generalisability of aetiological associations observed in such university-hospital-based biobanks, but the problem can be remedied by broadening recruitment to primary care settings and the general population.
