Dataset Information

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis.

ABSTRACT:

Background

One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While many comparative studies have compared different learning principles or different statistical models, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.

Results

With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly-used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.
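The Gaussian-like and Laplace-like behaviour can be illustrated with a toy case. The sketch below is not the paper's derivation and does not use the Jstacs API. It assumes a two-symbol alphabet with a Beta(a, b) prior on the probability theta, rewritten in terms of lam = logit(theta), which gives the unnormalised log-density a*lam - (a + b)*log(1 + exp(lam)). The function names and the values a = 2, b = 3 are hypothetical choices for illustration.

```python
import math


def log_prior(lam, a, b):
    """Unnormalised log-density of a Beta(a, b) prior on theta,
    expressed in lam = logit(theta) (toy two-symbol assumption)."""
    # Numerically stable log(1 + exp(lam)).
    softplus = max(lam, 0.0) + math.log1p(math.exp(-abs(lam)))
    return a * lam - (a + b) * softplus


def mode(a, b):
    """Location of the maximum: theta = a / (a + b), i.e. lam = log(a / b)."""
    return math.log(a / b)


def curvature_at_mode(a, b):
    """Second derivative of log_prior at the mode; it is negative, so the
    log-density is locally a downward parabola (Gaussian-like)."""
    return -a * b / (a + b)


def slope(lam, a, b, h=1e-4):
    """Central finite-difference estimate of d log_prior / d lam."""
    return (log_prior(lam + h, a, b) - log_prior(lam - h, a, b)) / (2 * h)


if __name__ == "__main__":
    a, b = 2.0, 3.0  # hypothetical hyperparameters
    m = mode(a, b)
    d = 0.01
    exact = log_prior(m + d, a, b) - log_prior(m, a, b)
    quadratic = 0.5 * curvature_at_mode(a, b) * d * d
    print("near the mode:", exact, "vs quadratic approximation:", quadratic)
    print("slope far right:", slope(30.0, a, b))   # approaches -b
    print("slope far left:", slope(-30.0, a, b))   # approaches +a
```

Under these assumptions the log-density is approximately quadratic around its maximum and approximately linear far from it, which is the qualitative shape of a Gaussian and a Laplace log-density respectively.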

Conclusions

We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable easy application to comparative studies of different learning principles in the field of sequence analysis.

SUBMITTER: Keilwagen J

PROVIDER: S-EPMC2859755 | biostudies-literature | 2010 Mar

REPOSITORIES: biostudies-literature

Publications

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis.

Jens Keilwagen, Jan Grau, Stefan Posch, Ivo Grosse

BMC Bioinformatics, 2010-03-22

PMID: 20307305

Similar Datasets

Project description: Rapid advancements in sequencing technologies along with falling costs present widespread opportunities for microbiome studies across a vast and diverse array of environments. These impressive technological developments have been accompanied by a considerable growth in the number of methodological variables, including sampling, storage, DNA extraction, primer pairs, sequencing technology, chemistry version, read length, insert size, and analysis pipelines, amongst others. This increase in variability threatens to compromise both the reproducibility and the comparability of studies conducted. Here we perform the first reported study comparing both amplicon and shotgun sequencing for the three leading next-generation sequencing technologies. These were applied to six human stool samples using Illumina HiSeq, MiSeq and Ion PGM shotgun sequencing, as well as amplicon sequencing across two variable 16S rRNA gene regions. Notably, we found that the factor responsible for the greatest variance in microbiota composition was the chosen methodology rather than the natural inter-individual variance, which is commonly one of the most significant drivers in microbiome studies. Amplicon sequencing suffered from this to a large extent, and this issue was particularly apparent when the 16S rRNA V1-V2 region amplicons were sequenced with MiSeq. Somewhat surprisingly, the choice of taxonomic binning software for shotgun sequences proved to be of crucial importance with even greater discriminatory power than sequencing technology and choice of amplicon. Optimal N50 assembly values for the HiSeq were obtained for 10 million reads per sample, whereas the applied MiSeq and PGM sequencing depths proved less sufficient for shotgun sequencing of stool samples. The latter technologies, on the other hand, provide a better basis for functional gene categorisation, possibly due to their longer read lengths. Hence, in addition to highlighting methodological biases, this study demonstrates the risks associated with comparing data generated using different strategies. We also recommend that laboratories with particular interests in certain microbes should optimise their protocols to accurately detect these taxa using different techniques.

Project description: Objectives: To determine the extent of agreement between four commonly used definitions of multiple chronic conditions (MCCs) and compare each definition's ability to predict 30-day hospital readmissions. Design: Retrospective cohort study. Setting: National Medicare claims data. Participants: Random sample of Medicare beneficiaries discharged from the hospital from 2005 to 2009 (n = 710,609). Measurements: Baseline chronic conditions were determined for each participant using four definitions of MCC. The primary outcome was all-cause 30-day hospital readmission. Agreement between MCC definitions was measured, and sensitivities and specificities for each definition's ability to identify patients experiencing a future readmission were calculated. Logistic regression was used to assess the ability of each MCC definition to predict 30-day hospital readmission. Results: The sample prevalence of hospitalized Medicare beneficiaries with two or more chronic conditions ranged from 18.6% (Johns Hopkins Adjusted Clinical Groups (ACG) Case-Mix System software) to 92.9% (Medicare Chronic Condition Warehouse (CCW)). There was slight to moderate agreement (kappa = 0.03-0.44) between pair-wise combinations of MCC definitions. CCW-defined MCC was the most sensitive (sensitivity 95.4%, specificity 7.4%), and ACG-defined MCC was the most specific (sensitivity 32.7%, specificity 83.2%) predictor of being readmitted. In the fully adjusted model, the risk of readmission was higher for those with chronic condition Special Needs Plan (c-SNP)-defined MCCs (odds ratio (OR) = 1.50, 95% confidence interval (CI) = 1.47-1.52), Charlson Comorbidity Index-defined MCCs (OR = 1.45, 95% CI = 1.42-1.47), ACG-defined MCCs (OR = 1.22, 95% CI = 1.19-1.25), and CCW-defined MCCs (OR = 1.15, 95% CI = 1.11-1.19) than for those without MCCs. Conclusion: MCC definitions demonstrate poor agreement and should not be used interchangeably. The two definitions with the greatest agreement (CCI, c-SNP) were also the best predictors of 30-day hospital readmissions.
