Dataset Information

A multispecies polyadenylation site model.

ABSTRACT: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, poly(A) sites in other species, such as plants and invertebrates, exhibit high deviation from this genomic structure, making the construction of a general poly(A) site recognition model challenging. We surveyed nine poly(A) site prediction methods published between 1999 and 2011. All methods exploit the skewed nucleotide profile across the poly(A) sites and the highly conserved poly(A) signal as the primary features for recognition. These methods typically use a large number of features, which increases the dimensionality of the models to crippling degrees, and typically are not validated against many kinds of genomes. We propose a poly(A) site model that employs minimal features to capture the essence of poly(A) sites and yet produces better prediction accuracy across diverse species. Our model consists of three di- or trinucleotide profiles identified through principal component analysis, and the predicted nucleosome occupancy flanking the poly(A) sites. We validated our model using two machine learning methods: logistic regression and linear discriminant analysis. Results show that the models achieve 85-92% sensitivity and 85-96% specificity in seven animals and plants. When we applied a model trained on one species to predict poly(A) sites in other species, the sensitivity scores correlated with phylogenetic distances. A four-feature model geared towards small motifs was sufficient to accurately learn and predict poly(A) sites across eukaryotes.
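To make the terms in the abstract concrete, here is a minimal Python sketch of two of its ingredients: a dinucleotide or trinucleotide frequency profile for a sequence window, and the sensitivity and specificity used to report accuracy. It is not the authors' implementation. The function names, the choice of window, and the handling of ambiguous bases are all assumptions made for illustration, and the PCA-based profile selection and nucleosome occupancy feature described in the abstract are not reproduced.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over ACGT in seq (hypothetical helper).

    Assumption: RNA 'U' is treated as 'T', and k-mers containing any other
    symbol (e.g. 'N') are ignored rather than counted.
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Labels are 1 for a true poly(A) site and 0 for a non-site.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Toy window containing the canonical mammalian hexanucleotide AAUAAA.
    profile = kmer_frequencies("AAUAAA", 2)
    print({m: f for m, f in profile.items() if f > 0})
    print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a fuller pipeline, profiles like these would be computed for windows around candidate sites and passed to a classifier such as logistic regression or linear discriminant analysis, the two methods named in the abstract.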

SUBMITTER: Ho ES

PROVIDER: S-EPMC3549828 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature


Publications

A multispecies polyadenylation site model.

Ho ES, Gunderson SI, Duffy S

BMC Bioinformatics, 2013-01-21


PMID: 23368518

Similar Datasets

Project description: Conservation of biological communities requires accurate estimates of abundance for multiple species. Recent advances in estimating abundance of multiple species, such as Bayesian multispecies N-mixture models, account for multiple sources of variation, including detection error. However, false-positive errors (misidentification or double counts), which are prevalent in multispecies data sets, remain largely unaddressed. The dependent-double observer (DDO) method is an emerging method that both accounts for detection error and is suggested to reduce the occurrence of false positives because it relies on two observers working collaboratively to identify individuals. To date, the DDO method has not been combined with advantages of multispecies N-mixture models. Here, we derive an extension of a multispecies N-mixture model using the DDO survey method to create a multispecies dependent double-observer abundance model (MDAM). The MDAM uses a hierarchical framework to account for biological and observational processes in a statistically consistent framework while using the accurate observation data from the DDO survey method. We demonstrate that the MDAM accurately estimates abundance of multiple species with simulated and real multispecies data sets. Simulations showed that the model provides both precise and accurate abundance estimates, with average credible interval coverage across 100 repeated simulations of 94.5% for abundance estimates and 92.5% for detection estimates. In addition, 92.2% of abundance estimates had a mean absolute percent error between 0% and 20%, with a mean of 7.7%. We present the MDAM as an important step forward in expanding the applicability of the DDO method to a multispecies setting. Previous implementation of the DDO method suggests the MDAM can be applied to a broad array of biological communities. We suggest that researchers interested in assessing biological communities consider the MDAM as a tool for deriving accurate, multispecies abundance estimates.
