Dataset Information

Integrating overlapping structures and background information of words significantly improves biological sequence comparison.

ABSTRACT: Word-based models have achieved promising results in sequence comparison. However, how to use two important statistical properties of words in biological sequences, namely their overlapping structures and their background information, to improve sequence comparison remains an open problem. This paper proposes a new statistical method that integrates the overlapping structures and the background information of the words in biological sequences. To assess the effectiveness of this integration for sequence comparison, two sets of evaluation experiments were conducted. The first, performed via receiver operating characteristic (ROC) curve analysis, applies the proposed method to discrimination between functionally related regulatory sequences and unrelated sequences, intron and exon. The second evaluates the performance of the proposed method, using the F-measure, in clustering Hepatitis E virus genotypes. The results demonstrate that the proposed method, by integrating the overlapping structures and the background information of words, significantly improves biological sequence comparison and outperforms existing models.

SUBMITTER: Dai Q

PROVIDER: S-EPMC3213098 | biostudies-literature

REPOSITORIES: biostudies-literature
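
To illustrate what "background information" can mean in a word-based comparison, the Python sketch below counts overlapping words of length k in two sequences, subtracts the counts expected under a simple independent-letter background model, and compares the centered count vectors by cosine similarity. This is not the statistic proposed in the paper described above. The function names, the independent-letter background model, and the cosine measure are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_words(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def background_freqs(seq, alphabet="ACGT"):
    """Estimate single-letter frequencies (assumed i.i.d. background model)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in alphabet)
    return {a: counts[a] / total for a in alphabet}


def centered_counts(seq, k, alphabet="ACGT"):
    """Observed word counts minus the counts expected under the background."""
    freqs = background_freqs(seq, alphabet)
    observed = count_words(seq, k)
    n_windows = len(seq) - k + 1
    centered = {}
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        prob = 1.0
        for a in word:
            prob *= freqs[a]  # word probability under independence
        centered[word] = observed[word] - n_windows * prob
    return centered


def centered_similarity(seq_a, seq_b, k=2):
    """Cosine similarity of the two background-centered count vectors."""
    a = centered_counts(seq_a, k)
    b = centered_counts(seq_b, k)
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Comparing a sequence with itself, for example `centered_similarity("ACGTACGTAC", "ACGTACGTAC")`, gives 1.0 up to floating-point rounding. The sketch assumes non-empty sequences over the given alphabet and does not model the overlapping structure of words.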

Similar Datasets

Minimally-overlapping words for sequence similarity search.

Project description:Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary data are available at Bioinformatics online.

| S-EPMC8016470 | biostudies-literature
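
The sparse-seeding idea in the entry above can be sketched in a few lines of Python: seeds are placed only where one of a chosen set of words begins, and a word set can be checked for overlaps by testing whether the end of one word can coincide with the start of another. The function names are hypothetical and this is not the noverlap software's interface. It is only a minimal illustration, using the example words given in the description.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any of the given words occurs."""
    return [i for i in range(len(seq))
            if any(seq.startswith(w, i) for w in words)]


def overlap_count(words):
    """Count the ways one word can start inside another (or inside itself).

    A triple (u, v, shift) is counted when an occurrence of v can begin
    `shift` letters after the start of an occurrence of u, with
    1 <= shift < len(u), without the two words disagreeing on shared letters.
    """
    total = 0
    for u in words:
        for v in words:
            for shift in range(1, len(u)):
                shared = min(len(u) - shift, len(v))
                if u[shift:shift + shared] == v[:shared]:
                    total += 1
    return total


words = ["ac", "at", "gc", "gt"]
print(word_positions("gacgtatgc", words))  # seed positions for this toy sequence
print(overlap_count(words))                # overlaps within the word set
print(overlap_count(["ac", "ca"]))         # a word set that does overlap
```

For the word set ac, at, gc, gt no word can start inside another, so seed positions are never adjacent. A set such as ac, ca allows overlaps, which lets seeds clump.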

MicroRNA Target Site Identification by Integrating Sequence and Binding Information

Project description:High-throughput sequencing has opened numerous possibilities for the identification of regulatory RNA-binding events. Cross-linking and immunoprecipitation of Argonaute protein members can pinpoint microRNA target sites within tens of bases, but leaves the identity of the microRNA unresolved. A flexible computational framework that integrates sequence with cross-linking features reliably identifies the microRNA family involved in each binding event, considerably outperforms sequence-only approaches, and quantifies the prevalence of noncanonical binding modes. Ago2 (Argonaute 2) PAR-CLIP and RNA deep sequencing of Epstein-Barr virus B95.8-infected Lymphoblastoid Cell Lines (LCLs)

2013-05-25 | E-GEOD-46611 | biostudies-arrayexpress

MicroRNA Target Site Identification by Integrating Sequence and Binding Information

2013-05-25 | GSE46611 | GEO

Overlapping genes and the proteins they encode differ significantly in their sequence composition from non-overlapping genes.

Project description:Overlapping genes represent a fascinating evolutionary puzzle, since they encode two functionally unrelated proteins from the same DNA sequence. They originate by a mechanism of overprinting, in which point mutations in an existing frame allow the expression (the "birth") of a completely new protein from a second frame. In viruses, in which overlapping genes are abundant, these new proteins often play a critical role in infection, yet they are frequently overlooked during genome annotation. This results in erroneous interpretation of mutational studies and in a significant waste of resources. Therefore, overlapping genes need to be correctly detected, especially since they are now thought to be abundant also in eukaryotes. Developing better detection methods and conducting systematic evolutionary studies require a large, reliable benchmark dataset of known cases. We thus assembled a high-quality dataset of 80 viral overlapping genes whose expression is experimentally proven. Many of them were not present in databases. We found that overall, overlapping genes differ significantly from non-overlapping genes in their nucleotide and amino acid composition. In particular, the proteins they encode are enriched in high-degeneracy amino acids and depleted in low-degeneracy ones, which may alleviate the evolutionary constraints acting on overlapping genes. Principal component analysis revealed that the vast majority of overlapping genes follow a similar composition bias, despite their heterogeneity in length and function. Six proven mammalian overlapping genes also followed this bias. We propose that this apparently near-universal composition bias may either favour the birth of overlapping genes, or/and result from selection pressure acting on them.

| S-EPMC6195259 | biostudies-literature

Integrating genetic, transcriptional, and biological information provides insights into obesity.

Project description:OBJECTIVE: Indices of body fat distribution are heritable, but few genetic signals have been reported from genome-wide association studies (GWAS) of computed tomography (CT) imaging measurements of body fat distribution. We aimed to identify genes associated with adiposity traits and the key drivers that are central to adipose regulatory networks. SUBJECTS: We analyzed gene transcript expression data in blood from participants in the Framingham Heart Study, a large community-based cohort (n up to 4303), and implemented an integrative analysis of these data and existing biological information. RESULTS: Our association analyses identified unique and common gene expression signatures across several adiposity traits, including body mass index, waist-hip ratio, waist circumference, and CT-measured indices, including volume and quality of visceral and subcutaneous adipose tissues. We identified six enriched KEGG pathways and two co-expression modules for further exploration of adipose regulatory networks. The integrative analysis revealed four gene sets (Apoptosis, p53 signaling pathway, Proteasome, Ubiquitin-mediated proteolysis) and two co-expression modules with significant genetic variants and 94 key drivers/genes whose local networks were enriched with adiposity-associated genes, suggesting that these enriched pathways or modules have genetic effects on adiposity. Most identified key driver genes are cancer related and are involved in essential biological processes such as controlling cell cycle, DNA repair, and degradation of regulatory proteins. CONCLUSIONS: Our integrative analysis of genetic, transcriptional, and biological information provides a list of compelling candidates for further follow-up functional studies to uncover the biological mechanisms underlying obesity. These candidates highlight the value of examining CT-derived and central adiposity traits.

| S-EPMC6405310 | biostudies-literature

MicroRNA target site identification by integrating sequence and binding information.

Project description:High-throughput sequencing has opened numerous possibilities for the identification of regulatory RNA-binding events. Cross-linking and immunoprecipitation of Argonaute proteins can pinpoint a microRNA (miRNA) target site within tens of bases but leaves the identity of the miRNA unresolved. A flexible computational framework, microMUMMIE, integrates sequence with cross-linking features and reliably identifies the miRNA family involved in each binding event. It considerably outperforms sequence-only approaches and quantifies the prevalence of noncanonical binding modes.

| S-EPMC3818907 | biostudies-literature

Species abundance information improves sequence taxonomy classification accuracy.

Project description:Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.

| S-EPMC6789115 | biostudies-literature
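
The role of the equal-likelihood assumption in the entry above can be seen in a small Python sketch of Bayes' rule: the posterior for each taxon is proportional to the sequence likelihood times a prior, and the prior is either uniform or taken from abundance weights. The function name, the toy likelihoods, and the weights are illustrative assumptions. This is not the q2-clawback interface.

```python
from math import exp, log


def posterior(log_likelihoods, priors=None):
    """Combine per-taxon log-likelihoods with priors into posterior probabilities.

    log_likelihoods: dict mapping taxon -> log P(sequence | taxon)
    priors: dict mapping taxon -> prior weight (must be positive);
            if None, every taxon is assumed equally likely.
    """
    taxa = list(log_likelihoods)
    if priors is None:
        priors = {t: 1.0 / len(taxa) for t in taxa}
    scores = {t: log_likelihoods[t] + log(priors[t]) for t in taxa}
    top = max(scores.values())  # subtract the maximum for numerical stability
    unnormalized = {t: exp(s - top) for t, s in scores.items()}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}


# Toy example: the sequence fits taxon A slightly better than taxon B.
toy = {"A": log(0.6), "B": log(0.4)}
uniform = posterior(toy)                        # equal-likelihood assumption
weighted = posterior(toy, {"A": 0.1, "B": 0.9})  # hypothetical abundance weights
print(max(uniform, key=uniform.get), max(weighted, key=weighted.get))
```

With the uniform prior the better-fitting taxon A is chosen. With the hypothetical abundance weights, taxon B becomes the more probable assignment because it is assumed to be nine times more common in the environment.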

Integrating thermodynamic and sequence contexts improves protein-RNA binding prediction.

Project description:Predicting RNA-binding protein (RBP) specificity is important for understanding gene expression regulation and RNA-mediated enzymatic processes. It is widely believed that RBP binding specificity is determined by both the sequence and structural contexts of RNAs. Existing approaches, including traditional machine learning algorithms and, more recently, deep learning models, have been extensively applied to integrate RNA sequence and its predicted or experimental RNA structural probabilities for improving the accuracy of RBP binding prediction. Such models were trained mostly on large-scale in vitro datasets, such as the RNAcompete dataset. However, in RNAcompete, most synthetic RNAs are unstructured, which prevents machine learning methods from effectively extracting RBP-binding structural preferences. Furthermore, RNA structure may be variable or multi-modal according to both theoretical and experimental evidence. In this work, we propose ThermoNet, a thermodynamic prediction model that integrates a new sequence-embedding convolutional neural network model over a thermodynamic ensemble of RNA secondary structures. First, the sequence-embedding convolutional neural network generalizes the existing k-mer based methods by jointly learning convolutional filters and k-mer embeddings to represent RNA sequence contexts. Second, the thermodynamic average of deep-learning predictions is able to explore structural variability and improves the prediction, especially for structured RNAs. Extensive experiments demonstrate that our method significantly outperforms existing approaches, including RCK, DeepBind and several other recent state-of-the-art methods, for predictions on both in vitro and in vivo data. The implementation of ThermoNet is available at https://github.com/suyufeng/ThermoNet.

| S-EPMC6752863 | biostudies-literature

Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

Project description:Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.

| S-EPMC4410667 | biostudies-literature

LongSAGE analysis significantly improves genome annotation

Project description:Owing to their increased tag length, LongSAGE tags are expected to be more reliable in direct assignment to genome sequences. Therefore, we evaluated the use of LongSAGE data in genome annotation by using our LongSAGE dataset of 202 015 tags (consisting of 41 718 unique tags), experimentally generated from mouse embryonic tail libraries. RESULTS: A fraction of LongSAGE tags could not be unambiguously assigned to their genes, due to the presence of widely conserved sequences downstream of particular CATG anchor sites. The presence of alternative forms of transcripts was confirmed in 45% of all detected genes. Surprisingly, a large fraction of LongSAGE tags with hits to the genome (66%) could not be assigned to any gene annotated in EnsEMBL. Among such cases, 2098 LongSAGE tags fell into a region containing a putative gene predicted by GenScan, providing experimental evidence for the presence of real genes, while 9112 genes were found to be missing from or wrongly annotated by the EnsEMBL pipeline. CONCLUSIONS: LongSAGE transcriptome data can significantly improve genome annotation by identifying novel genes and alternative transcripts, even in the case of the thus far best-characterized organisms like the mouse.

2005-07-20 | E-GEOD-2967 | biostudies-arrayexpress
