Dataset Information

N-gram analysis of 970 microbial organisms reveals presence of biological language models.

ABSTRACT:

Background

It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis was applied to compare the whole proteome sequences of 44 organisms. That work showed that a few particular amino acid n-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time the proteomes of only 44 organisms were available, which limited the generalization of this hypothesis. Today nearly 1,000 genome sequences and their corresponding translated sequences are available, making it feasible to test the existence of biological language models across the evolutionary tree.
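The signature idea described above can be illustrated with a minimal sketch. It is not the method or code of the toolkits used in the study: the function names, the toy sequences, and the `min_ratio` and `floor` thresholds are all assumptions made for illustration. It counts overlapping amino acid n-grams per proteome and flags those that are far more frequent in one proteome than in any other.

```python
from collections import Counter


def ngram_counts(proteins, n):
    """Count overlapping amino acid n-grams over a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def signature_ngrams(target, others, min_ratio=5.0, floor=1e-6):
    """Return n-grams at least `min_ratio` times more frequent in `target`
    than in every proteome in `others`.

    `min_ratio` and `floor` are illustrative thresholds, not values from the paper.
    """
    result = []
    for gram, freq in target.items():
        max_other = max((o.get(gram, 0.0) for o in others), default=0.0)
        if freq / max(max_other, floor) >= min_ratio:
            result.append(gram)
    return sorted(result)


# Toy sequences (hypothetical, far shorter than real proteomes).
proteome_a = ["MKKLLKKA", "KKLAKK"]
proteome_b = ["MAGGSGGA", "GGSAGG"]

freq_a = relative_frequencies(ngram_counts(proteome_a, 2))
freq_b = relative_frequencies(ngram_counts(proteome_b, 2))
signatures_a = signature_ngrams(freq_a, [freq_b])
print(signatures_a)
```

On real data the inputs would be whole proteomes, and the thresholds would need to be chosen with the proteome sizes in mind.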

Results

We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. Taking the statistical n-gram model of one organism as the reference and computing the cross-perplexity of all other microbial proteomes against it, we found cross-perplexity to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model built from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms ranged from 15.59 to 29.5 and was proportional to their branching distance from S. flexneri in the evolutionary tree. The organisms of this genus, which happen to be pathotypes of E. coli, also have perplexity values closest to those of E. coli.
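As a rough illustration of the cross-perplexity measure, the sketch below trains an n-gram model on a reference proteome and scores other proteomes against it. Perplexity is taken here as exp(-(1/N) Σ log p(n-gram | context)) over the N n-grams of the scored sequences. The add-one smoothing, the function names, and the toy sequences are assumptions for illustration; the toolkits named above may estimate probabilities differently, and the study's example used 4-grams rather than the bigrams used here to keep the toy data small.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram_model(proteins, n):
    """Collect n-gram counts and (n-1)-gram context counts from a proteome."""
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def perplexity(model, proteins, n, alphabet_size=len(AMINO_ACIDS)):
    """Perplexity of `proteins` under `model`, with add-one smoothing.

    Add-one smoothing is an assumption of this sketch, chosen so that
    unseen n-grams receive a non-zero probability.
    """
    ngram_counts, context_counts = model
    log_prob_sum = 0.0
    num_ngrams = 0
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            prob = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + alphabet_size)
            log_prob_sum += math.log(prob)
            num_ngrams += 1
    if num_ngrams == 0:
        raise ValueError("sequences are shorter than n")
    return math.exp(-log_prob_sum / num_ngrams)


# Toy proteomes (hypothetical): one reference, one near relative, one distant.
reference = ["MKKLLKKAMKKLLKKA"]
close_relative = ["MKKLLKKAMKKLAKKA"]
distant = ["GGSAGGSWGGSAGGPW"]

n = 2
model = train_ngram_model(reference, n)
self_ppl = perplexity(model, reference, n)
close_ppl = perplexity(model, close_relative, n)
distant_ppl = perplexity(model, distant, n)
print(self_ppl, close_ppl, distant_ppl)
```

In this toy setting the reference scores lowest against its own model, the near relative slightly higher, and the unrelated sequence highest, which mirrors the ordering the abstract reports for S. flexneri and its neighbours.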

Conclusion

Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences that are abundant in one organism but occur very rarely in other organisms, thereby serving as proteome signatures. Furthermore, perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.

SUBMITTER: Osmanbeyoglu HU

PROVIDER: S-EPMC3027111 | biostudies-literature | 2011 Jan

REPOSITORIES: biostudies-literature

Publications

N-gram analysis of 970 microbial organisms reveals presence of biological language models.

Osmanbeyoglu HU, Ganapathiraju MK

BMC Bioinformatics, 2011 Jan 10

PMID: 21219653

Similar Datasets

Project description: The aim of this study was to identify, quantify and prioritize for the first time the sources of uncertainty in a mechanistic model describing the anaerobic-aerobic metabolism of phosphorus accumulating organisms (PAO) in enhanced biological phosphorus removal (EBPR) systems. These wastewater treatment systems play an important role in preventing eutrophication, and metabolic models provide an advanced tool for improving their stability via system design, monitoring and prediction. To this end, a global sensitivity analysis was conducted using standard regression coefficients and Sobol sensitivity indices, taking into account the effect of 39 input parameters on 10 output variables. Input uncertainty was characterized with data in the literature and propagated to the output using the Monte Carlo method. The low degree of linearity between input parameters and model outputs showed that model simplification by linearization can be pursued only in very well defined circumstances. Differences between first and total-order sensitivity indices showed that variance in model predictions was due to interactions between combinations of inputs, as opposed to the direct effect of individual inputs. The major sources of uncertainty affecting the prediction of liquid phase concentrations, as well as intra-cellular glycogen and poly-phosphate, were due to 64% of the input parameters. In contrast, the contribution to variance in intra-cellular PHA constituents was uniformly distributed among all inputs. In addition to the intra-cellular biomass constituents, notably PHB, PH2MV and glycogen, uncertainty with respect to input parameters directly related to anaerobic propionate uptake, aerobic poly-phosphate formation, glycogen formation and temperature contributed most to the variance of all model outputs. Based on the distribution of total-order sensitivities, characterization of the influent stream and intra-cellular fractions of PHA can be expected to significantly improve model reliability. The variance of EBPR metabolic model predictions was quantified. The means to account for this variance, with respect to each quantity of interest, given knowledge of the corresponding input uncertainties, was prescribed. On this basis, possible avenues and prerequisites to simplify EBPR metabolic models for PAO, both structurally via linearization and by reduction of the number of non-influential variables, were outlined.

Project description: Gepotidacin (formerly called GSK2140944) is a novel triazaacenaphthylene bacterial topoisomerase inhibitor with in vitro activity against conventional and biothreat pathogens, including Staphylococcus aureus and Streptococcus pneumoniae. Using neutropenic murine thigh and lung infection models, the pharmacokinetics-pharmacodynamics (PK-PD) of gepotidacin against S. aureus and S. pneumoniae were characterized. Candidate models were fit to single-dose PK data from uninfected mice (for doses of 16 to 128 mg/kg of body weight given subcutaneously [s.c.]). Dose fractionation studies (1 isolate/organism; 2 to 512 mg/kg/day) and dose-ranging studies (5 isolates/organism; 2 to 2,048 mg/kg/day; MIC ranges of 0.5 to 2 mg/liter for S. aureus and 0.125 to 1 mg/liter for S. pneumoniae) were conducted. The presence of an in vivo postantibiotic effect (PAE) was also evaluated. Relationships between the change from baseline in log10 CFU at 24 h and the ratio of the free-drug plasma area under the concentration-time curve (AUC) to the MIC (AUC/MIC ratio), the ratio of the maximum concentration of drug in plasma (Cmax) to the MIC (Cmax/MIC ratio), and the percentage of a 24-h period that the drug concentration exceeded the MIC (%T>MIC) were evaluated using Hill-type models. Plasma and epithelial lining fluid (ELF) PK data were best fit by a four-compartment model with linear distributional clearances, a capacity-limited clearance, and a first-order absorption rate. The ELF penetration ratio in uninfected mice was 0.65. Since the growth of both organisms was poor in the murine lung infection model, lung efficacy data were not reported. As determined using the murine thigh infection model, the free-drug plasma AUC/MIC ratio was the PK-PD index most closely associated with efficacy (r² = 0.936 and 0.897 for S. aureus and S. pneumoniae, respectively). Median free-drug plasma AUC/MIC ratios of 13.4 and 58.9 for S. aureus, and 7.86 and 16.9 for S. pneumoniae, were associated with net bacterial stasis and a 1-log10 CFU reduction from baseline, respectively. Dose-independent PAE durations of 3.07 to 12.5 h and 5.25 to 8.46 h were demonstrated for S. aureus and S. pneumoniae, respectively.

Project description: Background: Current research on amniotic fluid (AF) microbiota yields contradictory data, necessitating an accurate, comprehensive, and scientifically rigorous evaluation. Objective: This study aimed to characterise the microbial features of AF and explore the correlation between microbial information and clinical parameters. Methods: 76 AF samples were collected in this prospective cohort study. Fourteen samples were utilised to establish the nanopore metagenomic sequencing methodology, whereas the remaining 62 samples underwent a final statistical analysis along with clinical information. Negative controls included the operating room environment (OE), surgical instruments (SI), and laboratory experimental processes (EP) to elucidate the background contamination at each step. Simultaneously, levels of five cytokines (IL-1β, IL-6, IL-8, TNF-α, MMP-8) in AF were assessed. Results: Among the 62 AF samples, microbial analysis identified seven without microbes and 55 with low microbial diversity and abundance. No significant clinical differences were observed between AF samples with and without microbes. The correlation between microbes and clinical parameters in AF with normal chromosomal structure revealed noteworthy findings. In particular, the third trimester exhibited richer microbial diversity. Pseudomonas demonstrated higher detection rates and relative abundance in the second trimester and Preterm Birth (PTB) groups. S. yanoikuyae in the PTB group exhibited elevated detection frequencies and relative abundance. Notably, Pseudomonas negatively correlated with activated partial thromboplastin time (APTT) (r = -0.329, P = 0.016), while Staphylococcus showed positive correlations with APTT (r = 0.395, P = 0.003). Furthermore, Staphylococcus negatively correlated with birth weight (r = -0.297, P = 0.034). Conclusion: Most AF samples exhibited low microbial diversity and abundance. Certain microbes in AF may correlate with clinical parameters such as gestational age and PTB. However, these associations require further investigation. It is essential to expand the sample size and undertake more comprehensive research to elucidate the clinical implications of microbial presence in AF.

Project description: Background: Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results: Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions: The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.
