Dataset Information

Global pentapeptide statistics are far away from expected distributions.

ABSTRACT: The relationships between polypeptide composition, sequence, structure and function have been puzzling biologists ever since the first protein sequences were determined. Here, we study the statistics of occurrence of all possible pentapeptide sequences in known proteins. To compensate for the non-uniform distribution of individual amino acid residues in protein sequences, we investigate separately all possible permutations of every given amino acid composition. For the majority of permutation groups we find that pentapeptide occurrences deviate strongly from the expected binomial distributions, and that the observed distributions are also characterized by high numbers of outlier sequences. An analysis of identified outliers shows they often contain known motifs and rare amino acids, suggesting that they represent important functional elements. We further compare the pentapeptide composition of regions known to correspond to protein domains with that of non-domain regions. We find that a substantial number of pentapeptides are strongly favored in protein domains. Finally, we show that over-represented pentapeptides are significantly related to known functional motifs and to predicted ancient structural peptides.
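
The permutation-group comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pentapeptide_counts`, `distinct_permutations`, `permutation_group_zscores`) are hypothetical, and the null model is an assumption made here for illustration, namely that within a group of pentapeptides sharing one amino acid composition every distinct ordering is equally likely, so each count follows a binomial distribution with success probability 1 / (number of distinct orderings).

```python
from collections import Counter, defaultdict
from math import factorial, sqrt


def pentapeptide_counts(sequences, k=5):
    """Count every overlapping k-residue window in the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def distinct_permutations(peptide):
    """Number of distinct orderings of a peptide's residues (multinomial)."""
    n = factorial(len(peptide))
    for multiplicity in Counter(peptide).values():
        n //= factorial(multiplicity)
    return n


def permutation_group_zscores(counts):
    """Z-score of each observed peptide against its permutation group.

    Assumed null model (illustrative only): all distinct orderings of a
    composition are equally likely, so a peptide's count is
    Binomial(N, 1 / n_perm), with N the total count in its group.
    Orderings that were never observed are not listed in the result.
    """
    groups = defaultdict(dict)
    for peptide, count in counts.items():
        # Peptides with the same sorted residues share a composition.
        groups["".join(sorted(peptide))][peptide] = count

    zscores = {}
    for composition, members in groups.items():
        n_perm = distinct_permutations(composition)
        if n_perm == 1:
            continue  # e.g. "AAAAA": only one ordering, nothing to compare
        total = sum(members.values())
        p = 1.0 / n_perm
        mean = total * p
        sd = sqrt(total * p * (1.0 - p))
        for peptide, count in members.items():
            zscores[peptide] = (count - mean) / sd
    return zscores
```

Peptides with large positive z-scores would be candidates for the over-represented outliers the abstract mentions; a real analysis would also need to handle unobserved orderings, multiple testing, and sequence redundancy in the protein database.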

SUBMITTER: Poznanski J

PROVIDER: S-EPMC6181984 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description:Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N = 82,315) and putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We assess the degree to which effect sizes are over-estimated when based on linear-regression association coefficients. 
We estimate the polygenicity of schizophrenia to be 0.037 and the putamen to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.

Project description:Individuals usually develop a sense of place through lived experiences or travel. Here we introduce new and innovative tools to measure sense of place for remote, far-away locations, such as Greenland. We apply this methodology within place-based education to study whether we can distinguish a sense of place between those who have visited Greenland or are otherwise strongly connected to the place from those who never visited. Place-based education research indicates that an increased sense of place has a positive effect on learning outcomes. Thus, we hypothesize that vicarious experiences with a place result in a measurably stronger sense of place when compared to the sense of place of those who have not experienced the place directly. We studied two distinct groups: the first are people with a strong Greenland connection (experts, n = 93); the second are students who have never been there (novices, n = 142). Using i) emotional value attribution of words, ii) thematic analysis of phrases and iii) categorization of words, we show significant differences between novices' and experts' use of words and phrases to describe Greenland as a proxy of sense of place. Emotional value of words revealed statistically significant differences between experts and novices in word power (dominance), feeling pleasantness (valence), and degree of arousal evoked by the word. While both groups have an overall positive impression of Greenland, 31% of novices express a neutral view with little to no awareness of Greenland (experts 4% neutral). We found differences between experts and novices along dimensions such as natural features; cultural attributes; people of Greenland; concerns, importance, or interest in and feeling connected to Greenland. Experts exhibit more complex place attributes, frequently using emotional words, while novices present a superficial picture of Greenland.
Engaging with virtual environments may shift novice learners toward a more expert-like sense of place for far-away places like Greenland; we therefore suggest that virtual field trips can supplement geoscience teaching of concepts in far-away places like Greenland and beyond.

Project description:Background: Spurious associations between single nucleotide polymorphisms and phenotypes are a major issue in genome-wide association studies and have led to underestimation of type 1 error rate and overestimation of the number of quantitative trait loci found. Many authors have investigated the influence of population structure on the robustness of methods by simulation. This paper is aimed at developing further the algebraic formalization of power and type 1 error rate for some of the classical statistical methods used: simple regression, two approximate methods of mixed models involving the effect of a single nucleotide polymorphism (SNP) and a random polygenic effect (GRAMMAR and FASTA) and the transmission/disequilibrium test for quantitative traits and nuclear families. Analytical formulae were derived using matrix algebra for the first and second moments of the statistical tests, assuming a true mixed model with a polygenic effect and SNP effects. Results: The expectation and variance of the test statistics and their marginal expectations and variances according to the distribution of genotypes and estimators of variance components are given as a function of the relationship matrix and of the heritability of the polygenic effect. These formulae were used to compute type 1 error rate and power for any kind of relationship matrix between phenotyped and genotyped individuals for any level of heritability. For the regression method, type 1 error rate increased with the variability of relationships and with heritability, but decreased with the GRAMMAR method and was not affected with the FASTA and quantitative transmission/disequilibrium test methods. Conclusions: The formulae can be easily used to provide the correct threshold of type 1 error rate and to calculate the power when designing experiments or data collection protocols.
The results concerning the efficacy of each method agree with simulation results in the literature but were generalized in this work. The power of the GRAMMAR method was equal to the power of the FASTA method at the same type 1 error rate. The power of the quantitative transmission/disequilibrium test was low. In conclusion, the FASTA method, which is very close to the full mixed model, is recommended in association mapping studies.


