Dataset Information

PCA2GO: a new multivariate statistics based method to identify highly expressed GO-Terms.

ABSTRACT:

Background

Several tools have been developed to explore and search Gene Ontology (GO) databases, allowing efficient GO enrichment analysis and GO tree visualization. Nevertheless, identifying highly specific GO terms in complex data sets is relatively complicated, and displaying GO term assignments and GO enrichment analysis as simple tables or pie charts is not optimal. Valuable information, such as the hierarchical position of a single GO term within the GO tree (topological ordering) or its enrichment within a complex set of biological experiments, is not displayed. Pie charts based on GO tree levels are themselves one-dimensional graphs, which cannot properly or efficiently represent the hierarchical specificity for the biological system being studied.

Results

Here we present a new method, which we name PCA2GO, capable of GO analysis using complex multidimensional experimental settings. We employed principal component analysis (PCA) and developed a new score, which takes into account the relative frequency of certain GO terms and their specificity (hierarchical position) within the GO graph. We evaluated the correlation between our representation score R and a standard measure of enrichment, namely p-values, to relate our approach to other methods and to point out differences between our method and commonly used enrichment analyses. Although p-values and the R score formally measure different quantities, they should be correlated, because the relative frequencies of GO term occurrences within a dataset are an indirect measure of the number of proteins related to each term and are therefore also related to enrichment. We showed that our score enables us to identify more specific GO terms, i.e. those positioned further down the GO graph, than other common tools used for this purpose. PCA2GO allows visualization and detection of multidimensional dependencies both within the acyclic graph (GO tree) and across the experimental settings. Unlike standard enrichment tools, our method is intended for the analysis of several experimental sets rather than a single set. To demonstrate the usefulness of our approach, we performed a PCA2GO analysis of a fractionated cardiomyocyte protein dataset, which was identified by gel-enhanced liquid chromatography-mass spectrometry (GeLC-MS). The analysis enabled us to detect distinct groups of proteins, which accurately reflect properties of the biochemical cell fractions.
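The general idea of running a PCA over GO term frequencies from several experimental sets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the term labels, depths and counts are invented toy values, and the depth weighting is a hypothetical stand-in, not the R score defined by the authors (the abstract does not give its formula).

```python
import numpy as np

def relative_frequencies(counts):
    """Turn a (terms x experiments) count matrix into per-experiment relative frequencies."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)  # total annotations per experiment
    return counts / totals

def pca(matrix, n_components=2):
    """PCA via SVD; rows are observations (GO terms), columns are variables (experiments)."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # term coordinates on the components
    explained = (s ** 2) / np.sum(s ** 2)            # fraction of variance per component
    return scores, vt[:n_components], explained[:n_components]

# Toy input (invented): 4 placeholder GO terms counted in 3 cell fractions.
terms = ["term_1", "term_2", "term_3", "term_4"]
depths = np.array([2, 5, 3, 6])  # hypothetical depth of each term in the GO graph
counts = np.array([
    [30,  5,  4],
    [ 2, 20,  3],
    [10, 10, 10],
    [ 1,  2, 25],
])

rel_freq = relative_frequencies(counts)

# Hypothetical weighting: favour deeper (more specific) terms.
# This is NOT the published R score, only a placeholder for the idea.
weighted = rel_freq * depths[:, None]

scores, components, explained = pca(weighted, n_components=2)

for term, (pc1, pc2) in zip(terms, scores):
    print(f"{term}: PC1={pc1:.3f}, PC2={pc2:.3f}")
print("explained variance ratio:", np.round(explained, 3))
```

In such a plot of the first two components, terms that dominate the same fraction tend to fall close together, which is the kind of grouping the abstract describes.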

Conclusions

We conclude that PCA2GO is an efficient alternative GO analysis tool with unique features for the detection and visualization of multidimensional dependencies within the dataset under study. PCA2GO reveals strongly correlated GO terms within the experimental setting (in this case, different fractions) through PCA group formation, and it detects more specific GO terms within experiment-dependent GO term groups than standard p-value calculations do.

SUBMITTER: Bruckskotten M

PROVIDER: S-EPMC2910024 | biostudies-literature | 2010 Jun

REPOSITORIES: biostudies-literature

Publications

PCA2GO: a new multivariate statistics based method to identify highly expressed GO-Terms.

Bruckskotten M, Looso M, Cemiĉ F, Konzer A, Hemberger J, Krüger M, Braun T

BMC Bioinformatics, 2010-06-21

PMID: 20565932

Similar Datasets

Project description: BACKGROUND: Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numbers of query proteins to obtain their annotated GO terms. An accurate method for predicting the subcellular localization of novel proteins without known accession numbers, using only the input sequence, is worth developing. RESULTS: This study proposes an efficient sequence-based method (named ProLoc-GO) that mines informative GO terms for predicting protein subcellular localization. For each protein, BLAST is used to find a homolog with a known accession number, which is then used to retrieve the GO annotation. A large number n of all annotated GO terms that have ever appeared is then obtained from a large set of training proteins. A novel genetic-algorithm-based method (named GOmining) combined with a support vector machine (SVM) classifier is proposed to simultaneously identify a small number m out of the n GO terms as input features to the SVM, where m << n. The m informative GO terms contain the essential GO terms annotating subcellular compartments, such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). Two existing data sets, SCL12 (human proteins with 12 locations) and SCL16 (eukaryotic proteins with 16 locations), with < 25% sequence identity are used to evaluate ProLoc-GO, which has been implemented using a single SVM classifier with m = 44 and m = 60 informative GO terms, respectively. ProLoc-GO using input sequences yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively, which are significantly better than the SVM-based methods, which achieve < 35% test accuracies using amino acid composition (AAC) with acid pairs and AAC with dipeptide composition. For comparison, ProLoc-GO using known accession numbers of query proteins yields test accuracies of 90.6% and 85.7%, which are also better than Hum-PLoc (85.0%) and Euk-OET-PLoc (83.7%), which use ensemble classifiers with hybridization of GO terms and amphiphilic pseudo amino acid composition, for SCL12 and SCL16, respectively. CONCLUSION: The growth of Gene Ontology in size and popularity has increased the effectiveness of GO-based features. GOmining can serve as a tool for selecting informative GO terms in solving sequence-based prediction problems. The prediction system using ProLoc-GO with input sequences of query proteins for protein subcellular localization has been implemented (see Availability).

Project description: Many tests can crudely quantify age-related mobility decrease, but instrumented versions of mobility tests could increase their specificity and sensitivity. The Timed-up-and-Go (TUG) test includes several elements that people use in daily life. The test has different transition phases: rise from a chair, walk, 180° turn, walk back, turn, and sit down on a chair. For this reason, the TUG is often used to evaluate, in a standardized way, possible decline in balance and walking ability due to age and/or pathology. Using inertial sensors, qualitative information about the performance of the sub-phases can provide more specific information about a decline in balance and walking ability. The first aim of our study was to identify variables extracted from the instrumented timed-up-and-go (iTUG) that most effectively distinguished performance differences across age (age 18-75). Second, we determined the discriminative ability of those identified variables to classify a younger (age 18-45) and an older age group (age 46-75). From healthy adults (n = 59), trunk accelerations and angular velocities were recorded during iTUG performance. iTUG phases were detected with wavelet analysis. Using a Partial Least Squares (PLS) model, from the 72 iTUG variables calculated across phases, those that explained most of the covariance between variables and age were extracted. Subsequently, a PLS-discriminant analysis (PLS-DA) assessed the classification power of the identified iTUG variables to discriminate the age groups. 27 variables, related to turning, walking and the stand-to-sit movement, explained 71% of the variation in age. The PLS-DA with these 27 variables showed a sensitivity and specificity of 90% and 85%. Based on this model, the iTUG can accurately distinguish young and older adults. Such data can serve as a reference for pathological aging with respect to a widely used mobility test. Mobility tests like the TUG supplemented with smart technology could be used in clinical practice.
