Dataset Information

Generation of Gene Ontology benchmark datasets with various types of positive signal.

ABSTRACT:

Background

The analysis of over-represented functional classes in a list of genes is one of the most essential bioinformatics research topics. Typical examples of such lists are the differentially expressed genes from transcriptional analysis, which need to be linked to functional information represented in the Gene Ontology (GO). Despite the importance of this procedure, there is little work on the consistent evaluation of GO analysis methods. In particular, there is no literature on creating benchmark datasets for GO analysis tools.

Results

We propose a methodology for the evaluation of GO analysis tools, which consists of creating gene lists with a selected signal level and a selected number of independent over-represented classes. The methodology starts with a real-life GO data matrix, and therefore the generated datasets have features similar to those of real positive datasets. The user can select the signal level for over-representation, the number of independent positive classes in the dataset, and the size of the final gene list. We present the use of the effective number and various normalizations when embedding the signal into a selected class or classes, and the use of binary correlation to ensure that the selected signal classes are independent of each other. The usefulness of the generated datasets is demonstrated by comparing different GO class ranking and GO clustering methods.
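The general idea of embedding a signal into one class and checking class independence with a binary correlation can be illustrated with a minimal Python sketch. This is not the POSGODA implementation (which is written in Matlab): the function names, the toy gene universe, and the use of a plain signal fraction are assumptions made for illustration, and the effective number and normalization steps mentioned above are not modeled here.

```python
import random


def phi_coefficient(class_a, class_b, universe):
    """Binary (phi) correlation between membership in two gene classes.

    All arguments are sets; both classes are assumed to be subsets of universe.
    """
    n11 = len(class_a & class_b)              # genes in both classes
    n10 = len(class_a - class_b)              # genes only in class_a
    n01 = len(class_b - class_a)              # genes only in class_b
    n00 = len(universe) - n11 - n10 - n01     # genes in neither class
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def make_gene_list(universe, signal_class, list_size, signal_fraction, rng):
    """Sample a gene list in which a chosen fraction comes from signal_class."""
    n_signal = min(round(list_size * signal_fraction), len(signal_class))
    # Sort before sampling so results are reproducible for a given seed.
    signal = rng.sample(sorted(signal_class), n_signal)
    background = sorted(universe - signal_class)
    noise = rng.sample(background, list_size - n_signal)
    return signal + noise


# Hypothetical toy data: 200 genes and three classes of 20 genes each.
universe = {f"g{i}" for i in range(200)}
class_a = {f"g{i}" for i in range(0, 20)}
class_b = {f"g{i}" for i in range(10, 30)}    # overlaps class_a
class_c = {f"g{i}" for i in range(100, 120)}  # disjoint from class_a

print("phi(A, B):", phi_coefficient(class_a, class_b, universe))
print("phi(A, C):", phi_coefficient(class_a, class_c, universe))

gene_list = make_gene_list(universe, class_a, 50, 0.4, random.Random(1))
print("genes from class A:", sum(g in class_a for g in gene_list))
```

In this sketch, a pair of classes with a high phi value (such as A and B) would be a poor choice for two supposedly independent signal classes, whereas a pair with a low or negative value (such as A and C) would be acceptable.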

Conclusion

The presented methods aid the development and evaluation of GO analysis methods as they enable thorough testing with different signal types and different signal levels. As an example, our comparisons reveal clear differences between compared GO clustering and GO de-correlation methods. The implementation is coded in Matlab and is freely available at the dedicated website http://ekhidna.biocenter.helsinki.fi/users/petri/public/POSGODA/POSGODA.html.

SUBMITTER: Toronen P

PROVIDER: S-EPMC2762998 | biostudies-literature |

REPOSITORIES: biostudies-literature


Similar Datasets

Project description:Purpose: To identify distinct gene expression and functional profiles for the three main cell types (epithelial, keratocyte and endothelial) of the human cornea. Methods: RNA-sequencing was performed using total RNA isolated from ex vivo corneal epithelial cells (evCEpC), keratocytes (evK) and endothelial cells (evCEnC) from three donor corneas obtained from a commercial eye bank. Transcriptomic analysis was performed using Kallisto (alignment (hg38) and quantification) and Sleuth (differential gene expression (DGE)), with transcript abundances calculated as transcripts per kilobase million (TPM). Expression was defined as TPM≥7.5 and significant DGE as a fold-change ≥4 and a false-discovery rate adjusted p-value≤0.05. Cell type specificity was defined as genes adhering to the above expression and DGE criteria and not expressed (i.e., TPM<7.5) in the other two cell types. Gene ontology enrichment analysis was performed on the cell type-specific gene lists using the gene group functional profiling (g:GOSt) tool within the web-based g:Profiler suite. Results: We identified 205 genes specific to evCEpC, with enrichment of epithelial-associated GO terms (e.g., cornified envelope, epidermis development, cell-cell adhesion). We identified 76 genes specific to evK, with enrichment of fibroblast-associated GO terms (e.g., collagen metabolic process, metallopeptidase activity, extracellular matrix organization). We identified 96 genes specific to evCEnC, with enrichment of at least one CEnC-associated GO term (e.g., cellular cation homeostasis) but also several synapse-associated GO terms (e.g., synapse organization, synapse part). Conclusions: The human cornea comprises three main cell types that play important roles in maintaining corneal clarity and vision. Our results demonstrate that a small subset (0.4-1%) of the protein-coding genes confers distinct and classic functional properties associated with each cell type. 
In addition, we identified a novel association and significant functional overlap between CEnC and synapses. This may lead to insights into the molecular mechanisms of transendothelial transport and secretion, which are integral metabolic features of the corneal endothelium.
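The cell type specificity rule described above can be written out as a small predicate. This is a sketch under stated assumptions, not the authors' pipeline: the function name and the input layout (a per-cell-type TPM dictionary plus a precomputed fold-change and FDR-adjusted p-value for the cell type of interest) are hypothetical, and only the three thresholds come from the description.

```python
def is_cell_type_specific(tpm, cell_type, fold_change, q_value,
                          tpm_min=7.5, fc_min=4.0, q_max=0.05):
    """Return True if a gene meets the stated specificity criteria.

    tpm: dict mapping cell type name to the gene's TPM in that cell type.
    fold_change, q_value: assumed to be precomputed DGE statistics for
    cell_type (how they are derived is outside this sketch).
    """
    expressed = tpm[cell_type] >= tpm_min
    significant = fold_change >= fc_min and q_value <= q_max
    # The gene must fall below the expression threshold in every other cell type.
    absent_elsewhere = all(
        value < tpm_min for name, value in tpm.items() if name != cell_type
    )
    return expressed and significant and absent_elsewhere


# Hypothetical example values for one gene.
example_tpm = {"evCEpC": 30.0, "evK": 1.0, "evCEnC": 2.0}
print(is_cell_type_specific(example_tpm, "evCEpC", fold_change=8.0, q_value=0.01))
```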

Project description:Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatic tools are means for major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable and actionable identification and classification. Additionally, the pandemic has required public health laboratories to reach high throughput proficiency in sequencing library preparation and downstream data analysis rapidly. However, both processes can be limited by a lack of a standardized sequence dataset. Methods: We identified six SARS-CoV-2 sequence datasets from recent publications, public databases and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we utilized a previously published datasets format, which describes accession information and whole dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study. Results: The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. 
There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2. Discussion: The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.

Project description:Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods. 
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies, and show that such comparisons are feasible and useful; however, there may be limitations due to a lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench.