Dataset Information

A systematic evaluation of pattern discovery algorithms

ABSTRACT: Pattern discovery algorithms are methods for discovering recurrent, non-random motifs and are widely used in the analysis of biological sequences. Many algorithms exist but few comparisons have been made amongst them. We systematically profile eight representative methods at multiple parameter settings across 174 diverse experimental datasets, including ten novel ChIP-on-chip datasets. We executed 16,777 pattern discovery analyses to assess prediction accuracy, CPU usage and memory consumption. For 144 datasets we developed a gold standard using machine-learning algorithms; cross-validation was used for the remaining datasets. Performance was highly disparate, with median accuracy ranging from 32% to 96%. Importantly, we were unable to replicate previously reported algorithm rankings, emphasizing the need to use many and diverse experimental datasets. We found deterministic algorithms like Projection and Oligo/Dyad had the highest prediction accuracy. Computational efficiency was not linearly related to dataset size and becomes critical: some algorithms are intractably slow on large datasets. This work provides the first combined assessment of the CPU, memory, and prediction accuracies of pattern discovery algorithms on real experimental datasets.

ORGANISM(S): Homo sapiens

PROVIDER: GSE15370 | GEO | 2009/11/24

SECONDARY ACCESSION(S): PRJNA117101

REPOSITORIES: GEO

Similar Datasets

A systematic evaluation of pattern discovery algorithms

Project description:Pattern discovery algorithms are methods for discovering recurrent, non-random motifs and are widely used in the analysis of biological sequences. Many algorithms exist but few comparisons have been made amongst them. We systematically profile eight representative methods at multiple parameter settings across 174 diverse experimental datasets, including ten novel ChIP-on-chip datasets. We executed 16,777 pattern discovery analyses to assess prediction accuracy, CPU usage and memory consumption. For 144 datasets we developed a gold standard using machine-learning algorithms; cross-validation was used for the remaining datasets. Performance was highly disparate, with median accuracy ranging from 32% to 96%. Importantly, we were unable to replicate previously reported algorithm rankings, emphasizing the need to use many and diverse experimental datasets. We found deterministic algorithms like Projection and Oligo/Dyad had the highest prediction accuracy. Computational efficiency was not linearly related to dataset size and becomes critical: some algorithms are intractably slow on large datasets. This work provides the first combined assessment of the CPU, memory, and prediction accuracies of pattern discovery algorithms on real experimental datasets. HL60-Mnt-ChIP: ChIP-Chip with 10 biological replicates. HL60-Trrap-ChIP: ChIP-Chip with 13 biological replicates.

2010-05-19 | E-GEOD-15370 | biostudies-arrayexpress

Homo sapiens

Project description:A systematic evaluation of pattern discovery algorithms

| PRJNA117101 | ENA

Novel automated workflow for spectral alignment and mass calibration in MS imaging using a sputtered Ag nanolayer

Project description:Mass spectrometry imaging (MSI) is a technique that can map analyte spatial distribution directly onto a tissue section. This enables the spatial correlation of molecular entities with tissue morphology to be investigated. Analyte annotation in MSI is intrinsically linked to the mass accuracy of the data. Mass accuracy and analyte identification are determined by such factors as the experimental setup and the data processing workflow. We present an MSI data processing workflow that uses a label-free approach to compensate for mass shifts. The algorithms developed were designed to perform efficiently even for datasets much larger than the computer's memory. Herein, we present the application of the developed processing workflow to a dataset with more than 13,000 pixels and ∼50,000 mass channels. We assessed the overall mass accuracy in the range m/z 400 to 1200 using silver and gold sputtered nanolayers. With our novel processing workflow we were able to obtain mass errors as low as 5 ppm using a TOF instrument.

2022-01-20 | MTBLS587 | MetaboLights

Effects of sample size on differential gene expression, rank order and prediction accuracy of a gene signature

Project description:Gene expression profiles were generated from muscle biopsies from 134 individuals, and differences in expression based on sex were explored. Top differentially expressed gene lists are often inconsistent between studies and it has been suggested that small sample sizes contribute to lack of reproducibility and poor prediction accuracy in discriminative models. We considered sex differences (69♂, 65♀) in 134 human skeletal muscle biopsies using DNA microarray. The full dataset and subsamples (n= 10 (5♂, 5♀) to n=120 (60♂, 60♀)) thereof were used to assess the effect of sample size on the differential expression of single genes, gene rank order and prediction accuracy. Using our full dataset (n=134), we identified 717 differentially expressed transcripts (p-value < 0.0001; false discovery rate < 0.006) and we were able to predict sex with 92% accuracy, both within our dataset and on external datasets. Both p-values and rank order of top differentially expressed genes became more variable using smaller subsamples. For example, at n=10 (5♂, 5♀), no gene was considered differentially expressed at p<0.0001 and prediction accuracy was ~50% (no better than chance). We found that sample size clearly affects microarray analysis results; small sample sizes result in unstable gene lists and poor prediction accuracy. We anticipate this will apply to other phenotypes, in addition to sex. RNA was isolated from 134 muscle samples. Gene expression is compared between males and females.

2014-05-15 | E-GEOD-41726 | biostudies-arrayexpress

devCellPy: A machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data

Project description:A major informatic challenge in single cell RNA-sequencing analysis is the precise annotation of datasets where cells exhibit complex multilayered identities or transitory states. Here, we present devCellPy, a highly accurate and precise machine learning-enabled tool that enables automated prediction of cell types across complex annotation hierarchies. To demonstrate the power of devCellPy, we construct a murine cardiac developmental atlas from published datasets encompassing 104,199 cells from E6.5-E16.5 and train devCellPy to generate a cardiac prediction algorithm. Using this algorithm, we observe a high prediction accuracy (>90%) across multiple layers of annotation and across de novo murine developmental data. Furthermore, we conduct a cross-species prediction of cardiomyocyte subtypes from in vitro-derived human induced pluripotent stem cells and unexpectedly uncover a predominance of left ventricular (LV) identity that we confirmed by an LV-specific TBX5 lineage tracing system. Together, our results show devCellPy to be a powerful tool for automated cell prediction across complex cellular hierarchies, species, and experimental systems.

2022-08-14 | GSE184943 | GEO

Effects of sample size on differential gene expression, rank order and prediction accuracy of a gene signature

Project description:Gene expression profiles were generated from muscle biopsies from 134 individuals, and differences in expression based on sex were explored. Top differentially expressed gene lists are often inconsistent between studies and it has been suggested that small sample sizes contribute to lack of reproducibility and poor prediction accuracy in discriminative models. We considered sex differences (69♂, 65♀) in 134 human skeletal muscle biopsies using DNA microarray. The full dataset and subsamples (n= 10 (5♂, 5♀) to n=120 (60♂, 60♀)) thereof were used to assess the effect of sample size on the differential expression of single genes, gene rank order and prediction accuracy. Using our full dataset (n=134), we identified 717 differentially expressed transcripts (p-value < 0.0001; false discovery rate < 0.006) and we were able to predict sex with 92% accuracy, both within our dataset and on external datasets. Both p-values and rank order of top differentially expressed genes became more variable using smaller subsamples. For example, at n=10 (5♂, 5♀), no gene was considered differentially expressed at p<0.0001 and prediction accuracy was ~50% (no better than chance). We found that sample size clearly affects microarray analysis results; small sample sizes result in unstable gene lists and poor prediction accuracy. We anticipate this will apply to other phenotypes, in addition to sex.

2014-05-15 | GSE41726 | GEO

Machine learning for discovery: deciphering RNA splicing logic

Project description:Machine learning methods, particularly neural networks trained on large datasets, are transforming how scientists approach scientific discovery and experimental design. However, current state-of-the-art neural networks are limited by their uninterpretability: despite providing accurate predictions, they cannot describe how they arrived at their predictions. Here, using an "interpretable-by-design" approach, we present a neural network model that provides insights into RNA splicing, a fundamental process in the transfer of genomic information into functional biochemical products. Although we designed our model to emphasize interpretability, its predictive accuracy is on par with state-of-the-art models. To demonstrate the model's interpretability, we introduce a visualization that, for any given exon, allows us to trace and quantify the entire decision process from input sequence to output splicing prediction. Importantly, the model revealed novel components of the splicing logic, which we experimentally validated. This study highlights how interpretable machine learning can advance scientific discovery.

2022-10-01 | GSE200096 | GEO

Application of de novo sequencing to large-scale complex proteomics datasets

Project description:Dependent on concise, pre-defined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large scale proteomics datasets, and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) which leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.

2016-01-12 | PXD003317 | Pride

EpiPanGI Dx: A cell-free DNA methylation fingerprint for the early detection of gastrointestinal cancers

Project description:Owing to high cancer-specificity, DNA methylation alterations have emerged as front-runners in cell-free DNA (cf-DNA) biomarker development. However, much effort to date has focused on single cancers and has not explored the possibility of developing a pan-cancer diagnostic assay. Here, we undertook a genome-wide DNA methylation analysis for multiple gastrointestinal (GI) cancers, to develop a panGI diagnostic assay. By analyzing the DNA methylation data from 1940 tumor and adjacent normal tissues from TCGA and GSE72872 datasets, we first identified the differentially methylated regions (DMRs) between individual GI cancers and adjacent normal tissues, as well as across all GI cancers. We next prioritized a list of 67,832 tissue DMRs encompassing a 25.6 Mb genomic region by incorporating all significant DMRs across various GI cancers to design a custom SeqCap Epi, targeted bisulfite sequencing platform. Subsequent investigation of these tissue-specific DMRs in 300 cf-DNA specimens and applying machine learning algorithms led to the development of three distinct categories of DMR panels: 1) Cancer-specific biomarker panels with AUC values of 0.98 (Colorectal cancer, CRC), 0.94 (Esophageal squamous cell carcinoma, ESCC), 0.90 (Esophageal adenocarcinoma, EAC), 0.90 (Gastric cancer, GC), 0.98 (Hepatocellular carcinoma, HCC), and 0.85 (Pancreatic ductal adenocarcinoma, PDAC); 2) A pan-GI panel that detected all GI cancers with an AUC of 0.90; and 3) A multi-cancer prediction panel, EpiPanGI Dx, with a prediction accuracy around 0.85-0.95 for most GI cancers. Utilizing a novel biomarker discovery approach, we provide the first evidence for a cell-free DNA methylation biomarker assay that offers robust diagnostic accuracy for all gastrointestinal cancers.

2021-08-27 | GSE149438 | GEO

Improved prediction of endogenous HLA-associated epitopes based on mono-allelic mass spectrometry profiling

Project description:LC-MS/MS-based identification of HLA-peptides is poised to provide a deep understanding of the rules underlying antigen presentation. However, a key obstacle limiting the utility of MS data is the ambiguity arising from the co-expression of multiple HLA alleles. Here, we introduce a strategy for profiling the HLA ligandome one allele at a time. By using cell lines expressing a single HLA allele, optimizing immunopurifications, and developing a novel spectral search algorithm, we identified thousands of peptides bound to 16 different HLA class I alleles. These data enabled the discovery of novel binding motifs, and an integrative analysis quantifying the contribution of factors critical to epitope presentation, such as protein cleavage and gene expression. We trained neural network prediction algorithms with our large dataset (>24,000 peptides) and outperformed algorithms trained on datasets of peptides with measured affinities. We thus demonstrate a scalable strategy for systematically learning the rules of endogenous antigen presentation.

2017-02-21 | GSE93315 | GEO
