Dataset Information

Some Statistical Strategies for DAE-seq Data Analysis: Variable Selection and Modeling Dependencies among Observations.

ABSTRACT: In DAE (DNA After Enrichment)-seq experiments, genomic regions related to certain biological processes are enriched/isolated by an assay and are then sequenced on a high-throughput sequencing platform to determine their genomic positions. Statistical analysis of DAE-seq data aims to detect genomic regions with significant aggregations of isolated DNA fragments ("enriched regions") versus all the other regions ("background"). However, many confounding factors may influence DAE-seq signals. In addition, the signals in adjacent genomic regions may exhibit strong correlations, which invalidate the independence assumption employed by many existing methods. To mitigate these issues, we develop a novel Autoregressive Hidden Markov Model (AR-HMM) to account for covariate effects and violations of the independence assumption. We demonstrate that our AR-HMM leads to improved performance in identifying enriched regions in both simulated and real datasets, especially in epigenetic datasets with broader regions of DAE-seq signal enrichment. We also introduce a variable selection procedure in the context of the HMM/AR-HMM, where the observations are not independent and the mean value of each state-specific emission distribution is modeled by covariates. We study the theoretical properties of this variable selection procedure and demonstrate its efficacy in simulated and real DAE-seq data. In summary, we develop several practical approaches for DAE-seq data analysis that are also applicable to more general problems in statistics.

SUBMITTER: Rashid NU

PROVIDER: S-EPMC3963211 | biostudies-literature | 2014 Jan

REPOSITORIES: biostudies-literature

Publications

Some Statistical Strategies for DAE-seq Data Analysis: Variable Selection and Modeling Dependencies among Observations.

Naim U Rashid, Wei Sun, Joseph G Ibrahim

Journal of the American Statistical Association. 2014-01-01; 505

PMID: 24678134

Similar Datasets

Analysis of survival data with cure fraction and variable selection: A pseudo-observations approach. | S-EPMC9660265 | biostudies-literature

LTMG: a novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data. | S-EPMC6765121 | biostudies-literature

glmgraph: an R package for variable selection and predictive modeling of structured genomic data. | S-EPMC4692967 | biostudies-literature

MOCHA's advanced statistical modeling of scATAC-seq data enables functional genomic inference in large human cohorts. | S-EPMC11316085 | biostudies-literature

Enhancing site selection strategies in clinical trial recruitment using real-world data modeling. | S-EPMC10927105 | biostudies-literature

Penalized variable selection in multi-parameter regression survival modeling. | S-EPMC10710000 | biostudies-literature

Variable selection in microbiome compositional data analysis. | S-EPMC7671404 | biostudies-literature

Optimized variable selection via repeated data splitting. | S-EPMC8547352 | biostudies-literature

Variable selection strategies and its importance in clinical prediction modelling. | S-EPMC7032893 | biostudies-literature

Statistical analysis of genetic interactions in Tn-Seq data. | S-EPMC5499643 | biostudies-literature