Dataset Information

Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.

ABSTRACT: RNA-Sequencing has been the leading technology to quantify expression of thousands of genes simultaneously. The data analysis of an RNA-Seq experiment starts from aligning short reads to the reference genome/transcriptome or reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of a genome. One group of these homologous regions is the paralogous pseudogenes. Pseudogenes arise from duplication of a set of protein-coding genes, and have been considered degraded paralogs in the genome due to their loss of functionality. Recent studies have provided evidence to support their novel regulatory roles in biological processes. With growing interest in quantifying the expression level of pseudogenes in different tissues or cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate the expression level among homologous genes. Previously, in PseudoLasso, we proposed a linear regression approach to learn read alignment behaviors, and to leverage this knowledge for abundance estimation and alignment correction. In this paper, we extend the work of PseudoLasso by grouping the homologous genomic regions into different communities using a community detection algorithm, followed by building a linear regression model separately for each community. The results show that this approach is able to retain the same accuracy as PseudoLasso. By breaking the genome into smaller homologous communities, the running time is improved from quadratic to linear growth with respect to the number of genes.
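
The two-stage idea in the abstract (group homologous regions into communities, then fit one regression per community) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: connected components over homology links stand in for the unspecified community detection algorithm, the hypothetical `frac` table holds the fraction of reads from a source gene that align to each region, and a small non-negative Lasso solved by coordinate descent stands in for the regression model.

```python
from collections import defaultdict


def find_communities(nodes, edges):
    """Group regions into connected components with union-find.

    Stand-in only: the paper uses a community detection algorithm that
    the abstract does not name.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


def nonneg_lasso(A, y, lam=0.0, n_iter=200):
    """Minimise 0.5 * ||y - A x||^2 + lam * sum(x) subject to x >= 0
    by cyclic coordinate descent."""
    rows, cols = len(A), len(A[0])
    x = [0.0] * cols
    for _ in range(n_iter):
        for j in range(cols):
            col_sq = sum(A[i][j] ** 2 for i in range(rows))
            if col_sq == 0.0:
                continue
            # Correlation of column j with the residual that ignores x[j].
            rho = sum(
                A[i][j]
                * (y[i] - sum(A[i][k] * x[k] for k in range(cols) if k != j))
                for i in range(rows)
            )
            x[j] = max(0.0, (rho - lam) / col_sq)
    return x


def estimate_abundance(nodes, edges, frac, observed, lam=0.0):
    """Fit one small regression per community instead of one global model."""
    estimates = {}
    for community in find_communities(nodes, edges):
        # Rows are regions where reads were observed, columns are source genes.
        A = [[frac.get((region, source), 0.0) for source in community]
             for region in community]
        y = [observed.get(region, 0.0) for region in community]
        for gene, value in zip(community, nonneg_lasso(A, y, lam)):
            estimates[gene] = value
    return estimates


# Hypothetical toy data: geneA and pseudoA are homologous, geneB is not.
nodes = ["geneA", "pseudoA", "geneB"]
edges = [("geneA", "pseudoA")]
frac = {
    ("geneA", "geneA"): 0.9, ("pseudoA", "geneA"): 0.1,      # reads from geneA
    ("geneA", "pseudoA"): 0.2, ("pseudoA", "pseudoA"): 0.8,  # reads from pseudoA
    ("geneB", "geneB"): 1.0,
}
observed = {"geneA": 94.0, "pseudoA": 26.0, "geneB": 40.0}
communities = find_communities(nodes, edges)
estimates = estimate_abundance(nodes, edges, frac, observed)
```

In this toy setup the observed counts were generated from 100 reads of geneA and 20 reads of pseudoA, so the per-community fit should recover values close to those. Because each regression only involves the members of one community, the cost of a fit depends on community size rather than on the total number of genes, which is the intuition behind the running-time improvement described in the abstract.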

SUBMITTER: Ju CJ

PROVIDER: S-EPMC5514313 | biostudies-literature | 2017 May-Jun

REPOSITORIES: biostudies-literature

Publications

Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.

Ju CJ, Zhao Z, Wang W

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016-07-14, issue 3

PMID: 27429446

Similar Datasets

Hobbes: optimized gram-based methods for efficient read alignment.
S-EPMC3315303 | biostudies-literature

RNA-seq read alignment evaluation
2013-07-15 | E-MTAB-1728 | biostudies-arrayexpress

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.
S-EPMC7185338 | biostudies-literature

RNA-Seq alignment to individualized genomes improves transcript abundance estimates in multiparent populations.
S-EPMC4174954 | biostudies-literature

Improving PacBio long read accuracy by short read alignment.
S-EPMC3464235 | biostudies-literature

Fast and accurate read alignment for resequencing.
S-EPMC3436849 | biostudies-literature

Fast gapped-read alignment with Bowtie 2.
S-EPMC3322381 | biostudies-literature

Performance optimization in DNA short-read alignment.
S-EPMC10060706 | biostudies-literature

Gene-pseudogene evolution: a probabilistic approach.
S-EPMC4602177 | biostudies-literature

RNA-seq read alignment evaluation
PRJEB4265 | ENA