Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.
Ontology highlight
ABSTRACT: RNA-Sequencing has been the leading technology to quantify expression of thousands of genes simultaneously. The data analysis of an RNA-Seq experiment starts from aligning short reads to the reference genome/transcriptome or reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of an genome. One group of these homologies is the paralog pseudogenes. Pseudogenes arise from duplication of a set of protein coding genes, and have been considered as degraded paralogs in the genome due to their lost of functionality. Recent studies have provided evidence to support their novel regulatory roles in biological processes. With the growing interests in quantifying the expression level of pseudogenes at different tissues or cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate the expression level among homologous genes. Previously in PseudoLasso, we proposed a linear regression approach to learn read alignment behaviors, and to leverage this knowledge for abundance estimation and alignment correction. In this paper, we extend the work of PseudoLasso by grouping the homologous genomic regions into different communities using a community detection algorithm, followed by building a linear regression model separately for each community. The results show that this approach is able to retain the same accuracy as PseudoLasso. By breaking the genome into smaller homologous communities, the running time is improved from quadratic growth to linear with respect to the number of genes.
SUBMITTER: Ju CJ
PROVIDER: S-EPMC5514313 | biostudies-literature | 2017 May-Jun
REPOSITORIES: biostudies-literature
ACCESS DATA