Dataset Information

A distance-type measure approach to the analysis of copy number variation in DNA sequencing data.


ABSTRACT:

Background

Next-generation sequencing technology allows us to obtain a large number of short DNA sequence (DNA-seq) reads at a genome-wide level. DNA-seq data have been increasingly collected in recent years. Count-type data analysis is a widely used approach for DNA-seq data. However, the related data pre-processing is based on the moving window method, in which a window size needs to be defined in order to obtain count-type data. Furthermore, useful information can be lost when the data are pre-processed into counts.

Results

In this study, we propose analyzing DNA-seq data with a distance-type measure: the distance, in base pairs (bp), between two adjacent alignments of short reads mapped to a reference genome. Our simulation study, which is based on experimental data, confirms the advantages of the distance-type measure approach in both detection power and detection accuracy. Furthermore, we propose artificial censoring of the distance data, so that distances larger than a given value are treated as potential outliers. Our purpose is to simplify the pre-processing of DNA-seq data. Statistically, we model the distance data with a mixture of right-censored geometric distributions. Additionally, to reduce GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). The model can be estimated with the Newton-Raphson algorithm as well as the Expectation-Maximization (E-M) algorithm. We have conducted simulations to evaluate the performance of our approach. Applying the rank-based inverse normal transformation to the distance data yields z-values for follow-up analysis. As an illustration, we present an application to DNA-seq data from a pair of normal and tumor cell lines, with a change-point analysis of the z-values to detect DNA copy number alterations.
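The pipeline described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. The geometric distribution is assumed to have support 0, 1, 2, ... with P(D = d) = p(1 - p)^d. The mixture is fitted by a plain E-M iteration with a closed-form update, without the GLM extension for GC content. The rank offset of 0.5 in the inverse normal transformation is one common convention, chosen here for simplicity.

```python
from statistics import NormalDist


def adjacent_distances(positions):
    """Distances (bp) between consecutive alignment start positions."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]


def censor(distances, c):
    """Artificial right censoring: keep min(d, c) and flag d > c."""
    return [(min(d, c), d > c) for d in distances]


def fit_censored_geometric_mixture(obs, c, p_init, n_iter=300):
    """E-M for a mixture of right-censored geometrics (assumed form).

    An uncensored d contributes p * (1 - p)**d; a censored observation
    contributes the tail probability P(D > c) = (1 - p)**(c + 1).
    Probabilities are multiplied directly, so this suits moderate c only.
    """
    k = len(p_init)
    p = list(p_init)
    w = [1.0 / k] * k
    for _ in range(n_iter):
        events = [0.0] * k    # weighted count of uncensored observations
        failures = [0.0] * k  # weighted exponents of (1 - p)
        totals = [0.0] * k    # summed responsibilities per component
        for d, is_cens in obs:
            expo = c + 1 if is_cens else d
            lik = [
                w[j] * (1.0 if is_cens else p[j]) * (1.0 - p[j]) ** expo
                for j in range(k)
            ]
            s = max(sum(lik), 1e-300)  # guard against underflow to zero
            for j in range(k):
                r = lik[j] / s  # E-step: responsibility of component j
                totals[j] += r
                failures[j] += r * expo
                if not is_cens:
                    events[j] += r
        # M-step: closed-form updates of the weights and success probabilities
        w = [t / len(obs) for t in totals]
        p = [events[j] / max(events[j] + failures[j], 1e-300) for j in range(k)]
    return w, p


def rank_inverse_normal(values, offset=0.5):
    """Rank-based inverse normal transformation; ties get average ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - offset) / n) for r in ranks]


if __name__ == "__main__":
    positions = [105, 100, 131, 100, 460, 132]  # toy alignment start positions
    d = adjacent_distances(positions)
    print(censor(d, c=200))
    print(rank_inverse_normal(d))
```

The resulting z-values could then be passed to any change-point method. The censoring threshold `c` and the number of mixture components are tuning choices that this sketch leaves to the user.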

Conclusion

Our distance-type measure approach is novel. It requires neither a fixed nor a sliding window procedure to generate count-type data. Its advantages have been demonstrated by our simulation studies, and its practical usefulness has been illustrated by an application to experimental data.

SUBMITTER: Biswas B 

PROVIDER: S-EPMC6456939 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC6829143 | biostudies-literature
| S-EPMC5909048 | biostudies-other
| S-EPMC3084615 | biostudies-literature
| S-EPMC3549847 | biostudies-literature
| S-EPMC3514678 | biostudies-literature
| S-EPMC3563612 | biostudies-literature
| S-EPMC4147927 | biostudies-literature
| S-EPMC6260772 | biostudies-literature
| S-EPMC8406611 | biostudies-literature
| S-EPMC3219132 | biostudies-literature