
Dataset Information


Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes.


ABSTRACT: Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector representation for diseases and applies the vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. The representation consists of two sub-vectors. The first has 45 elements that characterize the information entropies of the disease gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while others (e.g., the chromosome 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional and characterizes the distribution of disease-gene-enriched KEGG pathways in comparison with our manually created pathway groups; it complements the first in differentiating between diseases. Our prediction method outperforms state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
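To make the first sub-vector more concrete, here is a minimal Python sketch of one way such an entropy-based vector could be built. The abstract does not give the exact formula, so this is an assumption: each element is taken to be the Shannon entropy term -p * log2(p), where p is the fraction of a disease's known genes located in a given chromosome substructure. The function name `entropy_terms`, the substructure labels, and the toy gene list are all hypothetical, and only four substructures are used instead of the 45 in the paper.

```python
import math
from collections import Counter


def entropy_terms(gene_locations, substructures):
    """Return one value per substructure: -p * log2(p).

    ASSUMPTION: p is the fraction of the disease's genes that fall in the
    substructure. The paper's actual definition may differ.
    """
    counts = Counter(gene_locations)
    total = sum(counts[s] for s in substructures)
    vector = []
    for s in substructures:
        p = counts[s] / total if total else 0.0
        # By convention, a substructure with no genes contributes 0.
        vector.append(-p * math.log2(p) if p > 0 else 0.0)
    return vector


# Hypothetical toy data: four substructures, four disease genes.
substructures = ["1p", "1q", "6p", "21p"]
genes = ["6p", "6p", "1q", "1p"]
vec = entropy_terms(genes, substructures)
print(vec)
```

In this toy case the substructure with no genes ("21p") gets a zero entry, while the others get nonzero entries that depend on how the disease's genes are spread across them.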

SUBMITTER: Peng H 

PROVIDER: S-EPMC5668007 | biostudies-other | 2017 Oct

REPOSITORIES: biostudies-other


Publications

Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes.

Peng Hui, Lan Chaowang, Liu Yuansheng, Liu Tao, Blumenstein Michael, Li Jinyan

Oncotarget, 2017-08-24, issue 45



Similar Datasets

| S-EPMC2773258 | biostudies-literature
| S-EPMC3089475 | biostudies-other
| S-EPMC9332188 | biostudies-literature
2022-02-25 | PXD005291 | Pride
| S-EPMC2662882 | biostudies-literature
| S-EPMC2603021 | biostudies-literature
| S-EPMC3441527 | biostudies-literature
| S-EPMC2422843 | biostudies-literature
| S-EPMC9278741 | biostudies-literature
| S-EPMC5042011 | biostudies-literature