Dataset Information

Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants.

ABSTRACT: Disease- and trait-associated variants represent a tiny minority of all known genetic variation, so there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants face a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian diseases and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
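The general idea of combining resampling with an ensemble can be illustrated with a minimal sketch. This is not the authors' method or software: it assumes plain random undersampling of the majority class (the abstract does not say which resampling techniques are used), a toy nearest-centroid base learner, and synthetic two-feature data. All function names and parameters (`train_ensemble`, `ratio`, `n_members`) are hypothetical.

```python
import random
from statistics import mean

def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    return [mean(col) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_member(positives, negatives, ratio, rng):
    """Fit one member on all positives plus a random subset of negatives.

    Assumption: at most `ratio` negatives per positive are kept, so each
    member sees a far more balanced training set than the full data.
    """
    n_neg = min(len(negatives), ratio * len(positives))
    sampled = rng.sample(negatives, n_neg)
    return centroid(positives), centroid(sampled)

def member_score(model, x):
    """Score in [0, 1]; higher means closer to the positive centroid."""
    pos_c, neg_c = model
    d_pos, d_neg = sq_dist(x, pos_c), sq_dist(x, neg_c)
    if d_pos + d_neg == 0:
        return 0.5
    return d_neg / (d_pos + d_neg)

def train_ensemble(positives, negatives, n_members=10, ratio=2, seed=0):
    """Train members on independently undersampled negative subsets."""
    rng = random.Random(seed)
    return [train_member(positives, negatives, ratio, rng)
            for _ in range(n_members)]

def ensemble_score(models, x):
    """Average the member scores to get the ensemble prediction."""
    return mean(member_score(m, x) for m in models)

# Synthetic, heavily imbalanced toy data: 5 positives vs. 500 negatives.
data_rng = random.Random(1)
pos = [[data_rng.gauss(2, 0.3), data_rng.gauss(2, 0.3)] for _ in range(5)]
neg = [[data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)] for _ in range(500)]

models = train_ensemble(pos, neg)
print(ensemble_score(models, [2.0, 2.0]))  # query near the positive cluster
print(ensemble_score(models, [0.0, 0.0]))  # query near the negative cluster
```

Each member discards most negatives, but because every member draws a different subset, the ensemble as a whole still uses much of the majority class while no single learner is swamped by it.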

SUBMITTER: Schubach M

PROVIDER: S-EPMC5462751 | biostudies-literature | 2017 Jun

REPOSITORIES: biostudies-literature

Publications

Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants.

Schubach M, Re M, Robinson PN, Valentini G

Scientific Reports, 2017-06-07 (issue 1)

PMID: 28592878

Similar Datasets

Common, low-frequency, rare, and ultra-rare coding variants contribute to COVID-19 severity.
S-EPMC8661833 | biostudies-literature

Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease.
S-EPMC11781350 | biostudies-literature

A cell type-aware framework for nominating non-coding variants in Mendelian regulatory disorders.
2024-08-06 | GSE254090 | GEO

Glycemic-aware metrics and oversampling techniques for predicting blood glucose levels using machine learning.
S-EPMC6886807 | biostudies-literature

Sequencing rare and common APOL1 coding variants to determine kidney disease risk.
S-EPMC4591109 | biostudies-literature

A cell type-aware framework for nominating non-coding variants in Mendelian regulatory disorders.
S-EPMC11436875 | biostudies-literature

A cell type-aware framework for nominating non-coding variants in Mendelian regulatory disorders.
S-EPMC10793524 | biostudies-literature

Unraveling the role of non-coding rare variants in epilepsy.
S-EPMC10529579 | biostudies-literature

Investigation of Rare Non-Coding Variants in Familial Multiple Myeloma.
S-EPMC9818386 | biostudies-literature

Rare Coding Variants in Patients with Non-Syndromic Vestibular Dysfunction.
S-EPMC10137884 | biostudies-literature