Dataset Information

Improving the Quality of Positive Datasets for the Establishment of Machine Learning Models for pre-microRNA Detection.

ABSTRACT: MicroRNAs (miRNAs) are involved in the post-transcriptional regulation of protein abundance and thus have a great impact on the resulting phenotype. It is, therefore, no wonder that they have been implicated in many diseases ranging from virus infections to cancer. This impact on the phenotype leads to a great interest in establishing the miRNAs of an organism. Experimental methods are complicated which led to the development of computational methods for pre-miRNA detection. Such methods generally employ machine learning to establish models for the discrimination between miRNAs and other sequences. Positive training data for model establishment, for the most part, stems from miRBase, the miRNA registry. The quality of the entries in miRBase has been questioned, though. This unknown quality led to the development of filtering strategies in attempts to produce high quality positive datasets which can lead to a scarcity of positive data. To analyze the quality of filtered data we developed a machine learning model and found it is well able to establish data quality based on intrinsic measures. Additionally, we analyzed which features describing pre-miRNAs could discriminate between low and high quality data. Both models are applicable to data from miRBase and can be used for establishing high quality positive data. This will facilitate the development of better miRNA detection tools which will make the prediction of miRNAs in disease states more accurate. Finally, we applied both models to all miRBase data and provide the list of high quality hairpins.

SUBMITTER: Demirci MDS

PROVIDER: S-EPMC6042829 | biostudies-other | 2017 Jul

REPOSITORIES: biostudies-other

Publications

Improving the Quality of Positive Datasets for the Establishment of Machine Learning Models for pre-microRNA Detection.

Demirci, Müşerref Duygu Saçar; Allmer, Jens

Journal of Integrative Bioinformatics, 2017-07-28, issue 2

PMID: 28753538

Similar Datasets

Delineating the impact of machine learning elements in pre-microRNA detection.
S-EPMC5374968 | biostudies-literature

Gait can reveal sleep quality with machine learning models.
S-EPMC6760789 | biostudies-literature

Improving Machine Learning Diabetes Prediction Models for the Utmost Clinical Effectiveness.
S-EPMC9698354 | biostudies-literature

Different scaling of linear models and deep learning in UKBiobank brain images versus machine-learning datasets.
S-EPMC7447816 | biostudies-literature

Improving the Efficacy of Deep-Learning Models for Heart Beat Detection on Heterogeneous Datasets.
S-EPMC8698903 | biostudies-literature

Machine learning-based detection of immune-mediated diseases from genome-wide cell-free DNA sequencing datasets
2022-09-14 | E-MTAB-11607 | biostudies-arrayexpress

Pre-existing and machine learning-based models for cardiovascular risk prediction.
S-EPMC8076166 | biostudies-literature

Machine learning models for predicting pre-eclampsia: a systematic review protocol.
S-EPMC10496701 | biostudies-literature

The impact of imputation quality on machine learning classifiers for datasets with missing values.
S-EPMC10558448 | biostudies-literature

Predictive models for anti-tubercular molecules using machine learning on high-throughput biological screening datasets.
S-EPMC3228709 | biostudies-literature