Dataset Information

ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data.

ABSTRACT:

Background

Identifying significant genes or proteins in gene chip data is an important task for disease diagnosis and drug development. The main challenge is the curse of dimensionality. Machine learning methods are therefore valuable for finding important features in the data and building an accurate classification model.

Results

The proposed method outperformed a published advanced hybrid feature selection method and traditional feature selection methods on several public microarray data sets. In addition, the biomarkers it selected from a clinical cleft lip and palate data set matched those provided by the cooperating hospital.

Method

This paper proposes ILRC, a feature selection algorithm based on clustering and improved L1 regularization (ILR). The features are first clustered, and redundant features within each sub-cluster are deleted. All remaining features are then iteratively evaluated using ILR. The final result is obtained by reordering the features according to their cumulative weights.
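The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not describe the clustering method or how ILR modifies L1 regularization, so the sketch substitutes greedy correlation-based grouping and plain L1-penalized logistic regression refit on bootstrap resamples. All function names, thresholds, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def drop_redundant(columns, labels, threshold=0.9):
    """Greedy correlation clustering that keeps one representative per cluster.

    Assumption: a stand-in for the paper's clustering step, which is not specified.
    """
    # Visit features from most to least label-correlated, so the feature that
    # founds each cluster is also its most label-relevant member.
    order = sorted(range(len(columns)),
                   key=lambda j: -abs(pearson(columns[j], labels)))
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def l1_logistic(rows, labels, lam=0.05, lr=0.1, epochs=200):
    """Plain L1-penalized logistic regression fitted by proximal gradient descent.

    Assumption: ordinary L1 regularization, not the paper's improved variant.
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for i in range(d):
                grad_w[i] += err * x[i]
        w = [soft_threshold(wi - lr * gi / n, lr * lam)
             for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def rank_features(columns, labels, rounds=20, seed=0):
    """Rank non-redundant features by |weight| accumulated over bootstrap fits."""
    rng = random.Random(seed)
    kept = drop_redundant(columns, labels)
    n = len(labels)
    score = {j: 0.0 for j in kept}
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        rows = [[columns[j][i] for j in kept] for i in idx]
        weights = l1_logistic(rows, [labels[i] for i in idx])
        for j, wj in zip(kept, weights):
            score[j] += abs(wj)
    return sorted(kept, key=lambda j: -score[j])


# Toy data: feature 0 is informative, feature 1 is a rescaled copy of it
# (redundant), and feature 2 is unrelated to the labels.
labels = [0, 1] * 10
f0 = [(2 * y - 1) + 0.1 * ((i % 5) - 2) for i, y in enumerate(labels)]
f1 = [2.0 * v for v in f0]
f2 = [1.0, 1.0, -1.0, -1.0] * 5
ranking = rank_features([f0, f1, f2], labels)
```

On the toy data, only one of the two perfectly correlated features survives the redundancy step, and it is ranked ahead of the unrelated feature. Accumulating weights over resamples is one plausible reading of "cumulative weight reordering"; the paper may define it differently.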

Conclusion

The proposed method effectively removes redundant features. Its output shows high stability and classification accuracy, suggesting that the algorithm can select potential biomarkers.

SUBMITTER: Yu K

PROVIDER: S-EPMC8532312 | biostudies-literature | 2021 Oct

REPOSITORIES: biostudies-literature


Publications

ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data.

Yu Kun, Xie Weidong, Wang Linjie, Li Wei

BMC Bioinformatics. 2021 Oct 22; (1)


PMID: 34686127

Similar Datasets

Project description: Background: Biomarker detection presents itself as a major means of translating biological data into clinical applications. Due to the recent advances in high throughput sequencing technologies, an increased number of metagenomics studies have suggested dysbiosis in microbial communities as a potential biomarker for certain diseases. The reproducibility of the results drawn from metagenomic data is crucial for clinical applications and to prevent incorrect biological conclusions. The variability in the sample size and the subjects participating in the experiments induces diversity, which may drastically change the outcome of biomarker detection algorithms. Therefore, a robust biomarker detection algorithm that ensures the consistency of the results irrespective of the natural diversity present in the samples is needed. Results: Toward this end, this paper proposes a novel Regularized Low Rank-Sparse Decomposition (RegLRSD) algorithm. RegLRSD models the bacterial abundance data as a superposition of a sparse matrix and a low-rank matrix, which account for the differentially and non-differentially abundant microbes, respectively. Hence, the biomarker detection problem is cast as a matrix decomposition problem. In order to yield more consistent and solid biological conclusions, RegLRSD incorporates the prior knowledge that the irrelevant microbes do not exhibit significant variation between samples belonging to different phenotypes. Moreover, an efficient algorithm to extract the sparse matrix is proposed. Comprehensive comparisons of RegLRSD with the state-of-the-art algorithms on three realistic datasets are presented. The obtained results demonstrate that RegLRSD consistently outperforms the other algorithms in terms of reproducibility performance and provides a marker list with high classification accuracy. Conclusions: The proposed RegLRSD algorithm for biomarker detection provides high reproducibility and classification accuracy regardless of the dataset complexity and the number of selected biomarkers. This makes RegLRSD a reliable and powerful tool for identifying potential metagenomic biomarkers.

Project description: Motivation: Monitoring, assessment and prediction of environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes. Our objective was to identify biomarker genes that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. Results: We developed a discriminant analysis and cluster (DAC) pipeline to analyze a 248-array dataset. First, a total of 869 significantly changed genes in response to TNT or RDX exposure were inferred by class comparison statistical algorithms. Then, nine decision tree-based algorithms were applied to generate classification rules and a set of 286 classifier genes. These classifier genes were ranked by their overall weight of significance, and were used to build support vector machines (SVMs). An SVM containing all 286 classifier genes had the highest classification accuracy (91.5%). An unsupervised clustering method was used to cluster the worm samples and results show that the use of the top 100 classifier genes can assign the largest number of worm samples into the three reference clusters obtained by using all the 14,188 filtered genes, suggesting that these top-ranked genes may be potential candidates for biomarkers. This study demonstrates that the DAC pipeline can be used to identify a small set of biomarker genes from high dimensional datasets and generate a reliable SVM classification model for multiple classes. Adult earthworms (E. fetida) were exposed in soil spiked with TNT (0, 6, 12, 24, 48, or 96 mg/kg) or RDX (8, 16, 32, 64, or 128 mg/kg) for 0, 4 or 14 days. The 4-day treatment was repeated with RDX concentration being 2, 4, 8, 16 or 32 mg/kg soil. Each treatment originally had 10 replicate worms with 8-10 survivors at the end of exposure. Total RNA was isolated from the surviving worms. A total of 248 worm RNA samples were hybridized to a custom-designed oligo array using Agilent's one-color Low RNA Input Linear Amplification Kit. The array contains 15,208 non-redundant 60-mer probes, each targeting a unique E. fetida transcript (Gong et al. 2009). After hybridization and scanning, gene expression data were acquired using Agilent's Feature Extraction Software (v.9.1.3). The 248-array dataset consists of three worm groups: 32 untreated controls, 96 TNT-treated, and 120 RDX-treated.

Project description: Background: Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). Given such two-class high-dimensional data, many methods have been proposed for classifying data points into one of the two classes. However, finding very small sets of features able to correctly classify the data is problematic as the fundamental mathematical proposition is hard. Existing methods can find "small" feature sets, but give no hint how close this is to the true minimum size. Without fundamental mathematical advances, finding true minimum-size sets will remain elusive, and more importantly for the microarray community there will be no methods for finding them. Results: We use the brute force approach of exhaustive search through all genes, gene pairs (and for some data sets gene triples). Each unique gene combination is analyzed with a few-parameter linear-hyperplane classification method looking for those combinations that form training error-free classifiers. All 10 published data sets studied are found to contain predictive small feature sets. Four contain thousands of gene pairs and 6 have single genes that perfectly discriminate. Conclusion: This technique discovered small sets of genes (3 or fewer) in published data that form accurate classifiers, yet were not reported in the prior publications. This could be a common characteristic of microarray data, thus making looking for them worth the computational cost. Such small gene sets could indicate biomarkers and portend simple medical diagnostic tests. We recommend checking for small gene sets routinely. We find 4 gene pairs and many gene triples in the large hepatocellular carcinoma (HCC, liver cancer) data set of Chen et al. The key component of these is the "placental gene of unknown function", PLAC8. Our HMM modeling indicates PLAC8 might have a domain like part of 1P59's crystal structure (a non-covalent endonuclease III-DNA complex). The previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets that distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia cancer set of Yeoh et al.
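A minimal Python sketch of the exhaustive search described above, limited to single genes and gene pairs. The abstract does not name its few-parameter hyperplane classifier, so this sketch assumes a simple threshold test for single genes and the perceptron rule for pairs. The function names, the epoch cap, and the toy expression values are hypothetical.

```python
from itertools import combinations


def single_gene_separates(values, labels):
    """True if one threshold on this gene splits the two classes without error.

    Assumes both classes (labels 0 and 1) are present.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find a training error-free hyperplane with the perceptron rule.

    True means a separating hyperplane was found. False only means none was
    found within max_epochs, which is not a proof of non-separability.
    """
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            t = 1 if y == 1 else -1
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def exhaustive_search(genes, labels):
    """Return single genes and gene pairs that classify the training data perfectly."""
    singles = [g for g in sorted(genes)
               if single_gene_separates(genes[g], labels)]
    pairs = []
    for g1, g2 in combinations(sorted(genes), 2):
        # Design choice of this sketch: skip pairs containing a gene that is
        # already perfect on its own, since those pairs are trivially separable.
        if g1 in singles or g2 in singles:
            continue
        if perceptron_separates(list(zip(genes[g1], genes[g2])), labels):
            pairs.append((g1, g2))
    return singles, pairs


# Toy expression values for four samples (two per class); gene names are made up.
labels = [1, 1, 0, 0]
genes = {
    "A": [0, 3, 2, 5],  # classes overlap on this gene alone
    "B": [2, 5, 0, 3],  # classes overlap alone, but B - A separates them
    "C": [9, 8, 1, 2],  # separates the classes on its own
    "D": [0, 3, 3, 0],  # not separating alone or paired with A or B
}
singles, pairs = exhaustive_search(genes, labels)
print(singles, pairs)  # ['C'] [('A', 'B')]
```

The number of pairs grows quadratically with the number of genes and the number of triples cubically, which is the computational cost the abstract refers to.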

