Dataset Information

A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data.

ABSTRACT: Cancer classification is a topic of major interest in medicine since it allows accurate and efficient diagnosis and facilitates a successful outcome in medical treatments. Previous studies have classified human tumors using large-scale RNA profiling and supervised Machine Learning (ML) algorithms to construct a molecular-based classification of carcinoma cells from breast, bladder, adenocarcinoma, colorectal, gastro esophagus, kidney, liver, lung, ovarian, pancreas, and prostate tumors. These datasets are collectively known as the 11_tumor database. Although this database has been used in several works in the ML field, no comparative studies of different algorithms can be found in the literature. On the other hand, advances in both hardware and software technologies have fostered considerable improvements in the precision of solutions that use ML, such as Deep Learning (DL). In this study, we compare the most widely used algorithms in classical ML and DL to classify the tumors described in the 11_tumor database. We obtained tumor identification accuracies between 90.6% (Logistic Regression) and 94.43% (Convolutional Neural Networks) using k-fold cross-validation. We also show how a tuning process may or may not significantly improve the algorithms' accuracies. Our results demonstrate an efficient and accurate classification method based on gene expression (microarray data) and ML/DL algorithms, which facilitates tumor type prediction in a multi-cancer-type scenario.
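
The abstract reports accuracies estimated with k-fold cross-validation. As a rough illustration of that evaluation protocol only, the sketch below runs k-fold cross-validation in plain Python. Everything in it is an assumption made for illustration: the nearest-centroid classifier is a simple stand-in rather than one of the algorithms compared in the study, the toy data are synthetic rather than the 11_tumor database, and the function names (`k_fold_indices`, `cross_val_accuracy`, `make_toy_data`) are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (assumption): store the mean vector of each class."""
    rows_by_class = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in rows_by_class.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_val_accuracy(X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(centroids, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)


def make_toy_data(n_per_class=30, n_features=8, n_classes=3, seed=1):
    """Synthetic 'expression' vectors: one Gaussian cluster per class (not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        for _ in range(n_per_class):
            X.append([rng.gauss(3.0 * label, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"mean 5-fold accuracy: {cross_val_accuracy(X, y, k=5):.3f}")
```

Every sample is used for testing exactly once, so the averaged score does not depend on a single train/test split. With many tumor classes of unequal size, a stratified split that preserves class proportions in each fold would likely be a better choice than the plain shuffle used here.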

SUBMITTER: Tabares-Soto R

PROVIDER: S-EPMC7924492 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Analysis of Expression Pattern of snoRNAs in Different Cancer Types with Machine Learning Algorithms.

Project description:Small nucleolar RNAs (snoRNAs) are a new type of functional small RNAs involved in the chemical modifications of rRNAs, tRNAs, and small nuclear RNAs. It is reported that they play important roles in tumorigenesis via various regulatory modes. snoRNAs can both participate in the regulation of methylation and pseudouridylation and regulate the expression pattern of their host genes. This research investigated the expression pattern of snoRNAs in eight major cancer types in TCGA via several machine learning algorithms. The expression levels of snoRNAs were first analyzed by a powerful feature selection method, Monte Carlo feature selection (MCFS). A feature list and some informative features were obtained. Then, incremental feature selection (IFS) was applied to the feature list to extract the optimal features/snoRNAs, which allow the support vector machine (SVM) to yield its best performance. The discriminative snoRNAs included HBII-52-14, HBII-336, SNORD123, HBII-85-29, HBII-420, U3, HBI-43, SNORD116, SNORA73B, SCARNA4, HBII-85-20, etc., on which the SVM can provide a Matthews correlation coefficient (MCC) of 0.881 for predicting these eight cancer types. On the other hand, the informative features were fed into the Johnson reducer and repeated incremental pruning to produce error reduction (RIPPER) algorithms to generate classification rules, which can clearly show different snoRNA expression patterns in different cancer types. The analysis results indicated that the extracted discriminative snoRNAs can be important for identifying cancer samples of different types and that the expression pattern of snoRNAs in different cancer types can be partly uncovered by quantitative recognition rules.

| S-EPMC6539089 | biostudies-literature
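
The snoRNA entry above reports performance as a Matthews correlation coefficient (MCC) over eight cancer types. For readers unfamiliar with the metric, here is a minimal sketch of the standard multiclass generalization of MCC, computed from per-class counts. It is not taken from that study; the function name `multiclass_mcc` and the example labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from true and predicted labels.

    Uses the count-based form:
        (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the number of samples, c the number of correct predictions,
    t_k the number of samples truly in class k, and p_k the number predicted as k.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = sqrt((s * s - sum(v * v for v in pred_counts.values()))
                 * (s * s - sum(v * v for v in true_counts.values())))
    # By convention, return 0.0 when the score is undefined
    # (for example, when every sample is predicted as the same class).
    return numerator / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical labels for three tumor types, with one misclassification.
    truth = ["lung", "lung", "breast", "breast", "kidney", "kidney"]
    guess = ["lung", "lung", "breast", "kidney", "kidney", "kidney"]
    print(round(multiclass_mcc(truth, guess), 3))
```

Unlike plain accuracy, MCC stays near zero for a classifier that always predicts the majority class, which is why it is often preferred when cancer types are unevenly represented.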

Deep Learning Algorithms Correctly Classify Brassica rapa Varieties Using Digital Images.

Project description:Efficient and accurate methods of analysis are needed for the huge amount of biological data that have accumulated in various research fields, including genomics, phenomics, and genetics. Artificial intelligence (AI)-based analysis is one promising method to manipulate biological data. To this end, various algorithms have been developed and applied in fields such as disease diagnosis, species classification, and object prediction. In the field of phenomics, classification of accessions and variants is important for basic science and industrial applications. To construct AI-based classification models, three types of phenotypic image data were generated from 156 Brassica rapa core collections, and classification analyses were carried out using four different convolutional neural network architectures. The results of lateral view data showed higher accuracy compared with top view data. Furthermore, the relatively low accuracy of ResNet50 architecture suggested that definition and estimation of similarity index of phenotypic data were required before the selection of deep learning architectures.

| S-EPMC8511822 | biostudies-literature

Implementing Machine Learning Algorithms to Classify Postures and Forecast Motions When Using a Dynamic Chair.

Project description:Many modern jobs require long periods of sitting on a chair, which may result in serious health complications. Dynamic chairs are proposed as alternatives to traditional chairs; however, previous studies have suggested that most users are not aware of their postures and do not take advantage of the increased range of motion offered by dynamic chairs. Building a system that identifies users' postures in real time, as well as forecasts the next few postures, can bring awareness to the sitting behavior of each user. In this study, machine learning algorithms have been implemented to automatically classify users' postures and forecast their next motions. The random forest, gradient decision tree, and support vector machine algorithms were used to classify postures. The evaluation of the trained classifiers indicated that they could successfully identify users' postures with an accuracy above 90%. The algorithm can provide users with an accurate report of their sitting habits. A 1D-convolutional-LSTM network has also been implemented to forecast users' future postures based on their previous motions; the model can forecast a user's motions with high accuracy (97%). The ability of the algorithm to forecast future postures could be used to suggest alternative postures as needed.

| S-EPMC8749632 | biostudies-literature

Machine Learning Analysis of Physical Activity Data to Classify Postural Dysfunction.

Project description:Background: Machine learning (ML) analysis of biometric data in non-controlled environments is underexplored. Objective: To evaluate whether ML analysis of physical activity data can be employed to classify whether individuals have postural dysfunction in middle-aged and older individuals. Methods: A 1 week period of physical activity was measured by a waist-worn uni-axial accelerometer during the 2003-2004 National Health and Nutrition Examination Survey sampling period. Features of physical activity along with basic demographic information (42 variables) were paired with ML models to predict the success or failure of a standard 30 s modified Romberg test during which participants had their eyes closed and stood upon a 3-inch compliant surface. Model performance was evaluated by area under the receiver operating characteristic curve (AUC-ROC), balanced accuracy, and F1-score. Results: The cohort was comprised of 1625 participants ≥40 years (median age 61, IQR 51-71). Approximately half (47%) were diagnosed with postural dysfunction having failed the binarized (pass/fail) scoring mechanism of the modified Romberg exam. Five ML models were trained on the classification task, achieving AUC values ranging from 0.67 to 0.73. The support vector machine (SVM) and a gradient-boosted model, XGBoost, achieved the highest AUC of 0.73 (SD 0.71-0.75). Age was the most important variable for SVM classification, followed by four features that evaluated accelerometer counts at various thresholds, including those delineating total, moderate, and moderate-vigorous activity. Conclusions: ML analysis of accelerometer-derived physical activity data to classify postural dysfunction in middle-aged and older individuals is feasible in real-world environments such as the home. Level of evidence: 3. Laryngoscope, 133:3529-3533, 2023.

| S-EPMC10589386 | biostudies-literature

Using multiple machine learning algorithms to classify elite and sub-elite goalkeepers in professional men's football.

Project description:This study applied multiple machine learning algorithms to classify the performance levels of professional goalkeepers (GKs). Technical performances of GKs competing in the elite divisions of England, Spain, Germany, and France were analysed in order to determine which factors distinguish elite GKs from sub-elite GKs. A total of 14,671 player-match observations were analysed via multiple machine learning algorithms (MLAs): Logistic Regressions (LR), Gradient Boosting Classifiers (GBC) and Random Forest Classifiers (RFC). The results revealed 15 common features across the three MLAs, pertaining to the actions of passing and distribution, that distinguished goalkeepers performing at the elite level from those that do not. Specifically, short distribution, passing the ball successfully, receiving passes successfully, and keeping clean sheets were all revealed to be common traits of GKs performing at the elite level. Moderate to high accuracy was reported across all the MLAs for the training data, LR (0.7), RFC (0.82) and GBC (0.71), and for the testing data, LR (0.67), RFC (0.66) and GBC (0.66). Ultimately, the results of this study suggest that a GK's ability with their feet, and not necessarily their hands, is what distinguishes the elite GKs from the sub-elite.

| S-EPMC8609025 | biostudies-literature

Machine learning algorithms reveal unique gene expression profiles in muscle biopsies from patients with different types of myositis.

Project description:Objectives: Myositis is a heterogeneous family of diseases that includes dermatomyositis (DM), antisynthetase syndrome (AS), immune-mediated necrotising myopathy (IMNM), inclusion body myositis (IBM), polymyositis and overlap myositis. Additional subtypes of myositis can be defined by the presence of myositis-specific autoantibodies (MSAs). The purpose of this study was to define unique gene expression profiles in muscle biopsies from patients with MSA-positive DM, AS and IMNM as well as IBM. Methods: RNA-seq was performed on muscle biopsies from 119 myositis patients with IBM or defined MSAs and 20 controls. Machine learning algorithms were trained on transcriptomic data and recursive feature elimination was used to determine which genes were most useful for classifying muscle biopsies into each type and MSA-defined subtype of myositis. Results: The support vector machine learning algorithm classified the muscle biopsies with >90% accuracy. Recursive feature elimination identified genes that are most useful to the machine learning algorithm and that are only overexpressed in one type of myositis. For example, CAMK1G (calcium/calmodulin-dependent protein kinase IG), EGR4 (early growth response protein 4) and CXCL8 (interleukin 8) are highly expressed in AS but not in DM or other types of myositis. Using the same computational approach, we also identified genes that are uniquely overexpressed in different MSA-defined subtypes. These included apolipoprotein A4 (APOA4), which is only expressed in anti-3-hydroxy-3-methylglutaryl-CoA reductase (HMGCR) myopathy, and MADCAM1 (mucosal vascular addressin cell adhesion molecule 1), which is only expressed in anti-Mi2-positive DM. Conclusions: Unique gene expression profiles in muscle biopsies from patients with MSA-defined subtypes of myositis and IBM suggest that different pathological mechanisms underlie muscle damage in each of these diseases.

| S-EPMC10461844 | biostudies-literature

Prediction of lung tumor types based on protein attributes by machine learning algorithms.

Project description:Early diagnosis of lung cancers and distinction between the tumor types (Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC)) are very important to increase the survival rate of patients. Herein, we propose a diagnostic system based on sequence-derived structural and physicochemical attributes of proteins that are involved in both types of tumors, via feature extraction, feature selection and prediction models. A total of 1497 protein attributes were computed, and important features were selected by 12 attribute weighting models. Finally, machine learning models consisting of seven SVM models, three ANN models and two NB models were applied to the original database and to newly created ones from the attribute weighting models; model accuracies were calculated through 10-fold cross-validation and wrapper validation (just for SVM algorithms). In line with our previous findings, dipeptide composition, autocorrelation and distribution descriptor were the most important protein features selected by bioinformatics tools. The algorithms' performance in lung cancer tumor type prediction increased when they were applied to datasets created by attribute weighting models rather than to the original dataset. Wrapper-Validation performed better than X-Validation; the best cancer type prediction resulted from SVM and SVM Linear models (82%). The best ANN accuracy was obtained when the Neural Net model was applied to the SVM dataset (88%). This is the first report suggesting that the combination of protein features and attribute weighting models with machine learning algorithms can be effectively used to predict the type of lung cancer tumors (SCLC and NSCLC).

| S-EPMC3710575 | biostudies-literature

Combining Multiple RNA-Seq Data Analysis Algorithms Using Machine Learning Improves Differential Isoform Expression Analysis.

Project description:RNA sequencing has become the standard technique for high resolution genome-wide monitoring of gene expression. As such, it often comprises the first step towards understanding complex molecular mechanisms driving various phenotypes, spanning organ development to disease genesis, monitoring and progression. An advantage of RNA sequencing is its ability to capture complex transcriptomic events such as alternative splicing, which results in alternate isoform abundance. At the same time, this advantage remains algorithmically and computationally challenging, especially with the emergence of even higher resolution technologies such as single-cell RNA sequencing. Although several algorithms have been proposed for the effective detection of differential isoform expression from RNA-Seq data, no widely accepted gold standards have been established. This fact is further compounded by the significant differences in the output of different algorithms when applied to the same data. In addition, many of the proposed algorithms remain scarce and poorly maintained. Driven by these challenges, we developed a novel integrative approach that effectively combines the most widely used algorithms for differential transcript and isoform analysis using state-of-the-art machine learning techniques. We demonstrate its usability by applying it to simulated data based on several organisms, and using several performance metrics; we conclude that our strategy outperforms the application of the individual algorithms. Finally, our approach is implemented as an R Shiny application, with the underlying data analysis pipelines also available as docker containers.

| S-EPMC8544431 | biostudies-literature

Gene selection algorithms for microarray data based on least squares support vector machine.

Project description:Background: In discriminant analysis of microarray data, usually a small number of samples are expressed by a large number of genes. It is not only difficult but also unnecessary to conduct the discriminant analysis with all the genes. Hence, gene selection is usually performed to select important genes. Results: A gene selection method searches for an optimal or near optimal subset of genes with respect to a given evaluation criterion. In this paper, we propose a new evaluation criterion, named the leave-one-out calculation (LOOC) measure. A gene selection method, named leave-one-out calculation sequential forward selection (LOOCSFS) algorithm, is then presented by combining the LOOC measure with the sequential forward selection scheme. Further, a novel gene selection algorithm, the gradient-based leave-one-out gene selection (GLGS) algorithm, is also proposed. Both of the gene selection algorithms originate from an efficient and exact calculation of the leave-one-out cross-validation error of the least squares support vector machine (LS-SVM). The proposed approaches are applied to two microarray datasets and compared to other well-known gene selection methods using codes available from the second author. Conclusion: The proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to the existing methods. The GLGS algorithm can also better scale to datasets with a very large number of genes.

| S-EPMC1409801 | biostudies-literature

Detection call algorithms for high-throughput gene expression microarray data.

Project description:Extensive methodological research has been conducted to improve gene expression summary methods. However, in addition to quantitative gene expression summaries, most platforms, including all those examined in the MicroArray Quality Control project, provide a qualitative detection call result for each gene on the platform. These detection call algorithms are intended to render an assessment of whether or not each transcript is reliably measured. In this paper, we review uses of these qualitative detection call results in the analysis of microarray data. We also review the detection call algorithms for two widely used gene expression microarray platforms, Affymetrix GeneChips and Illumina BeadArrays, and more clearly formalize the mathematical notation for the Illumina BeadArray detection call algorithm. Both algorithms result in a P-value which is then used for determining the qualitative detection calls. We examined the performance of these detection call algorithms and default parameters by applying the methods to two spike-in datasets. We show that the default parameters for qualitative detection calls yield few absent calls for high spike-in concentrations. When genes of interest are expected to be present at very low concentrations, spike-in datasets can be useful for appropriately adjusting the tuning parameters for qualitative detection calls.

| S-EPMC4110453 | biostudies-literature
