Dataset Information

Methods for a similarity measure for clinical attributes based on survival data analysis.

ABSTRACT: BACKGROUND: Case-based reasoning is a proven method that relies on learned cases from the past for decision support of a new case. The accuracy of such a system depends on the applied similarity measure, which quantifies the similarity between two cases. This work proposes a collection of methods for similarity measures, especially for the comparison of clinical cases based on survival data, as available, for example, from clinical trials.

METHODS: Our approach is intended for scenarios in which longitudinal data, such as survival data, are to be used for case-based reasoning. This may be especially important where uncertainty about the ideal therapy decision exists. The collection of methods consists of definitions of the local similarity of nominal and numeric attributes, a calculation of attribute weights, a feature selection method, and finally a global similarity measure. All of them use survival time (consisting of survival status and overall survival) as the reference for similarity. As a baseline, we calculate a survival function for each value of any given clinical attribute.

RESULTS: We define the similarity between values of the same attribute by putting the estimated survival functions in relation to each other. We then quantify the similarity by determining the area between the corresponding survival curves. The proposed global similarity measure is designed especially for cases from randomized clinical trials or other collections of clinical data with survival information. Overall survival can be considered an eligible alternative solution for similarity calculations. It is especially useful when similarity measures that depend on the classic solution-describing attribute "applied therapy" are not applicable. This is often the case for data from clinical trials containing randomized arms.

CONCLUSIONS: In silico evaluation scenarios showed that the mean accuracy of biomarker detection in the k = 10 most similar cases is higher (0.909-0.998) than for competing similarity measures, such as the Heterogeneous Euclidean-Overlap Metric (0.657-0.831) and the Discretized Value Difference Metric (0.535-0.671). The weight calculation method assigned weights more than six times higher (6.59-6.95) to biomarker attributes than to non-biomarker attributes. These results suggest that the similarity measure described here is suitable for applications based on survival data.
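The idea of comparing two attribute values through the area between their survival curves can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Kaplan-Meier estimator is assumed as the survival function, and the function names, the fixed `horizon`, and the normalization `1 - area / horizon` are illustrative assumptions that the abstract does not specify.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier steps [(t, S(t))]; events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            steps.append((t, surv))
        n_at_risk -= removed
    return steps


def surv_at(steps, t):
    """Value of the right-continuous step function at time t."""
    s = 1.0
    for ti, si in steps:
        if ti <= t:
            s = si
        else:
            break
    return s


def area_between(steps_a, steps_b, horizon):
    """Area between two step curves on [0, horizon]."""
    grid = sorted({0.0, horizon} | {t for t, _ in steps_a + steps_b if t < horizon})
    area = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both curves are constant on [left, right).
        area += abs(surv_at(steps_a, left) - surv_at(steps_b, left)) * (right - left)
    return area


def local_similarity(steps_a, steps_b, horizon):
    """Assumed normalization: 1 for identical curves, 0 for maximal separation."""
    return 1.0 - area_between(steps_a, steps_b, horizon) / horizon


# Hypothetical survival data for two values of one clinical attribute.
curve_a = kaplan_meier([1, 2, 3], [1, 1, 1])
curve_b = kaplan_meier([2, 4], [1, 1])
similarity = local_similarity(curve_a, curve_b, horizon=4)
```

In this sketch, identical curves yield a similarity of 1, and the value falls as the curves diverge. The choice of horizon affects the result, so it would need to be fixed consistently across all attribute values being compared.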

SUBMITTER: Karmen C

PROVIDER: S-EPMC6805472 | biostudies-literature | 2019 Oct

REPOSITORIES: biostudies-literature

Publications

Methods for a similarity measure for clinical attributes based on survival data analysis.

Karmen C, Gietzelt M, Knaup-Gregori P, Ganzinger M

BMC Medical Informatics and Decision Making, 21 October 2019, issue 1

PMID: 31638963

Similar Datasets

Project description: Background: Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for effective processing of these data has not been equally fast. In particular, the Machine Learning literature is limited to relatively few papers focused on the development and application of data mining methods for the analysis of genetic variability. Moreover, these papers apply to genetic data procedures that were developed for a different kind of analysis and do not take into account the peculiarities of population genetics. The aim of our study was to define a new similarity measure, specifically conceived for measuring the similarity between the genetic profiles of two groups of subjects (i.e., cases and controls), taking into account that genetic profiles are usually distributed in a population group according to the Hardy-Weinberg equilibrium. Results: We set up a new kernel function consisting of a similarity measure between groups of subjects genotyped for numerous genetic loci. This measure weighs different genetic profiles according to the estimates of gene frequencies at Hardy-Weinberg equilibrium in the population. We named this function the "Hardy-Weinberg kernel". The effectiveness of the Hardy-Weinberg kernel was compared to the performance of the well-established linear kernel. We found that the Hardy-Weinberg kernel significantly outperformed the linear kernel in a number of experiments where we used either simulated data or real data. Conclusion: The "Hardy-Weinberg kernel" reported here represents one of the first attempts at incorporating genetic knowledge into the definition of a kernel function designed for the analysis of genetic data.
We show that the best performance of the "Hardy-Weinberg kernel" is observed when rare genotypes have different frequencies in cases and controls. The ability to capture the effect of rare genotypes on phenotypic traits might be a very important and useful feature, as most current statistical tools lose most of their statistical power when rare genotypes are involved in the susceptibility to the trait under study.
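As background for the weighting described above, the following Python sketch computes the genotype frequencies expected at Hardy-Weinberg equilibrium for a biallelic locus. It does not implement the kernel itself, whose definition is not given in this description; the function names and the example counts are hypothetical.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Estimate the frequency p of allele A from genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_alleles


def hardy_weinberg_frequencies(p):
    """Expected frequencies of genotypes (AA, Aa, aa) at equilibrium."""
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


# Hypothetical genotype counts at one locus.
p = allele_frequency(n_AA=30, n_Aa=40, n_aa=30)
expected = hardy_weinberg_frequencies(p)
```

The three expected frequencies always sum to 1, and a rare allele (small p) makes the homozygous genotype AA rarer still, at frequency p squared.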

Project description: Single-cell analysis of the transcriptome deepens our understanding of an individual cell's contribution to its microenvironment. Using single-cell analysis to study complex biological processes requires state-of-the-art computational tools. Assessing similarity is highly important for bioinformatics algorithms in order to determine correlations between biological information. Similarity can appear by chance, particularly for lowly expressed entities. This is especially relevant in single-cell RNA-seq (scRNA-seq) because the read counts obtained are lower than in bulk RNA-sequencing, and therefore classic bioinformatics tools are insufficient to obtain reproducible results. Recently, a Bayesian correlation scheme that assigns low correlation values to correlations coming from lowly expressed genes has been proposed to assess similarity for bulk RNA-seq and miRNA. This Bayesian method uses a prior distribution before using empirical evidence. Our goal was to extend the properties of this Bayesian correlation scheme to scRNA-seq data. We assessed three ways to compute similarity. First, we computed the similarity of each pair of genes over all cells. Second, we identified specific cell populations and computed the correlation in those specific cells. Third, we computed the similarity of each pair of genes over all clusters, by including the total mRNA expression in those cells. To study the effect of the number of cells on the method, we did not rely on simulated data; instead, we generated four scRNA-seq mouse liver cell libraries with a varying number of input cells. Results: We show that Bayesian correlations are more reproducible than Pearson correlations in all the scenarios studied. Compared to Pearson correlations, Bayesian correlations have a smaller dependence on the number of input cells. We demonstrate that the Bayesian correlation algorithm assigns high similarity values to genes with a biological relevance in a specific population.
Significance: Our results demonstrate that Bayesian correlation is a robust similarity measure for scRNA-seq datasets. The Bayesian method allows researchers to study similarity between pairs of genes without discarding lowly expressed entities and to minimize bias in the results from spurious correlations. Taken together, our method of Bayesian correlation significantly increases the reproducibility of scRNA-seq experiments.

Project description: Background: The Gene Ontology (GO) is a dynamic, controlled vocabulary that describes the cellular function of genes and proteins according to three major categories: biological process, molecular function and cellular component. It has become widely used in many bioinformatics applications for annotating genes and measuring their semantic similarity, rather than their sequence similarity. Generally speaking, semantic similarity measures involve the GO tree topology, information content of GO terms, or a combination of both. Results: Here we present a new semantic similarity measure called TopoICSim (Topological Information Content Similarity), which uses information on the specific paths between GO terms based on the topology of the GO tree, and the distribution of information content along these paths. The TopoICSim algorithm was evaluated on two human benchmark datasets based on KEGG pathways and Pfam domains grouped as clans, using GO terms from either the biological process or molecular function. The performance of the TopoICSim measure compared favorably to five existing methods. Furthermore, the TopoICSim similarity was also tested on gene/protein sets defined by correlated gene expression, using three human datasets, and showed improved performance compared to two previously published similarity measures. Finally, we used an online benchmarking resource that evaluates any similarity measure against a set of 11 similarity measures in three tests, using gene/protein sets based on sequence similarity, Pfam domains, and enzyme classifications. The results for TopoICSim showed improved performance relative to most of the measures included in the benchmarking, and in particular a very robust performance throughout the different tests. Conclusions: The TopoICSim similarity measure provides a competitive method with robust performance for quantification of semantic similarity between genes and proteins based on GO annotations.
An R script for TopoICSim is available at http://bigr.medisin.ntnu.no/tools/TopoICSim.R .

Project description: Objective: To compare restricted mean survival time- (RMST-) based methods with traditional survival methods when multiple covariates are of interest. Methods: 4405 osteosarcomas were captured from the Surveillance, Epidemiology, and End Results Program Database. RMST-based methods included group comparison using the Kaplan-Meier (KM) method, pseudovalue (PV) regression, and inverse probability of censoring weighting (IPCW) regressions with group-specific and individual weights. The log-rank test, Wilcoxon test, Cox regression, and its extension with time-dependent variables were selected as traditional methods. The proportional hazard (PH) assumption and the homogeneity of censoring mechanism assumption were assessed. We estimated the hazard ratio (HR) and the difference in RMST and explored their relationships. Results: When a covariate violated the PH assumption, the time-varying HR was inconvenient to report as a single value, whereas the RMST, which does not rely on the PH assumption, allowed the difference in RMST to be reported as a single value. In univariable analyses, using the difference in RMST calculated by the KM method as reference, PV regressions (slope = 1.02 and R² = 0.98) and IPCW regressions with group-specific weights (slope = 0.98 and R² = 0.99) gave more consistent estimates than IPCW with individual weights (slope = 0.31 and R² = 0.06); moreover, PV regressions presented more robust statistical power than IPCW regressions with group-specific weights. In multivariable analyses, IPCW regression with group-specific weights was limited when multiple covariates violated the homogeneity of censoring mechanism assumption.
For covariates that met the PH assumption, well-fitted logarithmic relationships between HR and the difference in RMST estimated by PV regression were observed in both univariable and multivariable analyses (R² = 0.97 and R² = 0.94, respectively), which supported the robustness of PV regression and a possible conversion between the two effect measures. Conclusions: The difference in RMST is more interpretable than a time-varying HR. The performance supports the KM method and PV regression as the preferred RMST-based methods. IPCW regression can be an alternative sensitivity analysis. We encourage adoption of both traditional methods and RMST-based methods to present effects of covariates comprehensively.
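The KM-based group comparison mentioned above rests on the fact that the RMST up to a truncation time tau is the area under the Kaplan-Meier curve on [0, tau]. The following minimal Python sketch shows that calculation only; the function names, the truncation time, and the example data are hypothetical, and the PV and IPCW regressions are not covered.

```python
def km_steps(times, events):
    """Kaplan-Meier steps [(t, S(t))]; events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            steps.append((t, surv))
        n_at_risk -= removed
    return steps


def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area, prev_t, surv = 0.0, 0.0, 1.0
    for t, s in km_steps(times, events):
        if t >= tau:
            break
        area += surv * (t - prev_t)  # curve is flat between event times
        prev_t, surv = t, s
    return area + surv * (tau - prev_t)


# Hypothetical groups; the difference in RMST is a single-number effect measure.
rmst_a = rmst([1, 2, 3], [1, 1, 1], tau=4)
rmst_b = rmst([2, 4], [1, 1], tau=4)
difference = rmst_b - rmst_a
```

Because the difference in RMST is just a difference of two areas, it remains well defined even when the two survival curves cross, which is the situation in which a single hazard ratio is hard to interpret.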
