Dataset Information

A novel similarity-measure for the analysis of genetic data in complex phenotypes.

ABSTRACT:

Background

Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for effective processing of these data has not kept pace. In particular, the Machine Learning literature contains relatively few papers focused on the development and application of data mining methods for the analysis of genetic variability. Moreover, these papers apply to genetic data procedures that were developed for other kinds of analysis and do not take into account the peculiarities of population genetics. The aim of our study was to define a new similarity measure, specifically conceived for measuring the similarity between the genetic profiles of two groups of subjects (i.e., cases and controls), taking into account that genetic profiles are usually distributed in a population group according to the Hardy-Weinberg equilibrium.

Results

We set up a new kernel function consisting of a similarity measure between groups of subjects genotyped for numerous genetic loci. This measure weighs different genetic profiles according to the estimates of gene frequencies at Hardy-Weinberg equilibrium in the population. We named this function the "Hardy-Weinberg kernel". The performance of the Hardy-Weinberg kernel was compared with that of the well-established linear kernel. We found that the Hardy-Weinberg kernel significantly outperformed the linear kernel in a number of experiments where we used either simulated data or real data.
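As a rough illustration of the two ingredients mentioned above, the sketch below computes the genotype frequencies expected under Hardy-Weinberg equilibrium (p², 2pq, q²) at a single biallelic locus, together with a plain linear kernel as the baseline. It is not the paper's Hardy-Weinberg kernel, whose exact definition is not given in this abstract. The 0/1/2 genotype coding and the function names are assumptions made for this example.

```python
def allele_frequency(genotypes):
    """Estimate the frequency p of allele A at one biallelic locus.

    Assumption: each genotype is coded as the number of copies of A (0, 1 or 2).
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    return sum(genotypes) / (2 * len(genotypes))


def hardy_weinberg_frequencies(p):
    """Expected genotype frequencies at Hardy-Weinberg equilibrium.

    Keys follow the same 0/1/2 coding: aa, Aa, AA.
    """
    q = 1.0 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}


def linear_kernel(x, y):
    """Baseline linear kernel: dot product of two genotype vectors."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same loci")
    return sum(a * b for a, b in zip(x, y))


# Hypothetical toy data: four subjects genotyped at one locus.
p = allele_frequency([2, 1, 1, 0])
expected = hardy_weinberg_frequencies(p)
```

Under this coding, a rare genotype is simply one whose expected frequency is small. A kernel that weighs profiles by such frequencies would build on values like `expected`, though how the published kernel does so is not specified here.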

Conclusion

The "Hardy-Weinberg kernel" reported here represents one of the first attempts at incorporating genetic knowledge into the definition of a kernel function designed for the analysis of genetic data. We show that the best performance of the "Hardy-Weinberg kernel" is observed when rare genotypes have different frequencies in cases and controls. The ability to capture the effect of rare genotypes on phenotypic traits might be a very important and useful feature, as most current statistical tools lose much of their statistical power when rare genotypes are involved in the susceptibility to the trait under study.

SUBMITTER: Lagani V

PROVIDER: S-EPMC2697648 | biostudies-literature | 2009 Jun

REPOSITORIES: biostudies-literature

Publications

A novel similarity-measure for the analysis of genetic data in complex phenotypes.

Lagani V, Montesanto A, Di Cianni F, Moreno V, Landi S, Conforti D, Rose G, Passarino G

BMC Bioinformatics, 2009 Jun 16

PMID: 19534750

Similar Datasets

Project description: BACKGROUND: Case-based reasoning is a proven method that relies on learned cases from the past for decision support of a new case. The accuracy of such a system depends on the applied similarity measure, which quantifies the similarity between two cases. This work proposes a collection of methods for similarity measures, especially for comparison of clinical cases based on survival data, as they are available for example from clinical trials. METHODS: Our approach is intended to be used in scenarios where it is of interest to use longitudinal data, such as survival data, for a case-based reasoning approach. This might be especially important where uncertainty about the ideal therapy decision exists. The collection of methods consists of definitions of the local similarity of nominal as well as numeric attributes, a calculation of attribute weights, a feature selection method and finally a global similarity measure. All of them use survival time (consisting of survival status and overall survival) as a reference of similarity. As a baseline, we calculate a survival function for each value of any given clinical attribute. RESULTS: We define the similarity between values of the same attribute by putting the estimated survival functions in relation to each other. Finally, we quantify the similarity by determining the area between corresponding curves of survival functions. The proposed global similarity measure is designed especially for cases from randomized clinical trials or other collections of clinical data with survival information. Overall survival can be considered a suitable alternative basis for similarity calculations. It is especially useful when similarity measures that depend on the classic solution-describing attribute "applied therapy" are not applicable. This is often the case for data from clinical trials containing randomized arms. CONCLUSIONS: In silico evaluation scenarios showed that the mean accuracy of biomarker detection in the k = 10 most similar cases is higher (0.909-0.998) than for competing similarity measures, such as the Heterogeneous Euclidean-Overlap Metric (0.657-0.831) and the Discretized Value Difference Metric (0.535-0.671). The weight calculation method showed a more than six times (6.59-6.95) higher weight for biomarker attributes over non-biomarker attributes. These results suggest that the similarity measure described here is suitable for applications based on survival data.
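The "area between survival curves" idea in the description above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each survival function is a right-continuous step curve given as sorted (time, survival) breakpoints with survival 1.0 before the first breakpoint, and that the area is taken up to a fixed horizon. The abstract does not say how the curves are estimated or how the area is normalized, and the function names are hypothetical.

```python
def step_value(curve, t):
    """Value at time t of a right-continuous step survival curve.

    curve: list of (time, survival) pairs sorted by time (assumed format).
    """
    value = 1.0  # survival is 1.0 before the first breakpoint
    for time, survival in curve:
        if time <= t:
            value = survival
        else:
            break
    return value


def area_between(curve_a, curve_b, horizon):
    """Integrate |S_a(t) - S_b(t)| over [0, horizon] for two step curves."""
    inner = {t for t, _ in curve_a + curve_b if 0.0 < t < horizon}
    times = sorted({0.0, float(horizon)} | inner)
    area = 0.0
    # Both curves are constant on each interval between consecutive breakpoints.
    for left, right in zip(times, times[1:]):
        gap = abs(step_value(curve_a, left) - step_value(curve_b, left))
        area += gap * (right - left)
    return area


# Hypothetical curves for two values of one clinical attribute.
curve_a = [(1.0, 0.5)]
curve_b = [(2.0, 0.5)]
distance = area_between(curve_a, curve_b, horizon=3.0)
```

A smaller area means the two attribute values are associated with more similar survival. Turning this distance into a similarity score requires a normalization choice that the abstract leaves open.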

Project description: Background: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known. Results: To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than random forest, it was never <50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB. Conclusions: Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.

Project description: Single-cell analysis of the transcriptome deepens our understanding of an individual cell's contribution to its microenvironment. Using single-cell analysis to study complex biological processes requires state-of-the-art computational tools. Assessing similarity is highly important for bioinformatics algorithms in order to determine correlations between biological information. Similarity can appear by chance, particularly for lowly expressed entities. This is especially relevant in single-cell RNA-seq (scRNA-seq) because the read counts obtained are lower than in bulk RNA-sequencing, and therefore classic bioinformatics tools are insufficient to obtain reproducible results. Recently, a Bayesian correlation scheme that assigns low correlation values to correlations coming from lowly expressed genes has been proposed to assess similarity for bulk RNA-seq and miRNA. This Bayesian method uses a prior distribution before using empirical evidence. Our goal was to extend the properties of this Bayesian correlation scheme to scRNA-seq data. We assessed 3 ways to compute similarity. First, we computed the similarity of each pair of genes over all cells. Second, we identified specific cell populations and computed the correlation in those specific cells. Third, we computed the similarity of each pair of genes over all clusters, by including the total mRNA expression in those cells. To study the effect of the number of cells on the method, we did not rely on simulated data; instead, we generated 4 scRNA-seq mouse liver cell libraries with a varying number of input cells. Results: We show that Bayesian correlations are more reproducible than Pearson correlations in all the scenarios studied. Compared to Pearson correlations, Bayesian correlations have a smaller dependence on the number of input cells. We demonstrate that the Bayesian correlation algorithm assigns high similarity values to genes with a biological relevance in a specific population. Significance: Our results demonstrate that Bayesian correlation is a robust similarity measure for scRNA-seq datasets. The Bayesian method allows researchers to study similarity between pairs of genes without discarding lowly expressed entities and to minimize the bias introduced by spurious correlations. Taken together, our method of Bayesian correlation significantly increases the reproducibility of scRNA-seq experiments.
