Dataset Information

Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data.

ABSTRACT: Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by the Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC₅₀ measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation. Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: Thanks to this unbiased validation, we now know that this type of model can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
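
The precision/recall trade-off described in the abstract can be illustrated with a small worked example. This is only a sketch on made-up labels, not the authors' R code or data: the ten cell lines, both prediction vectors and all function names are hypothetical, and the Matthews correlation coefficient (MCC) is used here as one possible summary of overall classification performance, which is an assumption rather than something stated in the abstract.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means 'sensitive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp, fn, tn):
    """Fraction of predicted-sensitive cell lines that are truly sensitive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fp, fn, tn):
    """Fraction of truly sensitive cell lines that the predictor detects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical ground truth: 5 sensitive (1) and 5 resistant (0) cell lines.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical single-gene marker: flags only one cell line, and is right about it.
marker_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical multi-gene classifier: finds most sensitive lines, with one false alarm.
multi_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

marker_counts = confusion_counts(y_true, marker_pred)
multi_counts = confusion_counts(y_true, multi_pred)

for name, counts in [("single-gene marker", marker_counts),
                     ("multi-gene model", multi_counts)]:
    print(f"{name}: precision={precision(*counts):.2f} "
          f"recall={recall(*counts):.2f} MCC={mcc(*counts):.2f}")
```

In this toy setting the marker has perfect precision but misses four of the five sensitive cell lines, while the multi-gene predictor gives up some precision for much higher recall and a higher MCC. That is the same qualitative pattern the abstract reports, though the numbers here carry no empirical meaning.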

SUBMITTER: Nguyen L

PROVIDER: S-EPMC5310525 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Motivation: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. Results: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model 'overtraining') which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data. Availability and implementation: Compiled primary and secondary sources of the aggregated corpora are available on: https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner. Supplementary information: Supplementary data are available at Bioinformatics online.
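
The leave-corpus-out strategy mentioned in this description can be sketched as a simple split generator. This is a minimal illustration under stated assumptions, not the implementation from the linked repositories: the function name `leave_corpus_out` and the corpus labels "A", "B" and "C" are hypothetical.

```python
def leave_corpus_out(corpus_ids):
    """Yield (held_out, train_idx, test_idx) once per distinct corpus.

    corpus_ids[i] names the corpus that sample i came from. Each split
    tests on one whole corpus and trains on all the others, so no test
    sample shares a source with any training sample.
    """
    for held_out in sorted(set(corpus_ids)):
        train_idx = [i for i, c in enumerate(corpus_ids) if c != held_out]
        test_idx = [i for i, c in enumerate(corpus_ids) if c == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical example: six annotated sentences drawn from three corpora.
corpus_ids = ["A", "A", "B", "C", "C", "C"]
splits = list(leave_corpus_out(corpus_ids))

for held_out, train_idx, test_idx in splits:
    print(f"held out {held_out}: train={train_idx} test={test_idx}")
```

Unlike random k-fold splitting, every test fold here comes from a source the model has never seen, which is why this kind of split tends to give a less optimistic estimate of generalization.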

Project description: Background: The collection of individual-level pandemic (H1N1) 2009 influenza immunization data was considered important to facilitate optimal vaccine delivery and accurate assessment of vaccine coverage. These data are also critical for research aimed at evaluating the new vaccine's safety and effectiveness. Systems used to collect immunization data include manual approaches in which data are collected and retained on paper, electronic systems in which data are captured on computer at the point of vaccination and hybrid systems which are comprised of both computerized and manual data collection components. This study's objective was to compare the efficiencies and perceptions of data collection methods employed during Canada's pandemic (H1N1) 2009 influenza vaccination campaign. Methods/design: A pan-Canadian observational study was conducted in a convenience sample of public health clinics and healthcare institutions during the H1N1 vaccination campaign in the fall of 2009. The study design consisted of three stages: Stage 1 involved passive observation of the site's layout, processes and client flow; Stage 2 entailed timing site staff on 20 clients through five core immunization tasks: i) client registration, ii) medical history collection, iii) medical history review, iv) vaccine administration record keeping and v) preparation of proof of vaccine administration for the client; in Stage 3, site staff completed a questionnaire regarding perceived usability of the site's data collection approach. Before the national study began, a pilot study was conducted in three seasonal influenza vaccination sites in Ontario, to both test that the proposed methodology was logistically feasible and to determine inter-rater reliability in the measurements of the research staff. Comparative analyses will be conducted across the range of data collection methods with respect to time required to collect immunization data, number and type of individual-level data elements collected, and clinic staff perceptions of the usability of the method employed at their site, using analysis of variance (ANOVA). Discussion: Various data collection methods were employed at immunization sites across Canada during the pandemic (H1N1) 2009 influenza vaccination campaign. Our comparison of methods can facilitate planning an efficient, coordinated approach for collecting immunization data in future influenza seasons.
