Dataset Information

In silico gene prioritization by integrating multiple data sources.

ABSTRACT: Identifying disease genes is crucial to the understanding of disease pathogenesis, and to the improvement of disease diagnosis and treatment. In recent years, many researchers have proposed approaches that prioritize candidate genes by considering the relationship between candidate genes and known disease genes, as reflected in other data sources. In this paper, we propose an expandable framework for gene prioritization that can integrate multiple heterogeneous data sources by taking advantage of a unified graphic representation. Gene-gene relationships and gene-disease relationships are then defined based on the overall topology of each network using a diffusion kernel measure. These relationship measures are in turn normalized to derive an overall measure across all networks, which is used to rank all candidate genes. Based on the informativeness of the available data sources with respect to each specific disease, we also propose an adaptive threshold score to select a small subset of candidate genes for further validation studies. We performed a large-scale cross-validation analysis on 110 disease families using three data sources. The results show that our approach consistently outperforms two other state-of-the-art programs. A case study using Parkinson disease (PD) identified four candidate genes (UBB, SEPT5, GPR37 and TH) that ranked higher than our adaptive threshold, all of which are involved in the PD pathway. In particular, a very recent study observed a deletion of TH in a patient with PD, which supports the importance of the TH gene in PD pathogenesis. A web tool has been implemented to assist scientists in their genetic studies.
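
The ranking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the kernel form K = exp(-beta * L), the value of beta, the z-score normalization, the summation across networks, and every function name below are assumptions chosen for illustration. Genes are represented by integer indices, and the adaptive threshold is not modeled.

```python
import math

def laplacian(n, edges):
    """Graph Laplacian L = D - A for an undirected, unweighted graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def matmul(A, B):
    """Plain square-matrix product (adequate for toy graphs only)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(L, beta=0.5, terms=40):
    """K = exp(-beta * L), approximated by a truncated Taylor series."""
    n = len(L)
    M = [[-beta * x for x in row] for row in L]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]
    for k in range(1, terms):
        # term now holds M^k / k!
        term = [[x / k for x in row] for row in matmul(term, M)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K

def zscores(values):
    """Standardize scores so that different networks are comparable (assumed)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

def rank_candidates(networks, n_genes, seeds, candidates, beta=0.5):
    """Rank candidates by summed, normalized kernel similarity to seed genes."""
    total = {c: 0.0 for c in candidates}
    for edges in networks:
        K = diffusion_kernel(laplacian(n_genes, edges), beta)
        raw = [sum(K[c][s] for s in seeds) for c in candidates]
        for c, z in zip(candidates, zscores(raw)):
            total[c] += z
    return sorted(candidates, key=lambda c: total[c], reverse=True)

# Toy usage: one network shaped as a path 0-1-2-3-4, gene 0 is the known
# disease gene, and genes 2, 3, 4 are candidates.
path = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(rank_candidates([path], 5, seeds=[0], candidates=[4, 2, 3]))
```

The Taylor series is used only to keep the sketch dependency-free; for networks of realistic size a numerical library routine for the matrix exponential would be the natural choice.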

SUBMITTER: Chen Y

PROVIDER: S-EPMC3123338 | biostudies-literature | 2011

REPOSITORIES: biostudies-literature

Publications

In silico gene prioritization by integrating multiple data sources.

Yixuan Chen, Wenhui Wang, Yingyao Zhou, Robert Shields, Sumit K Chanda, Robert C Elston, Jing Li

PLoS One. 2011-06-24; 6

PMID: 21731658

Similar Datasets

Project description: Background: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds the k nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expression.

Results: We report the results in the context of the 2011 Critical Assessment of Function Annotation (CAFA). Prior to the CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN achieved a term-based Area Under the Curve (AUC) for Gene Ontology (GO) molecular function predictions of 0.728 when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. Similar results were observed for the prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that overall MS-kNN accuracy was higher than that of the baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-kNN was rather small.

Conclusions: Based on our results, we have several useful insights: (1) the k-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.
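
The multi-source nearest-neighbor idea described above can be sketched as follows. This is an illustration under stated assumptions, not the published MS-kNN code: the similarity-weighted vote, the plain mean over sources that contain the query, the function names, and the placeholder protein and term labels are all hypothetical.

```python
def knn_scores(similarities, annotations, k):
    """Score terms for one data source.

    similarities: {protein: similarity to the query protein}
    annotations:  {protein: set of function terms}
    Each of the k most similar proteins votes for its terms, weighted by
    its share of the neighbors' total similarity.
    """
    neighbors = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(sim for _, sim in neighbors)
    scores = {}
    if total <= 0:
        return scores
    for protein, sim in neighbors:
        for term in annotations.get(protein, ()):
            scores[term] = scores.get(term, 0.0) + sim / total
    return scores

def ms_knn(sources, annotations, k=3):
    """Average per-source scores over the sources that cover the query."""
    per_source = [knn_scores(s, annotations, k) for s in sources if s]
    combined = {}
    for scores in per_source:
        for term, value in scores.items():
            combined[term] = combined.get(term, 0.0) + value / len(per_source)
    return combined

# Toy usage with placeholder names: two sources cover the query, one is empty.
annotations = {"protA": {"term_x"}, "protB": {"term_y"}}
sequence = {"protA": 0.8, "protB": 0.2}
interactions = {"protA": 1.0}
expression = {}
print(ms_knn([sequence, interactions, expression], annotations, k=2))
```

Skipping empty sources mirrors the observation in the Results that most newly annotated proteins were covered by only one or two data sources.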

Project description: Invasive pneumococcal disease (IPD) is a vaccine-preventable disease characterized by the presence of Streptococcus pneumoniae in normally sterile sites. Since 2007, Italy has implemented an IPD national surveillance system (IPD-NSS). This system suffers from high rates of underreporting. To estimate the level of underreporting of IPD in 2016-2017 in Tuscany (Italy), we integrated data from IPD-NSS and two other regional data sources, i.e., Tuscany regional microbiological surveillance (Microbiological Surveillance and Antibiotic Resistance in Tuscany, SMART) and hospitalization discharge records (HDRs). We collected (1) notifications to IPD-NSS, (2) SMART records positive for S. pneumoniae from normally sterile sites, and (3) hospitalization records with IPD-related International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9) codes in discharge diagnoses. We performed data linkage of the three sources to obtain a combined surveillance system (CSS). Using the CSS, we calculated the completeness of the three sources and performed a three-source log-linear capture-recapture analysis to estimate total IPD underreporting. In total, 127 IPD cases were identified from IPD-NSS, 320 were identified from SMART, and 658 were identified from HDRs. After data linkage, a total of 904 unique cases were detected. The average yearly CSS notification rate was 12.1/100,000 inhabitants. Completeness was 14.0% for IPD-NSS, 35.4% for SMART, and 72.8% for HDRs. The capture-recapture analysis suggested a total estimate of 3419 cases of IPD (95% confidence interval (CI): 1364-5474), corresponding to an underreporting rate of 73.7% (95% CI: 34.0-83.6) for CSS. This study shows substantial underreporting in the Tuscany IPD surveillance system. Integration of available data sources may be a useful approach to complement notification-based surveillance and provide decision-makers with better information to plan effective control strategies against IPD.
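
The completeness figures quoted above appear to be each source's case count divided by the 904 unique linked cases. The short check below assumes that definition; the function name is hypothetical, and the log-linear capture-recapture model itself is not reproduced here because the overlap counts between sources are not given.

```python
def completeness(source_cases, combined_cases):
    """Percentage of the combined (linked) cases captured by one source."""
    return 100.0 * source_cases / combined_cases

counts = {"IPD-NSS": 127, "SMART": 320, "HDRs": 658}
combined = 904  # unique cases after linking the three sources

for name, n in counts.items():
    print(f"{name}: {completeness(n, combined):.1f}%")
# prints 14.0%, 35.4% and 72.8%, matching the reported values
```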
