Dataset Information

Towards linked open gene mutations data.

ABSTRACT: With the advent of high-throughput technologies, a great wealth of variation data is being produced. Such information may constitute the basis for correlation analyses between genotypes and phenotypes and, in the future, for personalized medicine. Several databases on gene variation exist, but this kind of information is still scarce in the Semantic Web framework. In this paper, we discuss issues related to the integration of mutation data in the Linked Open Data infrastructure, part of the Semantic Web framework. We present the development of a mapping from the IARC TP53 Mutation database to RDF and the implementation of servers publishing this data. A version of the IARC TP53 Mutation database implemented in a relational database was used as a first test set. Automatic mappings to RDF were first created by using D2RQ and later manually refined by introducing concepts and properties from domain vocabularies and ontologies, as well as links to Linked Open Data implementations of various systems of biomedical interest. Since D2RQ query performance is lower than that achievable with an RDF archive, the generated data was also loaded into a dedicated system based on tools from the Jena software suite. We have implemented a D2RQ Server for TP53 mutation data, providing data on a subset of the IARC database, including gene variations, somatic mutations, and bibliographic references. The server allows users to browse the RDF graph by following links both between classes and to external systems. An alternative interface offers improved performance for SPARQL queries. The resulting data can be explored by using any Semantic Web browser or application. This has been the first case of a mutation database exposed as Linked Data. A revised version of our prototype, including further concepts and IARC TP53 Mutation database data sets, is under development. The publication of variation information as Linked Data opens new perspectives: the exploitation of SPARQL searches on mutation data and other biological databases may support data retrieval which is presently not possible. Moreover, reasoning on integrated variation data may support discoveries towards personalized medicine.
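
The table-to-RDF mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' D2RQ mapping: the `http://example.org/` namespaces, the table name `mutation`, the column names, and the helper `row_to_triples` are all hypothetical, chosen only to show how one relational row might become N-Triples statements, with one column turned into a link to an external resource rather than a literal.

```python
# Hypothetical base namespace; the real server's URIs are not given in the abstract.
BASE = "http://example.org/tp53/"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def literal(value) -> str:
    """Render a value as a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(table, key_column, row, links=None):
    """Map one relational row to a list of N-Triples lines.

    `links` maps a column name to a URI prefix; values in those columns
    become URIs pointing at an external resource instead of literals.
    """
    links = links or {}
    subject = f"<{BASE}{table}/{row[key_column]}>"
    # One rdf:type statement per row, using the table name as the class.
    triples = [f"{subject} {RDF_TYPE} <{BASE}vocab/{table}> ."]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject URI; NULLs emit nothing
        predicate = f"<{BASE}vocab/{table}_{column}>"
        if column in links:
            obj = f"<{links[column]}{value}>"
        else:
            obj = literal(value)
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


if __name__ == "__main__":
    # Illustrative row only; values are not taken from the IARC database.
    example_row = {"id": 7, "codon": 175, "ref": 42, "note": None}
    for line in row_to_triples("mutation", "id", example_row,
                               links={"ref": "http://example.org/pubmed/"}):
        print(line)
```

Manual refinement, as the abstract describes it, would then amount to replacing the generated `vocab/` class and property URIs with terms from domain vocabularies and ontologies.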

SUBMITTER: Zappa A

PROVIDER: S-EPMC3303732 | biostudies-literature | 2012 Mar

REPOSITORIES: biostudies-literature

Publications

Towards linked open gene mutations data.

Achille Zappa, Andrea Splendiani, Paolo Romano

BMC Bioinformatics, 2012-03-28

PMID: 22536974

Similar Datasets

Project description: BACKGROUND: There is a growing need for efficient and integrated access to databases provided by diverse institutions. Using a linked data design pattern allows the diverse data on the Internet to be linked effectively and accessed efficiently by computers. Previously, we developed the Allie database, which stores pairs of abbreviations and long forms (LFs, or expanded forms) used in the life sciences. LFs define the semantics of abbreviations, and Allie provides a Web-based search service for researchers to look up the LF of an unfamiliar abbreviation. This service encounters two problems. First, it does not display each LF's definition, which could help the user to disambiguate and learn the abbreviations more easily. Second, there are too many LFs for us to prepare a full dictionary from scratch. On the other hand, DBpedia has made the contents of Wikipedia available in the Resource Description Framework (RDF), and is expected to contain a significant number of entries corresponding to LFs. Therefore, linking the Allie LFs to DBpedia entries may solve Allie's problems. This requires a method capable of matching large numbers of string pairs within a reasonable period of time, because Allie and DBpedia are frequently updated. RESULTS: We built a Linked Open Data set that links LFs to DBpedia titles by applying key collision methods (i.e., fingerprint and n-gram fingerprint), which are simple approximate string-matching methods, to their literals. In addition, we used UMLS resources to normalise the life science terms. Combining the key collision methods with the domain-specific resources performed best, and 44,027 LFs have links to DBpedia titles. We manually evaluated the accuracy of the string matching by randomly sampling 1200 LFs, and our approach achieved an F-measure of 0.98. In addition, our experiments revealed the following. (1) Performance was similar regardless of the frequency of the LFs in MEDLINE. (2) There is a relationship (r² = 0.96, P < 0.01) between the occurrence frequencies of LFs in MEDLINE and their presence probabilities in DBpedia titles. CONCLUSIONS: The obtained results help Allie users locate the correct LFs. Because the methods are computationally simple and perform well, and because the most frequently used LFs in MEDLINE appear more often in DBpedia titles, we can continually and reasonably update the linked dataset to reflect the latest publications and additions to DBpedia. Joining LFs between scientific literature and DBpedia enables cross-resource exploration for mutual benefit.
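
The key collision methods named in the project description above can be sketched in Python as follows. This is a generic illustration of fingerprint and n-gram fingerprint keying as commonly implemented in data-cleaning tools, not the Allie authors' code; the function names, the default n-gram size of 2, and the sample strings are assumptions, and the UMLS normalisation step is omitted.

```python
import re
import unicodedata
from collections import defaultdict


def _to_ascii(text: str) -> str:
    """Decompose accented characters and drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fingerprint(text: str) -> str:
    """Key that ignores case, punctuation, token order, and repeated tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", _to_ascii(text).strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def ngram_fingerprint(text: str, n: int = 2) -> str:
    """Key built from the sorted set of character n-grams, ignoring whitespace."""
    cleaned = re.sub(r"[^a-z0-9]", "", _to_ascii(text).lower())
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


def link_by_key(long_forms, titles, key=fingerprint):
    """Link each long form to every title whose key collides with its own."""
    index = defaultdict(list)
    for title in titles:
        index[key(title)].append(title)
    return {lf: index[key(lf)] for lf in long_forms if key(lf) in index}


if __name__ == "__main__":
    # Sample strings are illustrative, not drawn from Allie or DBpedia.
    titles = ["Polymerase Chain Reaction", "Western blot"]
    print(link_by_key(["polymerase chain reaction"], titles))
```

Because each string is reduced to a key once and matches are found by dictionary lookup, the cost grows roughly linearly with the number of strings, which fits the stated need to match large numbers of string pairs within a reasonable time.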

Project description: BACKGROUND: Next Generation Sequencing (NGS) is playing a key role in therapeutic decision making for cancer prognosis and treatment. NGS technologies are producing a massive amount of sequencing data. Often, these datasets are published by separate, isolated sequencing facilities. Consequently, the process of sharing and aggregating multisite sequencing datasets is thwarted by issues such as the need to discover relevant data from different sources, the building of scalable repositories, the automation of data linkage, the volume of the data, efficient querying mechanisms, and information-rich, intuitive visualisation. RESULTS: We present an approach to link and query different sequencing datasets (TCGA, COSMIC, REACTOME, KEGG and GO) to indicate risks for four cancer types - Ovarian Serous Cystadenocarcinoma (OV), Uterine Corpus Endometrial Carcinoma (UCEC), Uterine Carcinosarcoma (UCS), Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) - covering the 16 healthy tissue-specific genes from Illumina Human Body Map 2.0. The differentially expressed genes from Illumina Human Body Map 2.0 are analysed together with the gene expressions reported in the COSMIC and TCGA repositories, leading to the discovery of potential biomarkers for a tissue-specific cancer. CONCLUSION: We analyse the tissue expression of genes, copy number variation (CNV), somatic mutation, and promoter methylation to identify associated pathways and find novel biomarkers. We discovered twenty (20) mutated genes and three (3) potential pathways causing promoter changes in different gynaecological cancer types. We propose a data-interlinked platform called BIOOPENER that glues together heterogeneous cancer and biomedical repositories. The key approach is to find correspondences (or data links) among genetic, cellular and molecular features across isolated cancer datasets, giving insight into cancer progression from normal to diseased tissues. The proposed BIOOPENER platform enriches mutations by filling in missing links from the TCGA, COSMIC, REACTOME, KEGG and GO datasets and provides an interlinking mechanism to understand cancer progression from normal to diseased tissues with pathway components, which in turn helped to map mutations, associated phenotypes, pathways, and mechanisms.
