Dataset Information

A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE).

ABSTRACT: Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information, along with variation-related annotation, can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), has been developed for storing, analyzing, computing and curating NGS data and associated metadata. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu.
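
The workflow described in the abstract (checking nsSNVs against curated functional sites, then integrating calls from several sources to see which variations recur) can be illustrated with a minimal Python sketch. It is not the BioMuta or HIVE implementation: the record layout, the function names, the protein identifier "PROT_A" and the source labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical functional-site annotation: protein -> list of (start, end, label),
# with 1-based inclusive residue positions.
SITES = {"PROT_A": [(10, 12, "active site"), (40, 40, "binding site")]}

# Hypothetical variant records pooled from different sources.
VARIANTS = [
    {"protein": "PROT_A", "pos": 11, "ref": "R", "alt": "H", "source": "curated_db"},
    {"protein": "PROT_A", "pos": 11, "ref": "R", "alt": "H", "source": "ngs_call"},
    {"protein": "PROT_A", "pos": 25, "ref": "G", "alt": "D", "source": "ngs_call"},
    {"protein": "PROT_A", "pos": 40, "ref": "L", "alt": "L", "source": "ngs_call"},
]


def functional_site_hits(variants, sites):
    """Return non-synonymous variants that fall inside an annotated site."""
    hits = []
    for v in variants:
        if v["ref"] == v["alt"]:
            continue  # synonymous at the protein level, so not an nsSNV
        for start, end, label in sites.get(v["protein"], []):
            if start <= v["pos"] <= end:
                hits.append({**v, "site": label})
    return hits


def sources_per_variant(variants):
    """Group identical variants and collect the set of sources reporting each."""
    merged = defaultdict(set)
    for v in variants:
        key = (v["protein"], v["pos"], v["ref"], v["alt"])
        merged[key].add(v["source"])
    return dict(merged)


hits = functional_site_hits(VARIANTS, SITES)
merged = sources_per_variant(VARIANTS)
# Variants reported by more than one source are candidates for "common" nsSNVs.
common = [key for key, sources in merged.items() if len(sources) > 1]
```

In this toy input, the variation at position 11 lies in an annotated site and is reported by two sources, which is the kind of record that integration would surface for prioritization in validation studies.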

SUBMITTER: Wu TJ

PROVIDER: S-EPMC3965850 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE).

Wu Tsung-Jung, Shamsaddini Amirhossein, Pan Yang, Smith Krista, Crichton Daniel J, Simonyan Vahan, Mazumder Raja

Database: the journal of biological databases and curation, 2014-03-25

PMID: 24667251

Similar Datasets

Project description:
Background: With the recent growth of information on sequence variations in the human genome, predictions regarding the functional effects and relevance to disease phenotypes of coding sequence variations are becoming increasingly important. The aims of this study were to catalog protein-coding sequence variations (CVs) occurring in genetic variation databases and to use bioinformatic programs to analyze CVs. In addition, we aim to provide insight into the functionality of the reference databases.
Methodology and findings: To catalog CVs on a genome-wide scale with regard to protein function and disease, we investigated three representative databases; the Human Gene Mutation Database (HGMD), the Single Nucleotide Polymorphisms database (dbSNP), and the Haplotype Map (HapMap). Using these three databases, we analyzed CVs at the protein function level with bioinformatic programs. We proposed a combinatorial approach using the Support Vector Machine (SVM) to increase the performance of the prediction programs. By cataloging the coding sequence variations using these databases, we found that 4.36% of CVs from HGMD are concurrently registered in dbSNP (8.11% of CVs from dbSNP are concurrent in HGMD). The pattern of substitutions and functional consequences predicted by three bioinformatic programs was significantly different among concurrent CVs, and CVs occurring solely in HGMD or in dbSNP. The experimental results showed that the proposed SVM combination noticeably outperformed the individual prediction programs.
Conclusions: This is the first study to compare human sequence variations in HGMD, dbSNP and HapMap at the genome-wide level. We found that a significant proportion of CVs in HGMD and dbSNP overlap, and we emphasize the need to use caution when interpreting the phenotypic relevance of these concurrent CVs. Combining bioinformatic programs can be helpful in predicting the functional consequences of CVs because it improved the performance of functional predictions.

Project description:
Background: The Calgary Audit and Feedback Framework (CAFF) is a pragmatic, evidence-based approach for the design and implementation of in-person social learning interventions using Audit and Group Feedback (AGF). This report describes extension of CAFF into the virtual environment as part of a multifaceted intervention bundle to reduce redundant daily laboratory testing in hospitals. We evaluate the process of extending CAFF in the virtual environment and share resulting evidence of participant engagement with planning for practice change.
Methods: We describe an innovative virtually facilitated AGF intervention based on the CAFF. The AGF intervention was part of an intervention bundle which included individual physician laboratory test utilization reports and educational tools to reduce redundant daily laboratory testing in hospitals. We used data from recorded and transcribed virtual AGF sessions, post AGF session surveys and detailed field notes maintained by project team members. We used simple descriptive statistics for quantitative data and analyzed qualitative data according to the elements of CAFF.
Results: Eighty-three physicians participated over twelve virtual AGF sessions conducted across four tertiary care hospitals during the study period. We demonstrate that all prerequisite activities for CAFF (relationship building, question choice and data representation) were present in every virtual AGF session. Virtual facilitation was effective in supporting the transition of participants through different steps of CAFF in each session to lead to change talk and planning. All participants contributed to discussion during the AGF sessions. The post AGF session surveys were filled by 66% of participants (55/83), with over 90% of respondents reporting that the session helped them improve practice. 46% of participants (38/83) completed personal commitment to change forms at the end of the sessions.
Conclusions: Virtual AGF sessions, developed and implemented with fidelity to the CAFF approach, successfully engaged physicians in a group learning environment that led to change planning. Further studies are needed to determine the generalizability of our findings and to add to the literature on evidence-based virtual facilitation techniques.
