Project description: In a routine clinical environment or clinical trial, a case report form or structured reporting template can be used to quickly generate uniform and consistent reports. Annotation and Image Markup (AIM), a project supported by the National Cancer Institute's cancer Biomedical Informatics Grid, can be used to collect information for a case report form or structured reporting template. AIM is designed to store, in a single information source, (a) the description of pixel data with use of markups or graphical drawings placed on the image, (b) calculation results (which may or may not be directly related to the markups), and (c) supplemental information. To facilitate the creation of AIM annotations with data entry templates, an AIM template schema and an open-source template creation application were developed to help clinicians, imaging researchers, and designers of clinical trials quickly create a set of data collection items, ultimately making image information more readily accessible.
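To make the three-part structure concrete, the sketch below assembles an annotation of the kind AIM stores, a graphical markup, a calculation result, and supplemental information, as XML from Python. The element and attribute names are illustrative assumptions, not the actual AIM template schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation; element and attribute names are
# illustrative and do not follow the real AIM schema.
annotation = ET.Element("ImageAnnotation", name="LesionMeasurement")

# (a) a graphical markup placed on the image
markup = ET.SubElement(annotation, "GeometricShape", shapeType="Polyline")
for x, y in [(102.5, 88.0), (140.2, 88.0), (140.2, 121.7), (102.5, 121.7)]:
    ET.SubElement(markup, "SpatialCoordinate", x=str(x), y=str(y))

# (b) a calculation result derived from the markup
ET.SubElement(annotation, "Calculation",
              description="Longest diameter", value="23.4", units="mm")

# (c) supplemental information recorded by the annotator
ET.SubElement(annotation, "Inference", codeMeaning="Probable malignancy")

print(ET.tostring(annotation, encoding="unicode"))
```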
Project description: INTRODUCTION: As the health system seeks to leverage large-scale data to inform population outcomes, the informatics community is developing tools for analysing these data. To support data quality assessment within such a tool, we extended the open-source Observational Health Data Sciences and Informatics (OHDSI) software to incorporate new functions useful for population health. METHODS: We developed and tested methods to measure the completeness, timeliness and entropy of information. The new data quality methods were applied to over 100 million clinical messages received from emergency department information systems for use in public health syndromic surveillance systems. DISCUSSION: While the completeness and entropy methods were implemented by the OHDSI community, timeliness was not adopted because its context did not fit the existing OHDSI domains. This case report examines the process and the reasons for acceptance and rejection of ideas proposed to an open-source community like OHDSI.
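As a rough illustration of two of the measures named above, the sketch below computes completeness and Shannon entropy for one field of a batch of messages. The field name and message representation are assumptions for illustration, not the OHDSI implementation.

```python
import math
from collections import Counter

def completeness(values):
    """Fraction of records in which the field is populated."""
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values)

def entropy(values):
    """Shannon entropy (bits) of the populated values' distribution."""
    counts = Counter(v for v in values if v not in (None, ""))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy chief-complaint field from six hypothetical ED messages.
chief_complaints = ["chest pain", "fever", "fever", None, "cough", ""]
print(completeness(chief_complaints))  # 0.666... (4 of 6 populated)
print(entropy(chief_complaints))       # 1.5 bits over the populated values
```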
Project description: Omics approaches, including genomics, transcriptomics, proteomics, epigenomics, microbiomics, and metabolomics, generate large data sets. Once they have been used to address the initial study aims, these large data sets are extremely valuable to the greater research community for ancillary investigations. Repurposing available omics data sets provides data to address research questions, generate and test hypotheses, replicate findings, and conduct mega-analyses. Many well-characterized, longitudinal, epidemiological studies collected extensive phenotype data related to symptom occurrence and severity. Although the primary phenotype of interest in many of these studies was not symptom related, symptom data were collected to better characterize that phenotype. A search for symptom data (i.e., cognitive impairment, fatigue, gastrointestinal distress/nausea, sleep, and pain) in the Database of Genotypes and Phenotypes (dbGaP) revealed many studies that collected both symptom and omics data. Nurse scientists therefore have a real opportunity to examine symptom data over time from thousands of individuals and to use omics data to identify key biological underpinnings that account for the development and severity of symptoms, without recruiting participants or generating any new data. The purpose of this article is to introduce the reader to resources that provide omics data to the research community for repurposing, provide guidance on using these databases, and encourage the use of these data to move symptom science forward.
Project description: The field of biodiversity informatics is in a massive "grow-out" phase of creating and enabling large-scale biodiversity data resources. Because perhaps 90% of existing biodiversity data nonetheless remains unavailable for science and policy applications, the question arises of how these existing, available data records can be mobilized most efficiently and effectively. This situation led us to analyze several large-scale biodiversity datasets on birds and plants, detecting information gaps and documenting data "leakage" or attrition in the taxon, time, and place elements of each data record. We documented significant data leakage in each data dimension of each dataset: significant numbers of records lack crucial information on taxon, time, and/or place. Information on place was consistently the least complete, such that georeferencing presently represents the most significant factor degrading the usability of information from biodiversity information resources. Although the full process of digital capture, quality control, and enrichment is important to developing a complete digital record of existing biodiversity information, the payoff in immediate data usability will be greatest with attention paid to the georeferencing challenge.
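The per-record audit behind such leakage estimates can be sketched in a few lines: for each of the three dimensions, a record counts as usable only if every field supporting that dimension is populated. The Darwin Core-like field names below are assumptions for illustration.

```python
# Two toy occurrence records; the second lacks time and place.
records = [
    {"scientificName": "Passer domesticus", "eventDate": "2001-06-14",
     "decimalLatitude": 40.4, "decimalLongitude": -3.7},
    {"scientificName": "Quercus robur", "eventDate": None,
     "decimalLatitude": None, "decimalLongitude": None},
]

dimensions = {
    "taxon": ("scientificName",),
    "time": ("eventDate",),
    "place": ("decimalLatitude", "decimalLongitude"),
}

for name, fields in dimensions.items():
    usable = sum(1 for r in records
                 if all(r.get(f) is not None for f in fields))
    print(f"{name}: {usable}/{len(records)} records usable")
```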
Project description: As genotype databases increase in size, so too does the number of detectable segments of identity by descent (IBD): segments of the genome where two individuals share an identical copy of one of their two parental haplotypes because of shared ancestry. We show that, given a large enough genotype database, these segments of IBD collectively overlap entire chromosomes, including instances of IBD spanning multiple chromosomes, and can be used to accurately separate the alleles inherited from each parent across the entire genome. The resulting phase is not an improvement over state-of-the-art local phasing methods, but it provides accurate long-range phasing that indicates which of two haplotypes in different regions of the genome, including on different chromosomes, was inherited from the same parent. We are able to separate the DNA inherited from each parent completely, across the entire genome, with 98% median accuracy in a test set of 30,000 individuals. We estimate the IBD data requirements for accurate genome-wide phasing, and we propose a method for estimating confidence in the resulting phase. We show that our methods do not require the genotypes of close family members and that they are robust to genotype errors and missing data. In fact, our method can accurately impute missing data and correct genotype errors.
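The core idea can be illustrated with a toy sketch: at a heterozygous site covered by an IBD segment, the allele shared with the IBD match is assigned to whichever parental side that match has been linked to. The data structures and two-sided labeling below are assumptions for illustration, not the paper's algorithm.

```python
def assign_alleles(het_sites, ibd_segments):
    """het_sites: {pos: (allele1, allele2)} for one target individual.
    ibd_segments: (start, end, side, match_alleles) tuples, where side is
    'P1' or 'P2' and match_alleles maps pos -> the IBD match's allele."""
    phased = {}
    for start, end, side, match_alleles in ibd_segments:
        other_side = "P2" if side == "P1" else "P1"
        for pos, (a1, a2) in het_sites.items():
            if start <= pos <= end and pos in match_alleles:
                shared = a1 if match_alleles[pos] == a1 else a2
                unshared = a2 if shared == a1 else a1
                phased[pos] = {side: shared, other_side: unshared}
    return phased

sites = {100: ("A", "G"), 250: ("C", "T")}
segments = [(50, 300, "P1", {100: "G", 250: "C"})]
print(assign_alleles(sites, segments))
# {100: {'P1': 'G', 'P2': 'A'}, 250: {'P1': 'C', 'P2': 'T'}}
```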
Project description: The search for new strategies for better understanding cardiovascular (CV) disease is a constant one, spanning many types of observations and studies. A comprehensive characterization of each disease state and its biomolecular underpinnings relies upon insights gleaned from extensive information collection across various types of data. Researchers and clinicians in CV biomedicine repeatedly face questions about which types of data may best answer their questions, how to integrate information from multiple datasets of various types, and how to adapt emerging advances in machine learning and/or artificial intelligence to their data-processing needs. Frequently lauded as a field with great practical and translational potential, the interface between biomedical informatics and CV medicine is challenged by staggeringly massive datasets. Successfully applying computational approaches to decode this complex and voluminous information is an essential step toward realizing the desired benefits. In this review, we examine recent efforts to adapt informatics strategies to CV biomedical research: automated information extraction and unification of multifaceted -omics data. We discuss how and why this interdisciplinary space of CV informatics is particularly relevant to, and supportive of, current experimental and clinical research. We describe in detail how open data sources and methods can drive discovery while demanding few initial resources, an advantage afforded by the widespread availability of cloud-computing platforms. Subsequently, we provide examples of how interoperable computational systems facilitate exploration of data from multiple sources, including both consistently formatted structured data and unstructured data. Taken together, these approaches for achieving data harmony enable molecular phenotyping of CV diseases and unification of CV knowledge.
Project description: The Function Biomedical Informatics Research Network (FBIRN) developed methods and tools for conducting multi-scanner functional magnetic resonance imaging (fMRI) studies. Method and tool development were based on two major goals: 1) to assess the major sources of variation in fMRI studies conducted across scanners, including instrumentation, acquisition protocols, challenge tasks, and analysis methods, and 2) to provide a distributed network infrastructure and an associated federated database to host and query large, multi-site fMRI and clinical data sets. In the process of achieving these goals, the FBIRN test bed generated several multi-scanner brain imaging data sets to be shared with the wider scientific community via the BIRN Data Repository (BDR). The FBIRN Phase 1 data set consists of a traveling-subject study of 5 healthy subjects, each scanned on 10 different 1.5 to 4 T scanners. The FBIRN Phase 2 and Phase 3 data sets consist of subjects with schizophrenia or schizoaffective disorder, along with healthy comparison subjects, scanned at multiple sites. In this paper, we provide concise descriptions of FBIRN's multi-scanner brain imaging data sets and details about the BIRN Data Repository instance of the Human Imaging Database (HID) used to publicly share the data.
Project description: Next-generation sequencing (NGS) diagnostic assays are increasingly becoming the standard of care in oncology practice. As the scale of an NGS laboratory grows, managing these assays requires organizing large amounts of information, including patient data, laboratory processes, genomic data, and variant interpretation and reporting. Although several Laboratory Information Systems and/or Laboratory Information Management Systems are commercially available, they may not meet all of the needs of a given laboratory, and they are frequently cost-prohibitive. Herein, we present the System for Informatics in the Molecular Pathology Laboratory (SIMPL), a free and open-source Laboratory Information System/Laboratory Information Management System for academic and nonprofit molecular pathology NGS laboratories, developed at the Genomic and Molecular Pathology Division at the University of Chicago Medicine. SIMPL was designed as a modular end-to-end information system that handles all stages of the NGS laboratory workload, from test order to reporting. We describe the features of SIMPL, its clinical validation at University of Chicago Medicine, and its installation and testing within a different academic center laboratory (University of Colorado), and we propose a platform for future community co-development and interlaboratory data sharing.
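As a loose illustration of what "end-to-end, from test order to reporting" can mean in code, the sketch below models a case moving through an ordered series of workflow stages. The stage names and the linear state machine are hypothetical; SIMPL's actual modules and schema are not described here.

```python
from enum import Enum, auto

class Stage(Enum):
    TEST_ORDER = auto()
    ACCESSIONING = auto()
    WET_LAB = auto()
    SEQUENCING = auto()
    BIOINFORMATICS = auto()
    VARIANT_REVIEW = auto()
    REPORTING = auto()

# Each stage advances to the next one in declaration order.
NEXT = dict(zip(list(Stage), list(Stage)[1:]))

class Case:
    def __init__(self, case_id):
        self.case_id = case_id
        self.stage = Stage.TEST_ORDER

    def advance(self):
        if self.stage not in NEXT:
            raise ValueError(f"{self.case_id} has already been reported")
        self.stage = NEXT[self.stage]

case = Case("UC-0001")
case.advance()
print(case.stage)  # Stage.ACCESSIONING
```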
Project description: In the recent biobank era of genetics, the problem of identity-by-descent (IBD) segment detection has received renewed interest, as IBD segments in large cohorts offer unprecedented opportunities for studying population and genealogical history, as well as genetic associations of long haplotypes. Although a new generation of efficient methods for IBD segment detection is becoming available, direct comparison of these methods is difficult: existing benchmarks were often evaluated on different datasets, some of which are not openly accessible; the benchmarked methods were run with suboptimal parameters; and benchmark performance metrics were not defined consistently. Here, we developed a comprehensive and completely open-source evaluation of the power, accuracy, and resource consumption of these IBD segment detection methods using realistic population genetic simulations under various settings. Our results pave the road for fair evaluation of IBD segment detection methods and provide a practical guide for users.
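One way to pin down an accuracy metric consistently, exactly the kind of choice such a benchmark must make explicit, is length-weighted recall: the fraction of true IBD base pairs recovered by detected segments. The definition below is an assumption for illustration; the paper's exact metrics may differ.

```python
def overlap(a, b):
    """Length of overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def length_weighted_recall(true_segments, detected_segments):
    """Assumes detected segments do not overlap one another."""
    total = sum(end - start for start, end in true_segments)
    covered = sum(overlap(t, d)
                  for t in true_segments for d in detected_segments)
    return covered / total if total else 0.0

true_ibd = [(1.0e6, 5.0e6), (8.0e6, 9.5e6)]   # base-pair intervals
detected = [(1.2e6, 4.8e6), (8.1e6, 9.0e6)]
print(length_weighted_recall(true_ibd, detected))  # ~0.818
```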
Project description: Whereas genomic data are universally machine-readable, data from imaging, multiplex biochemistry, flow cytometry and other cell- and tissue-based assays usually reside in loosely organized files of poorly documented provenance. This arises because the relational databases used in genomic research are difficult to adapt to rapidly evolving experimental designs, data formats and analytic algorithms. Here we describe an adaptive approach to managing experimental data based on semantically typed data hypercubes (SDCubes), which combine the Hierarchical Data Format 5 (HDF5) and Extensible Markup Language (XML) file types. We demonstrate SDCube-based storage using ImageRail, a software package for high-throughput microscopy. Experimental design and its day-to-day evolution, not rigid standards, determine how ImageRail data are organized in SDCubes. We applied ImageRail to collect and analyze drug dose-response landscapes in human cell lines at single-cell resolution.
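The HDF5-plus-XML pairing can be sketched as follows: numeric results live in an HDF5 file, while an XML sidecar records what each dimension of the cube means. The group, dataset, and tag names are illustrative assumptions, not the actual SDCube specification (the sketch requires the h5py and numpy packages).

```python
import h5py
import numpy as np
import xml.etree.ElementTree as ET

# A toy single-cell intensity cube: plate rows x columns x cells.
intensities = np.random.rand(8, 12, 500)

with h5py.File("experiment.h5", "w") as f:
    f.create_dataset("plate1/single_cell_intensity", data=intensities)

# Semantic typing lives alongside the numeric data in XML.
meta = ET.Element("DataCube", name="plate1/single_cell_intensity")
for dim, label in enumerate(["well_row", "well_column", "cell_index"]):
    ET.SubElement(meta, "Dimension", index=str(dim), semanticType=label)
ET.ElementTree(meta).write("experiment.xml")
```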