Dataset Information

Knowledge and theme discovery across very large biological data sets using distributed queries: a prototype combining unstructured and structured data.

ABSTRACT: As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.
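The kind of query the abstract describes, one that joins unstructured literature with structured variation data, can be illustrated with a small in-memory map/shuffle/reduce sketch. This is not the authors' Hadoop code: the sample abstracts, the variant rows, and all function names (`map_abstract`, `map_variant`, `shuffle`, `reduce_gene`, `run_query`) are hypothetical and chosen only to show the pattern of keying both data sources on a shared gene symbol.

```python
from collections import defaultdict

# Hypothetical toy inputs; none of these records come from the paper.
ABSTRACTS = [
    ("pmid1", "BRCA1 variants are linked to breast cancer risk"),
    ("pmid2", "TP53 and BRCA1 expression in tumor samples"),
    ("pmid3", "EGFR inhibitors in lung cancer"),
]
VARIANTS = [
    {"gene": "BRCA1", "sample": "s1"},
    {"gene": "BRCA1", "sample": "s2"},
    {"gene": "TP53", "sample": "s1"},
    {"gene": "KRAS", "sample": "s3"},
]


def map_abstract(doc_id, text, genes):
    """Map step for unstructured text: emit (gene, ("lit", doc_id))."""
    for token in set(text.split()):
        if token in genes:
            yield token, ("lit", doc_id)


def map_variant(row):
    """Map step for structured rows: emit (gene, ("var", sample))."""
    yield row["gene"], ("var", row["sample"])


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_gene(gene, values):
    """Reduce step: collect the documents and samples seen for one gene."""
    docs = sorted(v for tag, v in values if tag == "lit")
    samples = sorted(v for tag, v in values if tag == "var")
    return gene, {"documents": docs, "samples": samples}


def run_query(abstracts, variants):
    """Return genes that appear in both the literature and the variant table."""
    genes = {row["gene"] for row in variants}
    pairs = []
    for doc_id, text in abstracts:
        pairs.extend(map_abstract(doc_id, text, genes))
    for row in variants:
        pairs.extend(map_variant(row))
    reduced = dict(reduce_gene(g, vs) for g, vs in shuffle(pairs).items())
    # Keep only genes supported by both data sources (an inner join).
    return {g: r for g, r in reduced.items() if r["documents"] and r["samples"]}


if __name__ == "__main__":
    for gene, hit in sorted(run_query(ABSTRACTS, VARIANTS).items()):
        print(gene, hit)
```

In a real Hadoop job the two map functions would run over separate input splits and the shuffle would be handled by the framework; the sketch only mirrors that data flow on a single machine.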

SUBMITTER: Mudunuri US

PROVIDER: S-EPMC3846626 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Knowledge and theme discovery across very large biological data sets using distributed queries: a prototype combining unstructured and structured data.

Mudunuri US, Khouja M, Repetski S, Venkataraman G, Che A, Luke BT, Girard FP, Stephens RM

PLoS One, 2013-12-02, issue 12

PMID: 24312478

Similar Datasets

Project description: Introduction: Keloids are lesions characterized by the growth of dense fibrous tissue extending beyond original wound boundaries. Research into the natural history of keloids and potential differences by sociodemographic factors in the USA is limited. This real-world, retrospective cohort study aimed to characterize a population of patients with keloids compared with matched dermatology and general cohorts. Methods: Patients with ≥ 2 International Classification of Diseases codes for keloid ≥ 30 days apart and a confirmed keloid diagnosis from clinical notes enrolled in the OM1 Real-World Data Cloud between 1 January 2013 and 18 March 2022 were age- and sex-matched 1:1:1 to patients without keloids who visited dermatologists ("dermatology cohort") and those who did not ("general cohort"). Results are presented using descriptive statistics and analysis stratified by cohort, race, ethnicity, household income, and education. Results: Overall, 24,453 patients with keloids were matched to 23,936 dermatology and 24,088 general patients. A numerically higher proportion of patients with keloids were Asian or Black. Among available data for patients with keloids, 67.7% had 1 keloid lesion, and 68.3% had keloids sized 0.5 to < 3 cm. Black patients tended to have larger keloids. Asian and Black patients more frequently had > 1 keloid than did white patients (30.6% vs. 32.5% vs. 20.5%). Among all patients with keloids who had available data, 56.4% had major keloid severity, with major severity more frequent in Black patients. Progression was not significantly associated with race, ethnicity, income, or education level; 29%, 25%, and 20% of the dermatology, keloid, and general cohorts were in the highest income bracket (≥ US$75,000). The proportion of patients with income below the federal poverty line (< US$22,000) and patterns of education level were similar across cohorts. Conclusion: A large population of patients in the USA with keloids was identified and characterized using structured/unstructured sources. A numerically higher proportion of patients with keloids were non-white; Black patients had larger, more severe keloids at diagnosis.

Project description: BACKGROUND: With microarray technology, variability in experimental environments such as RNA sources, microarray production, or the use of different platforms, can cause bias. Such systematic differences present a substantial obstacle to the analysis of microarray data, resulting in inconsistent and unreliable information. Therefore, one of the most pressing challenges in the field of microarray technology is how to integrate results from different microarray experiments or combine data sets prior to the specific analysis. RESULTS: Two microarray data sets based on a 17k cDNA microarray system were used, consisting of 82 normal colon mucosa and 72 colorectal cancer tissues. Each data set was prepared from either total RNA or amplified mRNA, and the difference in RNA source between these two data sets was detected by an ANOVA (analysis of variance) model. A simple integration method was introduced which was based on the distributions of gene expression ratios among different microarray data sets. The method transformed gene expression ratios into the form of a reference data set on a gene-by-gene basis. Hierarchical clustering analysis, density and box plots, and mixture scores with correlation coefficients revealed that the two data sets were well intermingled, indicating that the proposed method minimized the experimental bias. In addition, no RNA source effect was detected with the proposed transformation method. In the mixed data set, two previously identified subgroups of normal and tumor were well separated, and the efficiency of integration was more prominent in tumor groups than normal groups. The transformation method was slightly more effective when a data set with strong homogeneity in the same experimental group was used as a reference data set. CONCLUSION: The proposed method is simple but useful for combining several data sets from different experimental conditions. With this method, biologically useful information can be detected by applying various analytic methods to the combined data set with increased sample size.

Project description: Background: Low back pain (LBP) is a common condition made up of a variety of anatomic and clinical subtypes. Lumbar disc herniation (LDH) and lumbar spinal stenosis (LSS) are two subtypes highly associated with LBP. Patients with LDH/LSS are often started with non-surgical treatments and if those are not effective then go on to have decompression surgery. However, recommendation of surgery is complicated as the outcome may depend on the patient's health characteristics. We developed a deep learning (DL) model to predict decompression surgery for patients with LDH/LSS. Materials and method: We used datasets of 8387 and 8620 patients from a prospective study that collected data from four healthcare systems to predict early (within 2 months) and late surgery (within 12 months after a 2 month gap), respectively. We developed a DL model to use patients' demographics, diagnosis and procedure codes, drug names, and diagnostic imaging reports to predict surgery. For each prediction task, we evaluated the model's performance using classical and generalizability evaluation. For classical evaluation, we split the data into training (80%) and testing (20%). For generalizability evaluation, we split the data based on the healthcare system. We used the area under the curve (AUC) to assess performance for each evaluation. We compared results to a benchmark model (i.e. LASSO logistic regression). Results: For classical performance, the DL model outperformed the benchmark model for early surgery with an AUC of 0.725 compared to 0.597. For late surgery, the DL model outperformed the benchmark model with an AUC of 0.655 compared to 0.635. For generalizability performance, the DL model outperformed the benchmark model for early surgery. For late surgery, the benchmark model outperformed the DL model. Conclusions: For early surgery, the DL model was preferred for classical and generalizability evaluation. However, for late surgery, the benchmark and DL model had comparable performance. Depending on the prediction task, the balance of performance may shift between DL and a conventional ML method. As a result, thorough assessment is needed to quantify the value of DL, a relatively computationally expensive, time-consuming and less interpretable method.

Project description: Background: Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on the predictive analysis of a trial's publishability given an individual (planned) clinical trial description. Objective: We aimed to conduct a study that combined structured and unstructured features relevant to publication status in a single predictive approach. Established natural language processing techniques as well as recent pretrained language models enabled us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We were particularly interested in whether and which textual features could improve the classification accuracy for publication outcomes. Methods: In this study, we used metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a database of academic journal articles) to build a data set of clinical trials (N=76,950) that contained the description of a registered trial and its publication outcome (27,702/76,950, 36% published and 49,248/76,950, 64% unpublished). This is the largest data set of its kind, which we released as part of this work. The publication outcome in the data set was identified from MEDLINE based on clinical trial identifiers. We carried out a descriptive analysis and predicted the publication outcome using 2 approaches: a neural network with a large domain-specific language model and a random forest classifier using a weighted bag-of-words representation of text. Results: First, our analysis of the newly created data set corroborates several findings from the existing literature regarding attributes associated with a higher publication rate. Second, a crucial observation from our predictive modeling was that the addition of textual features (eg, eligibility criteria) offers consistent improvements over using only structured data (F1-score=0.62-0.64 vs F1-score=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two. Conclusions: Different factors affect the publication of a registered clinical trial. Our approach to predictive modeling combines heterogeneous features, both structured and unstructured. We show that methods from natural language processing can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task previously.
