Dataset Information

Using text analysis to identify functionally coherent gene groups.

ABSTRACT: The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
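The evaluation figure quoted above ("79% sensitivity at 100% specificity") can be read as follows: set the score threshold just above the best-scoring random group, so that no random group is accepted, then count how many of the known functional groups still exceed it. The sketch below illustrates only that calculation. The function name and the toy scores are hypothetical and are not taken from the paper; it does not implement neighbor divergence itself.

```python
from typing import Sequence


def sensitivity_at_full_specificity(
    positive_scores: Sequence[float], negative_scores: Sequence[float]
) -> float:
    """Fraction of positives that score strictly above every negative.

    Placing the threshold at the highest negative score rejects all
    negatives (100% specificity); sensitivity is the share of positives
    that remain above that threshold.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    threshold = max(negative_scores)
    hits = sum(1 for score in positive_scores if score > threshold)
    return hits / len(positive_scores)


# Toy, made-up scores (NOT data from the paper): 19 "functional" groups,
# 15 of which outscore all 1900 "random" groups.
functional = [5.0] * 15 + [0.5] * 4
random_groups = [i / 1900 for i in range(1900)]  # all below 1.0

sens = sensitivity_at_full_specificity(functional, random_groups)
print(f"{sens:.0%}")  # 15 of 19 groups
```

With 19 functional groups, a sensitivity of about 79% corresponds to 15 groups scoring above all 1900 random groups (15/19 ≈ 0.789).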

SUBMITTER: Raychaudhuri S

PROVIDER: S-EPMC187532 | biostudies-literature | 2002 Oct

REPOSITORIES: biostudies-literature


Publications

Using text analysis to identify functionally coherent gene groups.

Raychaudhuri S, Schütze H, Altman RB

Genome Research, 2002 Oct 1, issue 10


PMID: 12368251

Similar Datasets

Project description: Importance: There has been a growth in the use of performance-based payment models in the past decade, but inherently noisy and stochastic quality measures complicate the assessment of the quality of physician groups. Examining consistently low performance across multiple measures or multiple years could potentially identify a subset of low-quality physician groups. Objective: To identify low-performing physician groups based on consistently low performance after adjusting for patient characteristics across multiple measures or multiple years for 10 commonly used quality measures for diabetes and cardiovascular disease (CVD). Design, setting, and participants: This cross-sectional study used medical and pharmacy claims and laboratory data for enrollees ages 18 to 65 years with diabetes or CVD in an Aetna health insurance plan between 2016 and 2019. Each physician group's risk-adjusted performance for a given year was estimated using mixed-effects linear probability regression models. Performance was correlated across measures and time, and the proportion of physician groups that performed in the bottom quartile was examined across multiple measures or multiple years. Data analysis was conducted between September 2020 and May 2021. Exposures: Primary care physician groups. Main outcomes and measures: Performance scores of 6 quality measures for diabetes and 4 for CVD, including hemoglobin A1c (HbA1c) testing, low-density lipoprotein testing, statin use, HbA1c control, low-density lipoprotein control, and hospital-based utilization. Results: A total of 786 641 unique enrollees treated by 890 physician groups were included; 414 655 (52.7%) of the enrollees were men and the mean (SD) age was 53 (9.5) years.
After adjusting for age, sex, and clinical and social risk variables, correlations among individual measures were weak (eg, performance-adjusted correlation between any statin use and LDL testing for patients with diabetes, r = -0.10) to moderate (correlation between LDL testing for diabetes and LDL testing for CVD, r = 0.43), but year-to-year correlations for all measures were moderate to strong. One percent or fewer of physician groups performed in the bottom quartile for all 6 diabetes measures or all 4 cardiovascular disease measures in any given year, while 14 (4.0%) to 39 groups (11.1%) were in the bottom quartile in all 4 years for any given measure other than hospital-based utilization for CVD (1.1%). Conclusions and relevance: A subset of physician groups that was consistently low performing could be identified by considering performance measures across multiple years. Considering the consistency of group performance could contribute a novel method to identify physician groups most likely to benefit from limited resources.

Project description: Background: Toxicogenomics studies often profile gene expression from assays involving multiple doses and time points. The dose- and time-dependent pattern is of great importance to assess toxicity but computational approaches are lacking to effectively utilize this characteristic in toxicity assessment. Topic modeling is a text mining approach, but may be used analogously in toxicogenomics due to the similar data structures between text and gene dysregulation. Results: Topic modeling was applied to a very large toxicogenomics dataset containing microarray gene expression data from >15,000 samples associated with 131 drugs tested in three different assay platforms (i.e., in vitro assay, in vivo repeated dose study and in vivo single dose experiment) with a design including multiple doses and time points. A set of "topics" which each consist of a set of genes was determined, by which the varying sensitivity of three assay systems was observed. We found that the drug-dependent effect was more pronounced in the two in vivo systems than the in vitro system, while the time-dependent effect was most strongly reflected in the in vitro system followed by the single dose study and lastly the repeated dose experiment. The dose-dependent effect was similar across three assay systems. Although the results indicated a challenge to extrapolate the in vitro results to the in vivo situation, we did notice that, for some drugs but not for all the drugs, the similarity in gene expression patterns was observed across all three assay systems, indicating a possibility of using in vitro systems with careful designs (such as the choice of dose and time point), to replace the in vivo testing strategy.
Nonetheless, a potential to replace the repeated dose study by the single-dose short-term methodology was strongly implied. Conclusions: The study demonstrated that text mining methodologies such as topic modeling provide an alternative method compared to traditional means for data reduction in toxicogenomics, enhancing researchers' capabilities to interpret biological information.

Project description: Parenting interventions offer an evidence-based method for the prevention and early intervention of child mental health problems, but to date their population-level effectiveness has been limited by poor reach and engagement, particularly for fathers, working mothers, and disadvantaged families. Tailoring intervention content to parents' context offers the potential to enhance parent engagement and learning by increasing relevance of content to parents' daily experiences. However, this approach requires a detailed understanding of the common parenting situations and issues that parents face day-to-day, which is currently lacking. We sought to identify the most common parenting situations discussed by parents on parenting-specific forums of the free online discussion forum, Reddit. We aimed to understand perspectives from both mothers and fathers, and thus retrieved publicly available data from r/Daddit and r/Mommit. We used latent Dirichlet allocation to identify the 10 most common topics discussed in the Reddit posts, and completed a manual text analysis to summarize the parenting situations (defined as involving a parent and their child aged 0-18 years, and describing a potential/actual issue). We retrieved 340 (r/Daddit) and 578 (r/Mommit) original posts. A model with 31 latent Dirichlet allocation topics was best fitting, and 24 topics included posts that met our inclusion criteria for manual review. We identified 45 unique but broadly defined parenting situations. The majority of parenting situations were focused on basic childcare situations relating to eating, sleeping, routines, sickness, and toilet training; or related to how to respond to child negative emotions or difficult behavior. Most situations were discussed in relation to infant or toddler aged children, and there was high consistency in the themes raised in r/Daddit and r/Mommit.
Our results offer potential to tailor parenting interventions in a meaningful way, creating opportunities to develop content and resources that are directly relevant to parents' lived experiences.

Project description: Background: Inappropriate prescribing of diagnostic procedures leads to overdiagnosis, overtreatment and resource waste in healthcare systems. Effective strategies to measure and to overcome inappropriateness are essential to increasing the value and sustainability of care. We aimed to describe the determinants of inappropriate reporting of the clinical question and of inappropriate imaging and endoscopy referrals through an analysis of general practitioners' (GP) referral forms in the province of Reggio Emilia, Italy. Methods: A clinical audit was conducted on routinely collected referral forms of all GPs of Reggio Emilia province. All prescriptions for gastroscopy, colonoscopy, neurological and musculoskeletal computerised tomography (CT) and magnetic resonance imaging (MRI) from 2012 to 2017 were included. The appropriateness of referral forms was assessed using Clinika VAP software, which combines semantic analysis of clinical questions and available metadata. Local protocols agreed on by all physicians defined criteria of appropriateness. Two multilevel logistic models were used to identify multiple predictors of inappropriateness of referral forms and to analyse variability among GPs, primary care subdistricts and healthcare districts. Results: Overall, 37% of referral forms were classified as inappropriate; gastroscopy and CT showed higher proportions of inappropriate referrals compared to colonoscopy and MRI. Inappropriateness increased with patient age for CT and MRI; for gastroscopy, it was lower for patients aged 65-84 compared to those younger, and for colonoscopy, it was higher for older patients. Fee exemptions were associated with inappropriateness in MRI referral forms. The effect of GPs' practice organization was consistent across all tests, showing higher inappropriateness for primary care medical networks than in primary care medical groups.
Male GPs were associated with inappropriateness in endoscopy, and older GPs were associated with inappropriateness in musculoskeletal CT. While there was moderate variability in the inappropriate prescribing among GPs, there was not among the healthcare districts or primary care subdistricts. Conclusions: Routinely collected data and IT tools can be useful to identify and monitor diagnostic procedures at high risk of inappropriate prescribing. Assessing determinants of inappropriate referral makes it possible to tailor educational and organizational interventions to those who need them.
