Dataset Information

Content-rich biological network constructed by mining PubMed abstracts.

ABSTRACT:

Background

The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances such as microarrays and by the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community. Despite the existence of text-mining methods that identify biological relationships based on the textual co-occurrence of gene/protein terms or similarities in abstract texts, knowledge of the underlying molecular connections on a large scale, which is prerequisite to understanding novel biological processes, lags far behind the accumulation of data. While computationally efficient, the co-occurrence-based approaches fail to characterize biological interactions (e.g., inhibition or stimulation, directionality). Programs with natural language processing (NLP) capability have been created to address these limitations; however, they are in general not readily accessible to the public.
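To make the limitation of co-occurrence concrete, here is a minimal sketch of sentence-level co-occurrence counting. It is an illustration only, not Chilibot's method or any specific published tool; the function name `cooccurrence_counts`, the term list, and the toy sentences are all invented for this example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, terms):
    """Count unordered term pairs that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenization: lowercase, drop periods, split on whitespace.
        words = set(sentence.lower().replace(".", " ").split())
        present = sorted(t for t in terms if t.lower() in words)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Hypothetical toy input, invented for illustration only.
sentences = [
    "GeneA inhibits GeneB in neurons.",
    "GeneB stimulates GeneC.",
    "GeneA and GeneB are coexpressed.",
]
terms = ["GeneA", "GeneB", "GeneC"]
counts = cooccurrence_counts(sentences, terms)
for pair, n in counts.items():
    print(pair, n)
```

The result is a set of unordered pairs with counts. The verbs "inhibits" and "stimulates" are discarded, so neither the type nor the direction of an interaction survives, which is the gap the abstract attributes to co-occurrence-based approaches.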

Results

We present an NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. Among its features is the ability to generate suggestions for new hypotheses. Lastly, we provide evidence that the connectivity of molecular networks extracted from the biological literature follows a power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses.
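A power-law connectivity distribution means that the number of nodes with degree k falls off roughly as k^(-gamma). The sketch below shows one rough way to check this on a network: tabulate node degrees, then fit a straight line to log(count) against log(degree). This record does not say how the authors performed their analysis, so the estimator (ordinary least squares on a log-log scale), the function names, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def degree_counts(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

def fit_power_law_exponent(counts):
    """Least-squares slope of log(count) vs log(degree); returns gamma = -slope."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return -numerator / denominator

# Hypothetical star graph: one hub connected to three leaves.
star_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]
star_counts = degree_counts(star_edges)

# Synthetic degree table constructed so that count = 64 * degree**-2 exactly.
toy_counts = {1: 64, 2: 16, 4: 4, 8: 1}
gamma = fit_power_law_exponent(toy_counts)
print(dict(star_counts), gamma)
```

A log-log regression is a simple but crude estimator; on real degree data it is sensitive to binning and to the sparse high-degree tail, so it is best read as a first look rather than a rigorous test of scale-free structure.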

Conclusions

Chilibot distills scientific relationships from knowledge available throughout a wide range of biological domains and presents these in a content-rich graphical format, thus integrating general biomedical knowledge with the specialized knowledge and interests of the user. Chilibot (http://www.chilibot.net) is available free of charge to academic users.

SUBMITTER: Chen H

PROVIDER: S-EPMC528731 | biostudies-literature | 2004 Oct

REPOSITORIES: biostudies-literature

Publications

Content-rich biological network constructed by mining PubMed abstracts.

Chen H, Sharp BM

BMC Bioinformatics, 2004 Oct 8

PMID: 15473905

Similar Datasets

Project description:

Background: The field of epidemiological criminology (or justice health research) has emerged in the past decade, studying the intersection between the public health and justice systems. To ensure research efforts are focused and equitable, it is important to reflect on the outputs in this area and address knowledge gaps.

Objective: This study aimed to examine the characteristics of populations researched in a large sample of published outputs and identify research gaps and biases.

Methods: A rule-based, text mining method was applied to 34,481 PubMed abstracts published from 1963 to 2023 to identify 4 population characteristics (sex, age, offender type, and nationality).

Results: We evaluated our method in a random sample of 100 PubMed abstracts. Microprecision was 94.3%, with microrecall at 85.9% and micro-F1-score at 89.9% across the 4 characteristics. Half (n=17,039, 49.4%) of the 34,481 abstracts did not have any characteristic mentions and only 1.3% (n=443) reported sex, age, offender type, and nationality. From the 5170 (14.9%) abstracts that reported age, 3581 (69.3%) mentioned young people (younger than 18 years) and 3037 (58.7%) mentioned adults. Since 1990, studies reporting female-only populations increased, and in 2023, these accounted for almost half (105/216, 48.6%) of the research outputs, as opposed to 33.3% (72/216) for male-only populations. Nordic countries (Sweden, Norway, Finland, and Denmark) had the highest number of abstracts proportional to their incarcerated populations. Offenders with mental illness were the most common group of interest (840/4814, 17.4%), with an increase from 1990 onward.

Conclusions: Research reporting on female populations increased, surpassing that involving male individuals, despite female individuals representing 5% of the incarcerated population; this suggests that male prisoners are underresearched. Although calls have been made for the justice health area to focus more on young people, our results showed that among the abstracts reporting age, most mentioned a population aged <18 years, reflecting a rise of youth involvement in the youth justice system. Those convicted of sex offenses and crimes relating to children were not as researched as the existing literature suggests, with a focus instead on populations with mental illness, whose rates rose steadily in the last 30 years. After adjusting for the size of the incarcerated population, Nordic countries have conducted proportionately the most research. Our findings highlight that despite the presence of several research reporting guidelines, justice health abstracts still do not adequately describe the investigated populations. Our study offers new insights in the field of justice health with implications for promoting diversity in the selection of research participants.
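The reported micro-F1-score can be cross-checked against the reported microprecision and microrecall, since F1 is the harmonic mean of precision and recall. The short sketch below does only that arithmetic; the function name `f1_score` is chosen for this example, and the inputs are the two percentages quoted in the project description above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Microprecision 94.3% and microrecall 85.9%, as quoted in the description.
micro_f1 = f1_score(0.943, 0.859)
print(f"{micro_f1 * 100:.1f}%")
```

The value agrees with the 89.9% given in the description, so the three reported figures are mutually consistent.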

Project description:

Background: The emerging field of epidemiological criminology studies the intersection between public health and justice systems. To increase the value of and reduce waste in research activities in this area, it is important to perform transparent research priority setting considering the needs of research beneficiaries and end users along with a systematic assessment of the existing research activities to address gaps and harness opportunities.

Objective: In this study, we aimed to examine published research outputs in epidemiological criminology to assess gaps between published outputs and current research priorities identified by prison stakeholders.

Methods: A rule-based method was applied to 23,904 PubMed epidemiological criminology abstracts to extract the study determinants and outcomes (ie, "themes"). These were mapped against the research priorities identified by Australian prison stakeholders to assess the differences from research outputs. The income level of the affiliation country of the first authors was also identified to compare the ranking of research priorities in countries categorized by income levels.

Results: On an evaluation set of 100 abstracts, the identification of themes returned an F1-score of 90%, indicating reliable performance. More than 53.3% (11,927/22,361) of the articles had at least 1 extracted theme; the most common was substance use (1533/11,814, 12.97%), followed by HIV (1493/11,814, 12.64%). The infectious disease category (2949/11,814, 24.96%) was the most common research priority category, followed by mental health (2840/11,814, 24.04%) and alcohol and other drug use (2433/11,814, 20.59%). A comparison between the extracted themes and the stakeholder priorities showed an alignment for mental health, infectious diseases, and alcohol and other drug use. Although behavior- and juvenile-related themes were common, they did not feature as prison priorities. Most studies were conducted in high-income countries (10,083/11,814, 85.35%), while countries with the lowest income status focused half of their research on infectious diseases (47/91, 52%).

Conclusions: The identification of research themes from PubMed epidemiological criminology research abstracts is possible through the application of a rule-based text mining method. The frequency of the investigated themes may reflect historical developments concerning disease prevalence, treatment advances, and the social understanding of illness and incarcerated populations. The differences between income status groups are likely to be explained by local health priorities and immediate health risks. Notable gaps between stakeholder research priorities and research outputs concerned themes that were more focused on social factors and systems and may reflect publication bias or self-publication selection, highlighting the need for further research on prison health services and the social determinants of health. Different jurisdictions, countries, and regions should undertake similar systematic and transparent research priority-setting processes.

Project description:

Background: The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) has a goal of providing bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process.

Description: We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an F-measure average of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application.

Conclusion: Our Text Mining application is available online on the ERIC website (http://www.ericbrc.org/portal/eric/articles). The information retrieval interface displays a list of recently published enteropathogen literature abstracts, and also provides a search interface to execute custom queries by keyword, date range, etc. Upon selection, processed abstracts and the entities and relations extracted from them are retrieved from a relational database and marked up to highlight the entities and relations. The abstract also provides links from extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations systems.
