Dataset Information

Investigating heterogeneous protein annotations toward cross-corpora utilization.

ABSTRACT:

Background

The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, integrating the heterogeneous annotations in these resources becomes a task of interest.

Results

We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score declines substantially when training with integrated data instead of training with pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aforementioned aspects.
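
As an illustration of how boundary annotation conventions alone can shift a measured F-score, the following minimal Python sketch compares strict (exact-span) and relaxed (overlap) matching on a single made-up sentence. The sentence, the character offsets, the two annotation conventions, and the function names are assumptions for illustration only; they are not taken from GENIA, GENETAG, or AIMed, and this is not the evaluation procedure used in the paper.

```python
from typing import Set, Tuple

# A span is (start, end) in character offsets, end exclusive.
Span = Tuple[int, int]

# Hypothetical sentence: "human p53 protein binds MDM2"
# Convention A (assumed): annotate the minimal name only.
GOLD: Set[Span] = {(6, 9), (24, 28)}    # "p53", "MDM2"
# Convention B (assumed): include species and head noun in the mention.
PRED: Set[Span] = {(0, 17), (24, 28)}   # "human p53 protein", "MDM2"


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: Set[Span], pred: Set[Span], relaxed: bool = False) -> float:
    """Balanced F-score under exact-span or overlap-based matching."""
    if relaxed:
        # A predicted span counts if it overlaps any gold span, and vice versa.
        hit_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        hit_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        # Only identical (start, end) pairs count as matches.
        hit_pred = hit_gold = len(gold & pred)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print("strict :", f_score(GOLD, PRED))
    print("relaxed:", f_score(GOLD, PRED, relaxed=True))
```

In this toy case the two conventions agree on which entities are present and differ only in where the mention boundaries fall, so the gap between the strict and relaxed scores isolates the boundary effect from the other three sources of incompatibility listed above.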

Conclusion

Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

SUBMITTER: Wang Y

PROVIDER: S-EPMC2804683 | biostudies-literature | 2009 Dec

REPOSITORIES: biostudies-literature


Publications

Investigating heterogeneous protein annotations toward cross-corpora utilization.

Wang Yue, Kim Jin-Dong, Saetre Rune, Pyysalo Sampo, Tsujii Jun'ichi

BMC Bioinformatics, 2009 Dec 9


PMID: 19995463

Similar Datasets

Project description: Background: Out-of-hospital cardiac arrest (OHCA) is a leading cause of mortality in the developed world. Timely detection of cardiac arrest and prompt activation of emergency medical services (EMS) are essential, yet challenging. Automated cardiac arrest detection using sensor signals from smartwatches has the potential to shorten the interval between cardiac arrest and activation of EMS, thereby increasing the likelihood of survival. Objective: This cross-sectional survey study aims to investigate users' perspectives on aspects of continuous monitoring such as privacy and data protection, as well as other implications, and to collect insights into their attitudes toward the technology. Methods: We conducted a cross-sectional web-based survey in the Netherlands among 2 groups of potential users of automated cardiac arrest technology: consumers who already own a smartwatch and patients at risk of cardiac arrest. Surveys primarily consisted of closed-ended questions with some additional open-ended questions to provide supplementary insight. The quantitative data were analyzed descriptively, and a content analysis of the open-ended questions was conducted. Results: In the consumer group (n=1005), 90.2% (n=906; 95% CI 88.1%-91.9%) of participants expressed an interest in the technology, and 89% (n=1196; 95% CI 87.3%-90.7%) of the patient group (n=1344) showed interest. More than 75% (consumer group: n=756; patient group: n=1004) of the participants in both groups indicated they were willing to use the technology. The main concerns raised by participants regarding the technology included privacy, data protection, reliability, and accessibility. Conclusions: The vast majority of potential users expressed a strong interest in and positive attitude toward automated cardiac arrest detection using smartwatch technology. However, a number of concerns were identified, which should be addressed in the development and implementation process to optimize acceptance and effectiveness of the technology.

Project description: Background: Mobile apps facilitate patients' access to portals and interaction with their healthcare providers. The COVID-19 pandemic accelerated this trend globally, but little evidence exists on patient portal usage in the Middle East, where internet access and digital literacy are limited. Our study aimed to explore how users utilize a patient portal through its related mobile app (MyChart by EPIC). Methods: We conducted a cross-sectional survey of MyChart users, recruited from a tertiary care center in Lebanon. We collected MyChart usage patterns, perceived outcomes, and app quality, based on the Mobile Application Rating Scale (user version, uMARS), and sociodemographic factors. We examined associations between app usage, app quality, and sociodemographic factors using Pearson's correlations, Chi-square, ANOVA, and t-tests. Results: 428 users completed the survey; they were primarily female (63%), aged 41.3 ± 15.6 years, with a higher education level (87%) and a relatively high crowding index of 1.4 ± 0.6. Most of the sample was in good and very good health (78%) and had no chronic illnesses (67%), and accessed the portal through MyChart once a month or less (76%). The most frequently used features were accessing health records (98%), scheduling appointments (67%), and messaging physicians (56%). According to uMARS completers (n = 200), the objective quality score was 3.8 ± 0.5, and the subjective quality was 3.6 ± 0.7. No significant association was found between overall app usage and the mobile app quality measured via uMARS. Moreover, app use frequency was negatively associated with education, socioeconomic status, and perceived health status. On the other hand, app use was positively related to having chronic conditions, the number of physician visits and subjective app quality. Conclusion: The patient portal usage was not associated with app quality but with some of the participants' demographic factors. The app offers a user-friendly, good-quality interface to patient health records and physicians, appreciated chiefly by users with relatively low socioeconomic status and education. While this is encouraging, more research is needed to capture the usage patterns and perceptions of male patients and those with even lower education and socioeconomic status, to make patient portals more inclusive.

Project description: Background: Conversational agents (CAs) have been developed in outpatient departments to improve physician-patient communication efficiency. As end users, patients' continuance intention is essential for the sustainable development of CAs. Objective: The aim of this study was to facilitate the successful usage of CAs by identifying key factors influencing patients' continuance intention and proposing corresponding managerial implications. Methods: This study proposed an extended expectation-confirmation model and empirically tested the model via a cross-sectional field survey. The questionnaire included demographic characteristics, multiple-item scales, and an optional open-ended question on patients' specific expectations for CAs. Partial least squares structural equation modeling was applied to assess the model and hypotheses. The qualitative data were analyzed via thematic analysis. Results: A total of 172 completed questionnaires were received, with a 100% (172/172) response rate. The proposed model explained 75.5% of the variance in continuance intention. Both satisfaction (β=.68; P<.001) and perceived usefulness (β=.221; P=.004) were significant predictors of continuance intention. Patients' extent of confirmation significantly and positively affected both perceived usefulness (β=.817; P<.001) and satisfaction (β=.61; P<.001). Contrary to expectations, perceived ease of use had no significant impact on perceived usefulness (β=.048; P=.37), satisfaction (β=-.004; P=.63), and continuance intention (β=.026; P=.91). The following three themes were extracted from the 74 answers to the open-ended question: personalized interaction, effective utilization, and clear illustrations. Conclusions: This study identified key factors influencing patients' continuance intention toward CAs. Satisfaction and perceived usefulness were significant predictors of continuance intention (P<.001 and P<.004, respectively) and were significantly affected by patients' extent of confirmation (P<.001 and P<.001, respectively). Developing a better understanding of patients' continuance intention can help administrators figure out how to facilitate the effective implementation of CAs. Efforts should be made toward improving the aspects that patients reasonably expect CAs to have, which include personalized interactions, effective utilization, and clear illustrations.

Project description: Background: Increasingly high amounts of heterogeneous and valuable controlled biomolecular annotations are available, but they are far from exhaustive and scattered across many databases. Several annotation integration and prediction approaches have been proposed, but these issues are still unsolved. We previously created a Genomic and Proteomic Knowledge Base (GPKB) that efficiently integrates many distributed biomolecular annotation and interaction data of several organisms, including 32,956,102 gene annotations, 273,522,470 protein annotations and 277,095 protein-protein interactions (PPIs). Results: By comprehensively leveraging transitive relationships defined by the numerous association data integrated in GPKB, we developed a software procedure that effectively detects and supplements consistent biomolecular annotations not present in the integrated sources. According to some defined logic rules, it does so only when the semantic type of data and of their relationships, as well as the cardinality of the relationships, allow identifying molecular biology compliant annotations. Thanks to controlled consistency and quality enforced on data integrated in GPKB, and to the procedures used to avoid error propagation during their automatic processing, we could reliably identify many annotations, which we integrated in GPKB. They comprise 3,144 gene to pathway and 21,942 gene to biological function annotations of many organisms, and 1,027 candidate associations between 317 genetic disorders and 782 human PPIs. Overall estimated recall and precision of our approach were 90.56% and 96.61%, respectively. Co-functional evaluation of genes with known function showed high functional similarity between genes with newly detected and known annotation to the same pathway; considering also the newly detected gene functional annotations enhanced such functional similarity, which resembled the one existing between genes known to be annotated to the same pathway. Strong evidence was also found in the literature for the candidate associations detected between Cystic fibrosis disorder and the PPIs between the CFTR_HUMAN, DERL1_HUMAN, RNF5_HUMAN, AHSA1_HUMAN and GOPC_HUMAN proteins, and between the CHIP_HUMAN and HSP7C_HUMAN proteins. Conclusions: Although identified gene annotations and PPI-genetic disorder candidate associations require biological validation, our approach intrinsically provides their in silico evidence based on available data. Public availability within the GPKB (http://www.bioinformatics.deib.polimi.it/GPKB/) of all identified and integrated annotations offers a valuable resource fostering new biomedical-molecular knowledge discoveries.

Project description: Despite the increasing knowledge in both the chemical and biological domains, the assimilation and exploration of heterogeneous datasets, encoding information about the chemical, bioactivity and phenotypic properties of compounds, remains a challenge due to the requirement for overlap between chemicals assayed across the spaces. Here, we have constructed a novel dataset, larger than we have used in prior work, comprising 579 acute oral toxic compounds and 1427 non-toxic compounds derived from regulatory GHS information, along with their corresponding molecular and protein target descriptors and qHTS in vitro assay readouts from the Tox21 project. We found no clear association between the results of a FAFDrugs4 toxicophore screen and the acute oral toxicity classifications for our compound set; and a screen using a subset of the ToxAlerts toxicophores was also of limited utility, with only slight enrichment toward the toxic set (odds ratio of 1.48). We then investigated to what degree toxic and non-toxic compounds could be separated in each of the spaces, to compare their potential contribution to further analyses. Using an LDA projection, we found the largest degree of separation using chemical descriptors (Cohen's d of 1.95) and the lowest degree of separation between toxicity classes using qHTS descriptors (Cohen's d of 0.67). To compare the predictivity of the feature spaces for the toxicity endpoint, we next trained Random Forest (RF) acute oral toxicity classifiers on molecular, protein target, or qHTS descriptors. RFs trained on molecular and protein target descriptors were most predictive, with ROC AUC values of 0.80-0.92 and 0.70-0.85, respectively, across three test sets. RFs trained on both chemical and protein target descriptors combined exhibited similar predictive performance to the single-domain models (ROC AUC of 0.80-0.91). Model interpretability was improved by the inclusion of protein target descriptors, which allow the identification of specific targets (e.g. Retinal dehydrogenase) with literature links to toxic modes of action (e.g. oxidative stress). The dataset compiled in this study has been made available for future application.
