Dataset Information

Discovering semantic features in the literature: a foundation for building functional associations.

ABSTRACT:

BACKGROUND: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

RESULTS: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.

CONCLUSION: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
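
The factorization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard multiplicative-update rules for squared-error NMF (the abstract does not say which NMF variant was used), and the term-by-gene count matrix, the term names, and the gene labels are invented toy values. Each column of `h` plays the role of a gene's literature profile over two "semantic features", and cosine similarity between columns stands in for the pair-wise gene similarity.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative v (terms x genes) as w @ h, with
    w (terms x rank) and h (rank x genes), using multiplicative updates
    that keep every entry non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy data: rows are terms, columns are genes g1..g4.
terms = ["kinase", "phosphorylation", "ribosome", "translation"]
genes = ["g1", "g2", "g3", "g4"]
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 3],
    [0, 0, 3, 4],
]

w, h = nmf(v, rank=2)

# Column j of h is the profile of gene j over the two learned features.
profiles = transpose(h)
sim_g1_g2 = cosine(profiles[0], profiles[1])
sim_g1_g3 = cosine(profiles[0], profiles[2])
print("similarity g1-g2:", round(sim_g1_g2, 3))
print("similarity g1-g3:", round(sim_g1_g3, 3))
```

In this toy setting, genes whose documents share vocabulary should end up with similar profiles, and the columns of `w` show which terms define each feature. A real application would need a weighted term matrix built from a document corpus and a principled choice of rank, neither of which is specified here.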

SUBMITTER: Chagoyen M

PROVIDER: S-EPMC1386711 | biostudies-literature | 2006

REPOSITORIES: biostudies-literature

Publications

Discovering semantic features in the literature: a foundation for building functional associations.

Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A

BMC Bioinformatics, 2006-01-26

PMID: 16438716

Similar Datasets

Project description: BACKGROUND: Elucidation of the direct/indirect protein interactions and gene associations is required to fully understand the workings of the cell. This can be achieved through the use of both low- and high-throughput biological experiments and in silico methods. We present GAP (Gene functional Association Predictor), an integrative method for predicting and characterizing gene functional associations. GAP integrates different biological features using a novel taxonomy-based semantic similarity measure in predicting and prioritizing high-quality putative gene associations. The proposed similarity measure increases information gain from the available gene annotations. The annotation information is incorporated from several public pathway databases, Gene Ontology annotations as well as drug and disease associations from the scientific literature.

RESULTS: We evaluated GAP by comparing its prediction performance with several other well-known functional interaction prediction tools over a comprehensive dataset of known direct and indirect interactions, and observed significantly better prediction performance. We also selected a small set of GAP's highly-scored novel predicted pairs (i.e., currently not found in any known database or dataset), and by manually searching the literature for experimental evidence accessible in the public domain, we confirmed different categories of predicted functional associations with available evidence of interaction. We also provided extra supporting evidence for subset of the predicted functionally-associated pairs using an expert curated database of genes associated to autism spectrum disorders.

CONCLUSIONS: GAP's predicted "functional interactome" contains ~1M highly-scored predicted functional associations out of which about 90% are novel (i.e., not experimentally validated). GAP's novel predictions connect disconnected components and singletons to the main connected component of the known interactome. It can, therefore, be a valuable resource for biologists by providing corroborating evidence for and facilitating the prioritization of potential direct or indirect interactions for experimental validation. GAP is freely accessible through a web portal: http://ophid.utoronto.ca/gap.

Project description: PURPOSE: There is a lack of agreement regarding the types of lesions and clinical conditions that should be included in the term "geographic atrophy." Varied and conflicting views prevail throughout the literature and are currently used by retinal experts and other health care professionals.

METHODS: We reviewed the nominal definition of the term "geographic atrophy" and conducted a search of the ophthalmologic literature focusing on preceding terminologies and the first citations of the term "geographic atrophy" secondary to age-related macular degeneration.

RESULTS: According to the nominal definition, the term "geography" stands for a detailed description of the surface features of a specific region, indicating its relative position. However, it does not necessarily imply that the borders of the region must be sharply demarcated or related to any anatomical structures. The term "geographical areas of atrophy" was initially cited in the 1960s in the ophthalmologic literature in the context of uveitic eye disease and shortly thereafter also for the description of variants of "senile macular degeneration." However, no direct explanation could be found in the literature as to why the terms "geographical" and "geographic" were chosen. Presumably the terms were used as the atrophic regions resembled the map of a continent or well-defined country borders on thematic geographical maps. With the evolution of the terminology, the commonly used adjunct "of the retinal pigment epithelium" was frequently omitted and solely the term "geographic atrophy" prevailed for the nonexudative late-stage of age-related macular degeneration itself. Along with the quantification of atrophic areas, based on different imaging modalities and the use of both manual and semiautomated approaches, various and inconsistent definitions for the minimal lesion diameter or size of atrophic lesions have also emerged.

CONCLUSION: Reconsideration of the application of the term "geographic atrophy" in the context of age-related macular degeneration seems to be prudent given ongoing advances in multimodal retinal imaging technology with identification of various phenotypic characteristics, and the observation of atrophy development in eyes under antiangiogenic therapy.
