Dataset Information

PSE: a tool for browsing a large amount of MEDLINE/PubMed abstracts with gene names and common words as the keywords.

ABSTRACT:

Background

MEDLINE/PubMed (hereinafter called PubMed) is one of the most important literature databases for the biological and medical sciences, but it is impossible to read all related records due to the sheer size of the repository. We usually have to enter keywords repeatedly, in a trial-and-error manner, to extract useful records. Software that reduces this laborious task is therefore required.

Results

We developed a web-based tool, the PubMed Sentence Extractor (PSE), which parses a large number of PubMed abstracts, extracts and displays the sentences in which gene names co-occur with other keywords, and shows some information from EntrezGene records. The results link to whole abstracts and to other resources such as Online Mendelian Inheritance in Man and Reference Sequence. PSE evaluates the presence of keywords at the sentence level, whereas a standard PubMed search operates at the record level. The relationship between two keywords, a gene name and a common word, is therefore captured more accurately by PSE than by PubMed. In addition, PSE shows the list of keywords and takes the synonyms and variations of gene names into account. Through these functions, PSE can reduce the effort of searching through records for gene information.
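The difference between sentence-level and record-level matching can be illustrated with a minimal Python sketch. This is not PSE's implementation: the synonym table, the function names, and the naive sentence splitter are all assumptions made for illustration only.

```python
import re

# Hypothetical synonym table for illustration; PSE's actual gene-name
# dictionary is not described in this record.
GENE_SYNONYMS = {"TP53": ["TP53", "p53"]}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def contains_term(text, term):
    """Case-insensitive whole-word match of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def mentions_gene(text, gene):
    """True if the text mentions the gene or any of its listed synonyms."""
    names = GENE_SYNONYMS.get(gene, [gene])
    return any(contains_term(text, name) for name in names)


def record_level_hit(abstract, gene, keyword):
    """Record-level test: both terms appear anywhere in the abstract."""
    return mentions_gene(abstract, gene) and contains_term(abstract, keyword)


def sentence_level_hits(abstract, gene, keyword):
    """Sentence-level test: return sentences containing both terms."""
    return [
        sentence
        for sentence in split_sentences(abstract)
        if mentions_gene(sentence, gene) and contains_term(sentence, keyword)
    ]
```

For an abstract such as "BRCA1 was sequenced in all patients. Apoptosis was not assessed.", `record_level_hit(abstract, "BRCA1", "apoptosis")` is true because both terms occur somewhere in the record, while `sentence_level_hits` returns an empty list because no single sentence contains both. That gap is the kind of loosely related record a sentence-level filter would drop.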

Conclusion

We developed PSE to extract useful records efficiently from PubMed. The system has four advantages over a simple PubMed search: a reduction in the amount of literature collected, the display of keyword lists, the consideration of synonyms and variations of gene names, and links to external databases. We believe PSE is helpful for efficiently collecting the literature needed to find research targets. PSE is freely available under the GPL licence as additional files to this manuscript.

SUBMITTER: Yoneya T

PROVIDER: S-EPMC1326231 | biostudies-literature | 2005 Dec

REPOSITORIES: biostudies-literature

Publications

PSE: a tool for browsing a large amount of MEDLINE/PubMed abstracts with gene names and common words as the keywords.

Yoneya Takashi T

BMC Bioinformatics, 2005-12-10

PMID: 16336692

Similar Datasets

Project description: Objective: To investigate whether language used in science abstracts can skew towards the use of strikingly positive and negative words over time. Design: Retrospective analysis of all scientific abstracts in PubMed between 1974 and 2014. Methods: The yearly frequencies of positive, negative, and neutral words (25 preselected words in each category), plus 100 randomly selected words were normalised for the total number of abstracts. Subanalyses included pattern quantification of individual words, specificity for selected high impact journals, and comparison between author affiliations within or outside countries with English as the official majority language. Frequency patterns were compared with 4% of all books ever printed and digitised by use of Google Books Ngram Viewer. Main outcome measures: Frequencies of positive and negative words in abstracts compared with frequencies of words with a neutral and random connotation, expressed as relative change since 1980. Results: The absolute frequency of positive words increased from 2.0% (1974-80) to 17.5% (2014), a relative increase of 880% over four decades. All 25 individual positive words contributed to the increase, particularly the words "robust," "novel," "innovative," and "unprecedented," which increased in relative frequency up to 15,000%. Comparable but less pronounced results were obtained when restricting the analysis to selected journals with high impact factors. Authors affiliated to an institute in a non-English speaking country used significantly more positive words. Negative word frequencies increased from 1.3% (1974-80) to 3.2% (2014), a relative increase of 257%. Over the same time period, no apparent increase was found in neutral or random word use, or in the frequency of positive word use in published books. Conclusions: Our lexicographic analysis indicates that scientific abstracts are currently written with more positive and negative words, and provides an insight into the evolution of scientific writing. Apparently scientists look on the bright side of research results. But whether this perception fits reality should be questioned.

Project description: BACKGROUND: Time delays are important factors that are often neglected in gene regulatory network (GRN) inference models. Validating time delays from knowledge bases is a challenge since the vast majority of biological databases do not record temporal information on gene regulation. Biological knowledge and facts on gene regulation are typically extracted from the biological literature with specialized methods that depend on the regulation task. In this paper, we mine evidence for time delays related to the transcriptional regulation of yeast from PubMed abstracts. RESULTS: Since the vast majority of abstracts lack quantitative time information, we can only collect qualitative evidence of time delays. Specifically, a speed-up or delay in the transcriptional regulation rate can provide evidence for time delays (shorter or longer) in a GRN. Thus, we focus on deriving events related to rate changes in transcriptional regulation. A corpus of abstracts related to yeast regulation was manually labeled with such events. In order to capture these events automatically, we create an ontology of sub-processes that are likely to result in transcription rate changes by combining textual patterns and biological knowledge. We also propose effective feature extraction methods based on the created ontology to identify direct evidence with specific details of these events. Our ontologies outperform existing state-of-the-art gene regulation ontologies in the automatic rule learning method applied to our corpus. The proposed deterministic ontology rule-based method can achieve comparable performance to the automatic rule learning method based on decision trees. This demonstrates the effectiveness of our ontology in identifying rate-changing events. We also tested the effectiveness of the proposed feature mining methods on detecting direct evidence of events. Experimental results show that the machine learning method on these features achieves an F1-score of 71.43%. CONCLUSIONS: The manually labeled corpus of events relating to rate changes in transcriptional regulation for yeast is available at https://sites.google.com/site/wentingntu/data. The created ontologies summarize both the biological causes of rate changes in transcriptional regulation and the corresponding positive and negative textual patterns from the corpus. They are demonstrated to be effective in identifying rate-changing events, which shows the benefits of combining textual patterns and biological knowledge when extracting complex biological events.

Project description: Background: Today, there are more than 18 million articles related to biomedical research indexed in MEDLINE, and information derived from them could be used effectively to save the great amount of time and resources spent by government agencies in understanding the scientific landscape, including key opinion leaders and centers of excellence. Associating biomedical articles with organization names could significantly benefit the pharmaceutical marketing industry, health care funding agencies and public health officials, and be useful for other scientists in normalizing author names, automatically creating citations, indexing articles and identifying potential resources or collaborators. A large amount of extracted information helps in disambiguating organization names using machine-learning algorithms. Results: We propose NEMO, a system for extracting organization names in the affiliation and normalizing them to a canonical organization name. Our parsing process involves multi-layered rule matching with multiple dictionaries. The system achieves more than 98% f-score in extracting organization names. Our normalization process involves clustering based on local sequence alignment metrics and local learning based on finding connected components. A high precision was also observed in normalization. Conclusion: NEMO is the missing link in associating each biomedical paper and its authors to an organization name in its canonical form and the geopolitical location of the organization. This research could potentially help in analyzing large social networks of organizations for landscaping a particular topic, improving performance of author disambiguation, adding weak links in the co-author network of authors, augmenting NLM's MARS system for correcting errors in OCR output of the affiliation field, and automatically indexing the PubMed citations with the normalized organization name and country. Our system is provided as a graphical user interface, available for download along with this paper.
