Dataset Information

Heaps' Law and Heaps functions in tagged texts: evidences of their linguistic relevance.

ABSTRACT: We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags,' namely, nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
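
The two quantities the abstract compares, the appearance of new words along a text and the average of that curve over random shufflings, can be illustrated with a short sketch. This is not the authors' code. It assumes the text is already tokenized into a list of words, it ignores grammatical tags, and the function names (`heaps_function`, `shuffled_average`, `retardation`) are hypothetical names chosen for illustration.

```python
import random


def heaps_function(tokens):
    """Return V(n) for n = 1..N: distinct tokens seen in the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def shuffled_average(tokens, n_shuffles=100, seed=0):
    """Average the vocabulary-growth curve over random shufflings of the text."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    work = list(tokens)
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_function(work)):
            totals[i] += v
    return [t / n_shuffles for t in totals]


def retardation(tokens, n_shuffles=100, seed=0):
    """Shuffled average minus the actual curve, position by position.

    Positive values mean new words appear later in the real text than
    they do, on average, in its shuffled versions.
    """
    actual = heaps_function(tokens)
    average = shuffled_average(tokens, n_shuffles, seed)
    return [a - v for a, v in zip(average, actual)]


if __name__ == "__main__":
    # Toy text in which every repetition comes before the other new words.
    toy = "a a a a b c d".split()
    print(heaps_function(toy))
    print(retardation(toy))
```

Both curves necessarily end at the same value, the total vocabulary size, so any deviation shows up only along the text. A per-tag version would apply the same counting to the words of one grammatical class at a time.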

SUBMITTER: Chacoma A

PROVIDER: S-EPMC7137977 | biostudies-literature | 2020 Mar

REPOSITORIES: biostudies-literature

Publications

Heaps' Law and Heaps functions in tagged texts: evidences of their linguistic relevance.

Chacoma A, Zanette DH

Royal Society Open Science, 2020 Mar 18, issue 3

PMID: 32269820

Similar Datasets

Project description: Many neurocognitive studies on the role of motor structures in action-language processing have implicitly adopted a "dictionary-like" framework within which lexical meaning is constructed on the basis of an invariant set of semantic features. The debate has thus been centered on the question of whether motor activation is an integral part of the lexical semantics (embodied theories) or the result of a post-lexical construction of a situation model (disembodied theories). However, research in psycholinguistics shows that lexical semantic processing and context-dependent meaning construction are closely integrated. An understanding of the role of motor structures in action-language processing might thus be better achieved by focusing on the linguistic contexts under which such structures are recruited. Here, we therefore analyzed online modulations of grip force while subjects listened to target words embedded in different linguistic contexts. When the target word was a hand action verb and when the sentence focused on that action (John signs the contract), an early increase of grip force was observed. No comparable increase was detected when the same word occurred in a context that shifted the focus toward the agent's mental state (John wants to sign the contract). The mere presence of an action word is thus not sufficient to trigger motor activation. Moreover, when the linguistic context set up a strong expectation for a hand action, a grip force increase was observed even when the tested word was a pseudo-verb. The presence of a known action word is thus not required to trigger motor activation. Importantly, however, the same linguistic contexts that sufficed to trigger motor activation with pseudo-verbs failed to trigger motor activation when the target words were verbs with no motor action reference. Context is thus not by itself sufficient to supersede an "incompatible" word meaning. We argue that motor structure activation is part of a dynamic process that integrates the lexical meaning potential of a term and the context in the online construction of a situation model, which is a crucial process for fluent and efficient online language comprehension.

Project description: Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess the quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when combined with NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources. The textual content of the ShARe/CLEF (https://sites.google.com/site/shareclefehealth/data) and i2b2 (https://i2b2.org/NLP/DataSets/) corpora must be requested from the individual corpus providers.
