Dataset Information

Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size.

ABSTRACT: Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf's law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
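
The abstract does not spell out the measure itself. As a rough illustration only, the sketch below assumes a commonly used definition of the generalized entropy of order α, H_α(p) = (Σ p_i^α − 1) / (1 − α), which reduces to Shannon entropy as α approaches 1, together with a Jensen-Shannon-type divergence built from it. The function names, the whitespace-token input, and the choice of natural logarithms are assumptions made for this example and are not taken from the record.

```python
from collections import Counter
import math


def generalized_entropy(probs, alpha):
    """Entropy of order alpha (assumed form); Shannon entropy in nats at alpha == 1."""
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs if p > 0)
    # Zero-probability types are skipped so that alpha == 0 counts observed types only.
    return (sum(p ** alpha for p in probs if p > 0) - 1.0) / (1.0 - alpha)


def word_distribution(tokens, vocabulary):
    """Relative frequency of each vocabulary item in a non-empty token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def generalized_jsd(tokens_a, tokens_b, alpha):
    """Jensen-Shannon-type divergence of order alpha between two token lists."""
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    p = word_distribution(tokens_a, vocabulary)
    q = word_distribution(tokens_b, vocabulary)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return (generalized_entropy(mixture, alpha)
            - 0.5 * generalized_entropy(p, alpha)
            - 0.5 * generalized_entropy(q, alpha))


if __name__ == "__main__":
    text_a = "the cat sat on the mat".split()
    text_b = "the dog sat on the log".split()
    # Small alpha weights rare word types more heavily; large alpha weights frequent ones.
    for alpha in (0.5, 1, 2):
        print(alpha, generalized_jsd(text_a, text_b, alpha))
```

Under this assumed definition, the value returned for two texts depends on how many tokens each contains, which is the sample-size dependence the abstract describes; comparing texts of different lengths with such a function would therefore need care.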

SUBMITTER: Koplenig A

PROVIDER: S-EPMC7514953 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature

Publications

Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size.

Koplenig A, Wolfer S, Müller-Spitzer C

Entropy (Basel, Switzerland), 2019-05-03, issue 5


PMID: 33267178

Similar Datasets

Studying language change using price equation and Polya-urn dynamics.
| S-EPMC3299756 | biostudies-literature

Using language input and lexical processing to predict vocabulary size.
| S-EPMC6324580 | biostudies-literature

Generalized entropies, density of states, and non-extensivity.
| S-EPMC7511985 | biostudies-literature

Generalized entropies and logarithms and their duality relations.
| S-EPMC3511158 | biostudies-other

Vocabulary Size Is a Key Factor in Predicting Second Language Lexical Encoding Accuracy.
| S-EPMC8339215 | biostudies-literature

The Dynamics of Language Network Interactions in Lexical Selection: An Intracranial EEG Study.
| S-EPMC7945024 | biostudies-literature

Power and Sample Size Calculations for Generalized Estimating Equations via Local Asymptotics.
| S-EPMC3903421 | biostudies-literature

Change-Plane Analysis for Subgroup Detection and Sample Size Calculation.
| S-EPMC5553128 | biostudies-other

Sample size calculation in three-level cluster randomized trials using generalized estimating equation models.
| S-EPMC8351402 | biostudies-literature

How does language change as a lexical network? An investigation based on written Chinese word co-occurrence networks.
| S-EPMC5830315 | biostudies-literature