Dataset Information

Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.

ABSTRACT: It is tempting to treat frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
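The abstract describes comparing decades via a divergence measure and examining which words contribute most to it. The sketch below illustrates that idea under stated assumptions: it uses the Jensen-Shannon divergence (the abstract does not name the measure), the function names are hypothetical, and the word counts are made-up toy values, not data from the corpus.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def jsd_contributions(p, q):
    """Per-word contributions to the Jensen-Shannon divergence (base 2).

    Assumption: JSD stands in for the unnamed "divergence measure".
    Each contribution is non-negative, and they sum to the total JSD.
    """
    contributions = {}
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture probability of this word
        c = 0.0
        if pw > 0:
            c += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            c += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = c
    return contributions


def jsd(p, q):
    """Total Jensen-Shannon divergence between two distributions."""
    return sum(jsd_contributions(p, q).values())


# Toy (invented) counts for two decades; the later one gains citation-like tokens.
early = normalize(Counter({"the": 50, "time": 30, "love": 20}))
late = normalize(Counter({"the": 50, "time": 30, "et": 10, "al": 10}))

total = jsd(early, late)
ranked = sorted(jsd_contributions(early, late).items(),
                key=lambda item: item[1], reverse=True)
print(f"JSD = {total:.3f}")
for word, c in ranked[:3]:
    print(f"{word}: {c:.3f}")
```

Ranking words by their contribution is what makes the comparison interpretable: words with unchanged relative frequency contribute nothing, while words that appear or vanish between decades dominate the total.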

SUBMITTER: Pechenick EA

PROVIDER: S-EPMC4596490 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature


Publications

Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.

Eitan Adam Pechenick, Christopher M. Danforth, Peter Sheridan Dodds

PLoS ONE, 2015-10-07, 10


PMID: 26445406

Similar Datasets

Project description: The current study systematically reviewed selected literature on the background, current conceptualization, and direction of linguistic and cultural imperialism in applied linguistics and language teaching publications, in order to determine themes in the field. Based on the inclusion/exclusion criteria given in the PRISMA chart, the 30 most recent articles (mainly since 2020) were selected from the 5 main publications in the field using advanced search engines. Two raters then used coding books to screen and code the necessary quantitative and qualitative data, from which a total of 989 general coding schemes and categories were elicited covering the main themes, trends, and findings on linguistic and cultural imperialism. Overall, the main themes were the concepts and perspectives of linguistic and cultural imperialism, informed by historical directions and the influence of the colonial era. The role of power relations and prevailing linguistic dominance in supporting dominant languages, and the influence of linguistic and cultural imperialism on L1 acquisition, were also presented and discussed. Since language imperialism can affect L1 acquisition by marginalizing and threatening local languages, each community needs to follow its own practical language policy and plans to revitalize and support its languages and cultures. It was suggested that the intersection of linguistic and cultural imperialism affects social and language identity, which can lead to neo-imperialism, colonization, and language hierarchization. The study puts forward recommendations and suggests future directions for reinforcing language rights across different parties, with the integration of a human rights perspective into language preservation efforts as the main action for improving people's language awareness. Policy-makers and language decision-makers can follow these guidelines to preserve the legal aspects of language and cultural identity and to use foreign languages in more rational and non-threatening ways.

Project description: Exchanging gazes with a social partner in response to an event in the environment is considered an effective means to direct attention, share affective experiences, and highlight a target in the environment. This behavior appears during infancy and plays an important role in children's learning and in shaping their socio-emotional development. It has been suggested that cultural values of the community affect socio-emotional development through attentional dynamics of social reference (Rogoff et al., 1993). Maturational processes of brain circuits have been found to mediate socio-cultural learning and the behavioral manifestation of cultural norms starting at preschool age (Nelson and Guyer, 2011). The aim of the current study was to investigate the relations between cultural ecology levels and children's joint attention (JA). Initiation of JA bids was studied empirically as a function of the level of social load of the target toy (3 levels), the community level of adherence to traditional values (3 levels), parental education (2 levels), and gender. Sixty-two kindergarten-aged children were enrolled in a structured toy-exploration task, during which they were presented with toys of various social loads, with social agents (i.e., mother and experimenter) present nearby, and non-social distracters presented intermittently. Measurements included the child's number of JA bids and the extent of positive affect. Analysis of variance indicated that the child's initiation of JA toward the social partner was affected by all levels of cultural ecology (i.e., toy's social load, adherence to traditional values, parental education, gender), thus supporting the study's hypotheses. Overall, JA initiation by children, particularly girls, was augmented for social toys and moderated by the socio-cultural variables. These results suggest that cultural ecology is related to children's JA, thereby scaffolding initiation of social sharing cues between children and adults. JA plays a role in adjusting children's internal representations of their respective ecological environment.

