Dataset Information

SUBTLEX-CH: Chinese word and character frequencies based on film subtitles.

ABSTRACT:

Background

Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, there are only a few sources of word frequency measures available to researchers, and the quality is less than what researchers in other languages are used to.

Methodology

Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

Conclusions

Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
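As an illustration of the measures named in the abstract, the sketch below computes raw counts, frequency per million, and contextual diversity (the number of films in which a word occurs) from a toy corpus. It is a minimal sketch, not the authors' pipeline: it assumes the subtitles have already been segmented into words, and the function name `frequency_table`, the dictionary keys, and the toy data are all hypothetical.

```python
from collections import Counter
import math


def frequency_table(documents):
    """Build simple frequency measures from pre-segmented documents.

    documents: list of token lists, one list per film (hypothetical input
    format; word segmentation is assumed to have been done beforehand).
    """
    counts = Counter()      # total occurrences of each word
    doc_counts = Counter()  # number of documents containing each word
    for tokens in documents:
        counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per document

    total = sum(counts.values())
    table = {}
    for word, n in counts.items():
        table[word] = {
            "count": n,
            "per_million": n * 1_000_000 / total,
            "log10_count": math.log10(n),
            "cd": doc_counts[word],  # contextual diversity
            "cd_percent": 100 * doc_counts[word] / len(documents),
        }
    return table


# Toy data: three "films", each a list of already-segmented words.
toy_corpus = [
    ["我", "爱", "你"],
    ["你", "好"],
    ["我", "知道", "你", "知道"],
]
table = frequency_table(toy_corpus)
```

In this toy corpus, 我 and 知道 have the same raw count but differ in contextual diversity, because 知道 occurs twice within a single film. This is the kind of distinction a contextual diversity measure is meant to capture.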

SUBMITTER: Cai Q

PROVIDER: S-EPMC2880003 | biostudies-literature | 2010 Jun

REPOSITORIES: biostudies-literature

Publications

SUBTLEX-CH: Chinese word and character frequencies based on film subtitles.

Cai Q, Brysbaert M

PLoS ONE, 2010-06-02, issue 6

PMID: 20532192

Similar Datasets

The role of character positional frequency on Chinese word learning during natural reading.
| S-EPMC5685568 | biostudies-literature

WSE, a new sequence distance measure based on word frequencies.
| S-EPMC7185439 | biostudies-literature

Readers extract character frequency information from nonfixated-target word at long pretarget fixations during Chinese reading.
| S-EPMC4767270 | biostudies-literature

A further look at ageing and word predictability effects in Chinese reading: Evidence from one-character words.
| S-EPMC7745612 | biostudies-literature

Record Linkage For Character-Based Surnames: Evidence from Chinese Exclusion.
| S-EPMC9854273 | biostudies-literature

Fast alignment-free sequence comparison using spaced-word frequencies.
| S-EPMC4080745 | biostudies-literature

Semantic ambiguity effects on traditional Chinese character naming: A corpus-based approach.
| S-EPMC6267517 | biostudies-literature

Polymers based on thieno[3,4-c]pyrrole-4,6-dione and pyromellitic diimide by CH-CH arylation reaction for high-performance thin-film transistors.
| S-EPMC9623454 | biostudies-literature

How humans transmit language: horizontal transmission matches word frequencies among peers on Twitter.
| S-EPMC5832726 | biostudies-literature