Dataset Information

An interpretable method for automated classification of spoken transcripts and written text.

ABSTRACT: We investigate the differences between spoken language (in the form of radio show transcripts) and written language (Wikipedia articles) in the context of text classification. We present a novel, interpretable method for text classification, involving a linear classifier using a large set of n-gram features, and apply it to a newly generated data set with sentences originating either from spoken transcripts or written text. Our classifier reaches an accuracy less than 0.02 below that of a commonly used classifier (DistilBERT) based on deep neural networks (DNNs). Moreover, our classifier has an integrated measure of confidence, for assessing the reliability of a given classification. An online tool is provided for demonstrating our classifier, particularly its interpretable nature, which is a crucial feature in classification tasks involving high-stakes decision-making. We also study the capability of DistilBERT to carry out fill-in-the-blank tasks in either spoken or written text, and find it to perform similarly in both cases. Our main conclusion is that, with careful improvements, the performance gap between classical methods and DNN-based methods may be reduced significantly, such that the choice of classification method comes down to the need (if any) for interpretability.

SUBMITTER: Wahde M

PROVIDER: S-EPMC10157555 | biostudies-literature | 2023 May

REPOSITORIES: biostudies-literature


Publications

An interpretable method for automated classification of spoken transcripts and written text.

Wahde M, Della Vedova ML, Virgolin M, Suvanto M

Evolutionary Intelligence, 2023-05-04

PMID: 37360587

Similar Datasets

Project description: Language performance requires support from central cognitive/linguistic abilities as well as the more peripheral sensorimotor skills to plan and implement spoken and written communication. Both output modalities are vulnerable to impairment following damage to the language-dominant hemisphere, but much of the research to date has focused exclusively on spoken language. In this study we aimed to examine an integrated model of language processing that includes the common cognitive processes that support spoken and written language, as well as modality-specific skills. To do so, we evaluated spoken and written language performance from 87 individuals with acquired language impairment resulting from damage to left perisylvian cortical regions that collectively constitute the dorsal language pathway. Comprehensive behavioral assessment served to characterize the status of central and peripheral components of language processing in relation to neurotypical controls (n = 38). Performance data entered into principal components analyses (with or without control scores) consistently yielded a strong five-factor solution. In line with a primary systems framework, three central cognitive factors emerged: semantics, phonology, and orthography that were distinguished from peripheral processes supporting speech production and allographic skill for handwriting. The central phonology construct reflected performance on phonological awareness and manipulation tasks and showed the greatest deficit of all the derived factors. Importantly, this phonological construct was orthogonal to the speech production factor that reflected repetition of words/non-words. When entered into regression analyses, semantics and phonological skill were common predictors of language performance across spoken and written modalities.
The speech production factor was also a strong, distinct predictor of spoken naming and oral reading, in contrast to allographic skills which only predicted written output. As expected, visual orthographic processing contributed more to written than spoken language tasks and reading/spelling performance was strongly reliant on phonological and semantic abilities. Despite the heterogeneity of this cohort regarding aphasia type and severity, the marked impairment of phonological skill was a unifying feature. These findings prompt greater attention to clinical assessment and potential treatment of underlying phonological skill in individuals with left perisylvian damage.
