Dataset Information

An interpretable method for automated classification of spoken transcripts and written text.


ABSTRACT: We investigate the differences between spoken language (in the form of radio show transcripts) and written language (Wikipedia articles) in the context of text classification. We present a novel, interpretable method for text classification, involving a linear classifier using a large set of n-gram features, and apply it to a newly generated data set with sentences originating either from spoken transcripts or written text. Our classifier reaches an accuracy less than 0.02 below that of a commonly used classifier (DistilBERT) based on deep neural networks (DNNs). Moreover, our classifier has an integrated measure of confidence, for assessing the reliability of a given classification. An online tool is provided for demonstrating our classifier, particularly its interpretable nature, which is a crucial feature in classification tasks involving high-stakes decision-making. We also study the capability of DistilBERT to carry out fill-in-the-blank tasks in either spoken or written text, and find it to perform similarly in both cases. Our main conclusion is that, with careful improvements, the performance gap between classical methods and DNN-based methods may be reduced significantly, such that the choice of classification method comes down to the need (if any) for interpretability.
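To make the idea of an interpretable linear classifier over n-gram features concrete, here is a minimal sketch. It is not the authors' method: it assumes word unigrams and bigrams as features, a plain perceptron update for training, and a logistic squashing of the absolute score as a stand-in confidence measure. The function names (`ngrams`, `train`, `classify`) and the toy sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def ngrams(sentence, max_n=2):
    """Return all word n-grams (n = 1..max_n) of a lower-cased sentence."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def train(examples, epochs=10):
    """Perceptron training (an assumption, not the paper's procedure).

    examples: list of (sentence, label), label +1 = spoken, -1 = written.
    Returns a dict mapping each n-gram to a single weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for sentence, label in examples:
            feats = ngrams(sentence)
            score = sum(weights[f] for f in feats)
            if label * score <= 0:  # misclassified or undecided: update
                for f in feats:
                    weights[f] += label
    return weights


def classify(weights, sentence):
    """Return (label, confidence, per-n-gram contributions)."""
    contributions = defaultdict(float)
    for f in ngrams(sentence):
        contributions[f] += weights.get(f, 0.0)
    score = sum(contributions.values())
    label = "spoken" if score > 0 else "written"
    # Hypothetical confidence: logistic function of the score magnitude,
    # ranging from 0.5 (undecided) towards 1.0 (large margin).
    confidence = 1.0 / (1.0 + math.exp(-abs(score)))
    return label, confidence, dict(contributions)


# Toy training data, invented for this sketch.
examples = [
    ("well you know i think it was great", 1),
    ("um yeah i mean that is right", 1),
    ("the city is located in the northern region", -1),
    ("the species was first described in 1850", -1),
]
weights = train(examples)

label, confidence, contributions = classify(weights, "yeah i think you know")
print(label, round(confidence, 3))
# Show which n-grams pushed the decision, largest magnitude first.
for feat, w in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feat!r}: {w:+.1f}")
```

Because the score is a plain sum of per-n-gram weights, every classification can be explained by listing the n-grams that contributed most, which is the kind of interpretability the abstract refers to.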

SUBMITTER: Wahde M 

PROVIDER: S-EPMC10157555 | biostudies-literature | 2023 May

REPOSITORIES: biostudies-literature

Publications

An interpretable method for automated classification of spoken transcripts and written text.

Mattias Wahde, Marco L. Della Vedova, Marco Virgolin, Minerva Suvanto

Evolutionary Intelligence, 2023-05-04


Similar Datasets

S-EPMC5069292 | biostudies-literature
S-EPMC6550425 | biostudies-literature
S-EPMC10962864 | biostudies-literature
S-EPMC11867088 | biostudies-literature
S-EPMC9677348 | biostudies-literature
S-EPMC7643388 | biostudies-literature
S-EPMC4766443 | biostudies-literature
S-EPMC10859832 | biostudies-literature
S-EPMC5553725 | biostudies-other
S-EPMC10460947 | biostudies-literature