Dataset Information

Effect of stemming on text similarity for Arabic language at sentence level.

ABSTRACT: Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Several Arabic light and heavy stemmers, as well as lemmatization algorithms, are used in this study, for a total of 10 algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar-ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayesian (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves enhanced Pearson correlation results. The best results are attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence text similarity for the Arabic language. However, two stemmers make results worse than those for the original text: the Khoja heavy stemmer and the AlKhalil light stemmer.
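
As a rough illustration of the idea in the abstract, the following Python sketch shows how stripping affixes can raise a token-overlap similarity score between two sentences, and how Pearson correlation between predicted and reference scores can be computed. It is not the study's pipeline: the affix lists, `toy_light_stem`, and the choice of Jaccard overlap are hypothetical stand-ins, and the study's stemmers (ARLSTem, Farasa, Tashaphyne and others) use far richer rules.

```python
from math import sqrt

# Hypothetical toy affix lists (an assumption for illustration only);
# real Arabic stemmers apply much more elaborate rules.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def toy_light_stem(token):
    """Strip at most one listed prefix and one listed suffix,
    always leaving at least three letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def jaccard(sentence_a, sentence_b, stem=False):
    """Jaccard overlap of whitespace tokens, optionally after toy stemming."""
    tokens_a, tokens_b = sentence_a.split(), sentence_b.split()
    if stem:
        tokens_a = [toy_light_stem(t) for t in tokens_a]
        tokens_b = [toy_light_stem(t) for t in tokens_b]
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up sentence pair that differs only by the definite article.
    a, b = "الكتاب جديد", "كتاب جديد"
    print(jaccard(a, b), jaccard(a, b, stem=True))
```

In an evaluation along the lines the abstract describes, similarity features such as `jaccard` would be computed on both the original and the stemmed text, fed to a learner, and the predictions compared against the reference scores with `pearson`.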

SUBMITTER: Alhawarat MO

PROVIDER: S-EPMC8156998 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature

Publications

Effect of stemming on text similarity for Arabic language at sentence level.

Alhawarat MO, Abdeljaber H, Hilal A

PeerJ Computer Science, 2021-05-14

PMID: 34084932

Similar Datasets

BTSD: A curated transformation of sentence dataset for text classification in Bangla language. | S-EPMC10415831 | biostudies-literature

The Acceptable Text Similarity Level in Manuscripts Submitted to Scientific Journals. | S-EPMC10412031 | biostudies-literature

The effect of Arabic language type on banking chatbots adoption. | S-EPMC10585328 | biostudies-literature

Application of sentence-level text analysis: The role of emotion in an experimental learning intervention. | S-EPMC8803271 | biostudies-literature

Natural language inference for Malayalam language using language agnostic sentence representation. | S-EPMC8114806 | biostudies-literature

ArASL: Arabic Alphabets Sign Language Dataset. | S-EPMC6661066 | biostudies-literature

Systematic characterizations of text similarity in full text biomedical publications. | S-EPMC2939881 | biostudies-literature

Protocol for a reproducible experimental survey on biomedical sentence similarity. | S-EPMC7990182 | biostudies-literature

Econo-ESA in semantic text similarity. | S-EPMC4003000 | biostudies-literature

Supporting the use of standardized nursing terminologies with automatic subject heading prediction: a comparison of sentence-level text classification methods. | S-EPMC6913232 | biostudies-literature