
Dataset Information


Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems.


ABSTRACT: Diacritics are often omitted in Arabic script. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue, but such automation needs resources, namely diacritized texts, to train and evaluate the systems. In this paper, we describe our corpus of Arabic diacritized texts, called Tashkeela. It can be used as a linguistic resource for natural language processing tasks such as automatic diacritization, disambiguation, and feature and data extraction. The corpus is freely available; it contains 75 million fully vocalized words, drawn mainly from 97 books in classical and modern Arabic. The corpus was collected from manually vocalized texts using a web crawling process.
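
One common way a vocalized corpus like this is used for training and evaluation is to strip the diacritics from each line to get the system input, and keep the original line as the target. The sketch below illustrates that idea only; it is not the authors' tooling. The function names are hypothetical, and it assumes the diacritics of interest are the Unicode combining marks U+064B (fathatan) through U+0652 (sukun), which may not cover every mark used in the corpus.

```python
import re
from typing import Tuple

# Assumption: the marks to strip are the Arabic harakat in the Unicode
# range U+064B (fathatan) .. U+0652 (sukun). Other marks (for example
# U+0670, superscript alef) are left untouched in this sketch.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the text with the harakat in the assumed range removed."""
    return DIACRITICS.sub("", text)


def make_training_pair(vocalized: str) -> Tuple[str, str]:
    """Build an (input, target) pair: undiacritized text and the original."""
    return strip_diacritics(vocalized), vocalized


def count_diacritics(text: str) -> int:
    """Count how many characters in the text fall in the assumed range."""
    return len(DIACRITICS.findall(text))


if __name__ == "__main__":
    # "kataba" written with a fatha on each of its three letters.
    word = "\u0643\u064E\u062A\u064E\u0628\u064E"
    source, target = make_training_pair(word)
    print(source, target, count_diacritics(word))
```

A diacritization system would then be trained to map the first element of each pair back to the second, and evaluated by comparing its output with the original vocalized text.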

SUBMITTER: Zerrouki T 

PROVIDER: S-EPMC5310197 | biostudies-literature | 2017 Apr

REPOSITORIES: biostudies-literature


Publications

Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems.

Zerrouki Taha, Balla Amar

Data in Brief, 2017-02-03



Similar Datasets

S-EPMC6305806 | biostudies-literature
S-EPMC6010264 | biostudies-literature
S-EPMC9046013 | biostudies-literature
S-EPMC2586758 | biostudies-literature
S-EPMC10912174 | biostudies-literature
S-EPMC9486029 | biostudies-literature
S-EPMC4587949 | biostudies-literature
S-EPMC7407276 | biostudies-literature
S-EPMC8767308 | biostudies-literature
S-EPMC4294294 | biostudies-literature