Dataset Information

ASM Based Synthesis of Handwritten Arabic Text Pages.

ABSTRACT: Document analysis tasks such as text recognition, word spotting, and segmentation depend heavily on comprehensive, suitable databases for training and validation. However, generating such databases is expensive in terms of labor and time. As a result, such databases are scarce, which complicates research and development. This is especially true for Arabic handwriting recognition, which involves different preprocessing, segmentation, and recognition methods, each with its own demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents with detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis, and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step, ASM-based representations are composed into words and text pages, smoothed by B-spline interpolation, and rendered considering writing speed and pen characteristics. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages training and testing document analysis methods on synthetic samples whenever sufficient natural ground-truthed data is not available.
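The abstract does not spell out how a character is drawn from an Active Shape Model. As a rough illustration only, the sketch below uses the textbook ASM formulation, in which a new shape is the mean shape plus a weighted sum of variation modes, with each weight limited to three standard deviations of its mode. The toy model values, the function name `synthesize_shape`, and the `limit` parameter are assumptions made for this example and are not taken from the paper or the IESK-arDB database.

```python
import math
import random


def synthesize_shape(mean_shape, modes, eigenvalues, rng, limit=3.0):
    """Draw one shape x = mean + sum_k b_k * p_k.

    Each weight b_k is sampled from N(0, lambda_k) and clamped to
    +/- limit * sqrt(lambda_k), the usual ASM plausibility bound.
    Shapes are flat lists [x0, y0, x1, y1, ...].
    """
    shape = list(mean_shape)
    for mode, lam in zip(modes, eigenvalues):
        sigma = math.sqrt(lam)
        b = rng.gauss(0.0, sigma)
        b = max(-limit * sigma, min(limit * sigma, b))
        for i, component in enumerate(mode):
            shape[i] += b * component
    return shape


# Hypothetical toy model: a three-point stroke with two variation modes.
MEAN_SHAPE = [0.0, 0.0, 1.0, 1.0, 2.0, 0.0]
MODES = [
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # mode 1 moves the middle point vertically
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # mode 2 moves the last point horizontally
]
EIGENVALUES = [0.04, 0.01]  # variance explained by each mode

sample = synthesize_shape(MEAN_SHAPE, MODES, EIGENVALUES, random.Random(0))
```

Each call with a different random generator yields a different but plausible variant of the same stroke. In a full pipeline such as the one the abstract describes, the sampled points would still need to be smoothed and rendered, which this sketch leaves out.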

SUBMITTER: Dinges L

PROVIDER: S-EPMC4534626 | biostudies-other | 2015

REPOSITORIES: biostudies-other

Similar Datasets

Syntactic- and morphology-based text augmentation framework for Arabic sentiment analysis.

Project description:Arabic is a challenging language for automatic processing. This is due to several intrinsic reasons, such as its many dialects, ambiguous syntax, syntactical flexibility, and diacritics. Machine learning and deep learning frameworks require big datasets for training to ensure accurate predictions. This leads to another challenge faced by researchers using Arabic text, as high-quality Arabic textual datasets are still scarce. In this paper, an intelligent framework for expanding or augmenting Arabic sentences is presented. The sentences were initially labelled by human annotators for sentiment analysis. The novel approach presented in this work relies on the rich morphology of Arabic, synonymy lists, syntactical or grammatical rules, and negation rules to generate new sentences from the seed sentences with their proper labels. Most augmentation techniques target image or video data. This study is the first work to target text augmentation for the Arabic language. Using this framework, we were able to increase the size of the initial seed datasets tenfold. Experiments that assess the impact of this augmentation on sentiment analysis showed a 42% average increase in accuracy, due to the reliability and the high quality of the rules used to build this framework.

| S-EPMC8049132 | biostudies-literature

Generative adversarial network based adaptive data augmentation for handwritten Arabic text recognition.

Project description:Training deep learning based handwritten text recognition systems requires a lot of data in the form of text images and their corresponding annotations. One way to deal with this issue is to use data augmentation techniques to increase the amount of training data. Generative Adversarial Network (GAN) based data augmentation techniques are popular in the literature, especially in tasks related to images. However, specific challenges need to be addressed in order to use GANs effectively for data augmentation in the domain of text recognition. Text data is inherently imbalanced in terms of the frequency of different characters appearing in training samples and in the training data as a whole. GANs trained on an imbalanced dataset lead to augmented data that does not represent the minority characters well. In this paper, we present an adaptive data augmentation technique using GANs that deals with the issue of class imbalance arising in text recognition problems. We show, using experimental evaluations on two publicly available datasets for handwritten Arabic text recognition, that GANs trained using the presented technique are effective in dealing with the class imbalance problem by generating augmented data that is balanced in terms of character frequencies. The resulting text recognition systems trained on the balanced augmented data improve text recognition accuracy compared to systems trained using standard techniques.

| S-EPMC8802770 | biostudies-literature

Effect of stemming on text similarity for Arabic language at sentence level.

Project description:Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at sentence level. Several Arabic light and heavy stemmers as well as lemmatization algorithms are used in this study, for a total of 10 algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar-ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayesian (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves better Pearson correlation results. The best results were attained when using the Arabic light stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence text similarity for the Arabic language. However, some stemmers make results worse than those for the original text, namely the Khoja heavy stemmer and the AlKhalil light stemmer.

| S-EPMC8156998 | biostudies-literature

Algorithm based on normal coordinate vectors with 16 segments for the data fusion from hand-written Arabic text implemented with MATLAB.

Project description:Hand-written text recognition is useful for interpreting records in different fields such as healthcare, surgery and police, in which professionals may avoid technical equipment and prefer writing notes on paper. In order to perform data fusion from different data sources, automatic handwriting recognition must overcome barriers such as different ways of writing letters and deformation due to many reasons. This work presents a novel handwriting recognition approach based on the application of coordinate vectors to find similarities in different kinds of deformations. In particular, it has been implemented using 16 segments in order to distinguish all the particularities in matching the new text against a dataset with a machine-learning approach. The implementation of this approach with MATLAB shows promising results, with an accuracy of 92.8% with ensemble and bagged trees, after analyzing 22 possible combinations of machine learning and processing techniques.

| S-EPMC8444069 | biostudies-literature

TEXT-Z

Project description:TEXT-Z

| PRJEB36013 | ENA

A novel approach to secure communication in mega events through Arabic text steganography utilizing invisible Unicode characters.

Project description:Mega events attract mega crowds, and many data exchange transactions are involved among organizers, stakeholders, and individuals, which increases the risk of covert eavesdropping. Data hiding is essential for safeguarding the security, confidentiality, and integrity of information during mega events. It plays a vital role in reducing cyber risks and ensuring the seamless execution of these extensive gatherings. In this paper, a steganographic approach suitable for mega events communication is proposed. The proposed method utilizes the characteristics of Arabic letters and invisible Unicode characters to hide secret data, where each Arabic letter can hide two secret bits. The secret messages hidden using the proposed technique can be exchanged via emails, text messages, and social media, as these are the main communication channels in mega events. The proposed technique demonstrated notable performance with a high capacity ratio averaging 178% and a perfect imperceptibility ratio of 100%, outperforming most of the previous work. In addition, its security performance is comparable to previous approaches, with an average ratio of 72%. Furthermore, it is more robust than all related work, withstanding 70% of the possible attacks.

| S-EPMC11419608 | biostudies-literature

Graph-based extractive text summarization method for Hausa text.

Project description:Automatic text summarization is one of the most promising solutions to the ever-growing challenges of textual data as it produces a shorter version of the original document with fewer bytes, but the same information as the original document. Despite the advancements in automatic text summarization research, the development of automatic text summarization methods for documents written in Hausa, a Chadic language widely spoken in West Africa by approximately 150,000,000 people as either their first or second language, is still in its early stages. This study proposes a novel graph-based extractive single-document summarization method for Hausa text by modifying the existing PageRank algorithm using the normalized common bigrams count between adjacent sentences as the initial vertex score. The proposed method is evaluated with the ROUGE evaluation toolkit on a newly collected Hausa summarization evaluation dataset comprising 113 Hausa news articles. The proposed approach outperformed the standard methods on the same dataset. It outperformed the TextRank method by 2.1%, LexRank by 12.3%, the centroid-based method by 19.5%, and the BM25 method by 17.4%.

| S-EPMC10168556 | biostudies-literature

Semiautomated text analytics for qualitative data synthesis.

Project description:Approaches to synthesizing qualitative data have, to date, largely focused on integrating the findings from published reports. However, developments in text mining software offer the potential for efficient analysis of large pooled primary qualitative datasets. This case study aimed to (a) provide a step-by-step guide to using one software application, Leximancer, and (b) interrogate opportunities and limitations of the software for qualitative data synthesis. We applied Leximancer v4.5 to a pool of five qualitative, UK-based studies on transportation such as walking, cycling, and driving, and displayed the findings of the automated content analysis as intertopic distance maps. Leximancer enabled us to "zoom out" to familiarize ourselves with, and gain a broad perspective of, the pooled data. It indicated which studies clustered around dominant topics such as "people." The software also enabled us to "zoom in" to narrow the perspective to specific subgroups and lines of enquiry. For example, "people" featured in men's and women's narratives but were talked about differently, with men mentioning "kids" and "old," whereas women mentioned "things" and "stuff." The approach provided us with a fresh lens for the initial inductive step in the analysis process and could guide further exploration. The limitations of using Leximancer were the substantial data preparation time involved and the contextual knowledge required from the researcher to turn lines of inquiry into meaningful insights. In summary, Leximancer is a useful tool for contributing to qualitative data synthesis, facilitating comprehensive and transparent data coding but can only inform, not replace, researcher-led interpretive work.

| S-EPMC6772124 | biostudies-literature

Text-Based Recession Probabilities

Project description:This paper proposes a new methodology based on textual analysis to forecast US recessions. Specifically, it presents an index in the spirit of Baker et al. (JAMA 131:1593–1636, 2016) and Caldara and Iacoviello (JAMA 1222, 2018) that tracks developments in US real activity. When used in a standard recession probability model, this index outperforms the yield curve-based forecast, a standard method to forecast recessions, at medium horizons, up to 8 months. Moreover, the index contains information not included in yield data, that are useful to understand recession episodes; when included as an additional control along with the slope of the yield curve, it improves forecasting accuracy by between 5% and 40%, depending on the horizon considered. These results are stable to a number of different robustness checks, including different estimation methods, different definitions of recession and controlling for asset purchases by major central banks. Our textual analysis data also improve the forecasting accuracy of several other popular leading indicators for the US business cycle. Supplementary Information The online version contains supplementary material available at 10.1057/s41308-022-00177-5.

| S-EPMC9305065 | biostudies-literature

A Dual-Filter Strategy Integrating CRISPR-based Target Screening and Text Mining for Hand-Foot Syndrome

Project description:Experimental high-throughput screening (HTS) methods, exemplified by CRISPR-based screening, have revolutionized target identification in drug discovery. However, such screens frequently yield extensive and unrelated target lists, necessitating costly and time-intensive experimental validation. Here, we propose a dual-filter strategy that integrates literature-mined targets with CRISPR/Cas9 screening outputs, systematically prioritizing the most credible candidates and thereby reducing the experimental validation burden and increasing the success rate. To validate this strategy, we applied it to hand-foot syndrome (HFS), a clinically challenging side effect induced by fluoropyrimidine treatment. Using the strategy, we identified ATF4 as a key regulator of 5-fluorouracil (5-FU) toxicity in the skin and revealed forskolin as a potential therapeutic agent for HFS. Mechanistically, forskolin triggers MEK/ERK-dependent ATF4 induction, subsequently driving 5-FU detoxification via the ATF4-mediated eIF2α/IκB signaling pathway. Our findings demonstrate that this dual-filter strategy could notably accelerate drug discovery by reducing the experimental validation burden after target screening.

2025-05-25 | GSE297714 | GEO
