Dataset Information

Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words.

ABSTRACT: Natural Language Processing (NLP) requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated from our corpus with a Python script, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated from the corpus with a Python script. The datasets were exported into files for easy access and reuse. They can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data.
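
The frequency-based procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names, the tokenisation rule, the thresholds and the sample messages are all assumptions made for the example, and the typo mapping shown is invented rather than taken from the published datasets. In the described workflow the frequency lists are only candidates; the final stop-word, slang and typo lists come from manual review.

```python
import re
from collections import Counter


def tokenize(message):
    # Assumed tokenisation: lowercase, keep runs of letters and apostrophes
    # (so words such as "ng'ombe" stay in one piece).
    return re.findall(r"[a-z']+", message.lower())


def frequency_candidates(messages, top_n=1, rare_max=1):
    # Count every token across the corpus.
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    # Most frequent words: candidates for manual stop-word review.
    stop_word_candidates = [word for word, _ in counts.most_common(top_n)]
    # Least frequent words: candidates for manual typo review.
    typo_candidates = sorted(w for w, c in counts.items() if c <= rare_max)
    return stop_word_candidates, typo_candidates


def normalize(message, stop_words, replacements):
    # Map slang/typos to their proper words, then drop stop-words.
    tokens = (replacements.get(tok, tok) for tok in tokenize(message))
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical sample messages, for illustration only.
messages = [
    "watoto na vijana wa shule",
    "elimu ya vijana na afya",
    "afya na elimu kwa wote",
]
stop_candidates, typo_candidates = frequency_candidates(messages)

# Hypothetical typo mapping, not taken from the published dataset.
cleaned = normalize("Vjana na elimu", {"na"}, {"vjana": "vijana"})
```

In practice the reviewed stop-word list and the slang and typo mappings would be loaded from the exported dataset files rather than written inline.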

SUBMITTER: Masua B

PROVIDER: S-EPMC7689026 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature


Publications

Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words.

Masua Bernard, Masasi Noel

Data in Brief, 2020-11-10


PMID: 33294515

Similar Datasets

Project description: Background: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking. Results: We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP. Conclusions: The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP.

Project description: Introduction: Social isolation and loneliness (SI/L) are growing problems with serious health implications for older adults, especially in light of the COVID-19 pandemic. We examined transcripts from semi-structured interviews with 97 older adults (mean age 83 years) to identify linguistic features of SI/L. Methods: Natural Language Processing (NLP) methods were used to identify relevant interview segments (responses to specific questions), extract the type and number of social contacts and linguistic features such as sentiment, parts-of-speech, and syntactic complexity. We examined: (1) associations of NLP-derived assessments of social relationships and linguistic features with validated self-report assessments of social support and loneliness; and (2) important linguistic features for detecting individuals with a higher level of SI/L by using machine learning (ML) models. Results: NLP-derived assessments of social relationships were associated with self-reported assessments of social support and loneliness, though these associations were stronger in women than in men. Usage of first-person plural pronouns was negatively associated with loneliness in women and positively associated with emotional support in men. ML analysis using leave-one-out methodology showed good performance (F1 = 0.73, AUC = 0.75, specificity = 0.76, and sensitivity = 0.69) of the binary classification models in detecting individuals with a higher level of SI/L. Comparable performance was also observed when classifying social and emotional support measures. Using ML models, we identified several linguistic features (including use of first-person plural pronouns, sentiment, sentence complexity, and sentence similarity) that most strongly predicted scores on scales for loneliness and social support. Discussion: Linguistic data can provide unique insights into SI/L among older adults beyond scale-based assessments, though there are consistent gender differences. Future research studies that incorporate diverse linguistic features as well as other behavioral data-streams may be better able to capture the complexity of social functioning in older adults and to identify target subpopulations for future interventions. Given the novelty, use of NLP should include prospective consideration of bias, fairness, accountability, and related ethical and social implications.
