Dataset Information

Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.

ABSTRACT: Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remaining were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. a remarkable 88.75% author matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, thus paving the way for a robust description of large texts in terms of small evolving networks.
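
The first stages of the pipeline described in the abstract (equal-sized token sections, one network per section, a time series of a network metric, distribution moments as learning attributes) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: linking only adjacent words, using average degree as the single metric, and the function names are all choices made here for the example. The abstract mentions 12 topological metrics but does not list them, and the Isomap and K-nearest neighbors steps are not shown.

```python
from collections import defaultdict
from statistics import fmean


def cooccurrence_network(tokens):
    """Undirected graph linking each pair of adjacent, distinct tokens.

    Assumption: "co-occurrence" is taken to mean immediate adjacency.
    """
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node (one simple global metric)."""
    return fmean(len(nbrs) for nbrs in adj.values()) if adj else 0.0


def metric_series(tokens, window):
    """One metric value per consecutive, non-overlapping section of `window` tokens.

    A trailing remainder shorter than `window` is dropped so that all
    sections have the same number of tokens.
    """
    starts = range(0, len(tokens) - window + 1, window)
    return [average_degree(cooccurrence_network(tokens[i:i + window])) for i in starts]


def moments(series):
    """Mean, variance, skewness and excess kurtosis (population estimates)."""
    mu = fmean(series)
    m2 = fmean((x - mu) ** 2 for x in series)
    if m2 == 0:
        return (mu, 0.0, 0.0, 0.0)
    m3 = fmean((x - mu) ** 3 for x in series)
    m4 = fmean((x - mu) ** 4 for x in series)
    return (mu, m2, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0)


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log by the door"
    series = metric_series(text.split(), window=4)
    print(series)
    print(moments(series))
```

In a fuller version, each text would contribute one feature vector made of such moments for every metric, and those vectors would then be passed to the dimensionality reduction and classification steps.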

SUBMITTER: Akimushkin C

PROVIDER: S-EPMC5268788 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature

Publications

Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.

Akimushkin, Camilo; Amancio, Diego Raphael; Oliveira, Osvaldo Novais

PLoS ONE, 2017-01-26, issue 1

PMID: 28125703

Similar Datasets

Project description: Neuromuscular disorders (NMDs) represent an important subset of rare diseases associated with elevated morbidity and mortality whose diagnosis can take years. Here we present a novel approach using systems biology to produce functionally-coherent phenotype clusters that provide insight into the cellular functions and phenotypic patterns underlying NMDs, using the Human Phenotype Ontology as a common framework. Gene and phenotype information was obtained for 424 NMDs in OMIM and 126 NMDs in Orphanet, and 335 and 216 phenotypes were identified as typical for NMDs, respectively. 'Elevated serum creatine kinase' was the most specific to NMDs, in agreement with the clinical test of elevated serum creatine kinase that is conducted on NMD patients. The approach to obtain co-occurring NMD phenotypes was validated based on co-mention in PubMed abstracts. A total of 231 (OMIM) and 150 (Orphanet) clusters of highly connected co-occurrent NMD phenotypes were obtained. In parallel, a tripartite network based on phenotypes, diseases and genes was used to associate NMD phenotypes with functions, an approach also validated by literature co-mention, with KEGG pathways showing proportionally higher overlap than Gene Ontology and Reactome. Phenotype-function pairs were crossed with the co-occurrent NMD phenotype clusters to obtain 40 (OMIM) and 72 (Orphanet) functionally coherent phenotype clusters. As expected, many of these overlapped with known diseases and confirmed existing knowledge. Other clusters revealed interesting new findings, indicating informative phenotypes for differential diagnosis, providing deeper knowledge of NMDs, and pointing towards specific cell dysfunction caused by pleiotropic genes. This work is an example of reproducible research that i) can help better understand NMDs and support their diagnosis by providing a new tool that exploits existing information to obtain novel clusters of functionally-related phenotypes, and ii) takes us another step towards personalised medicine for NMDs.

Project description: Background: Research on Neglected Tropical Diseases (NTDs) has increased in recent decades, and significant need-gaps in diagnostic and treatment tools remain. Analysing bibliometric data from published research is a powerful method for revealing research efforts, partnerships and expertise. We aim to identify and map NTD research networks in Germany and their partners abroad to enable an informed and transparent evaluation of German contributions to NTD research. Methodology/principal findings: A SCOPUS database search for articles with German author affiliations that were published between 2002 and 2012 was conducted for kinetoplastid and helminth diseases. Open-access tools were used for data cleaning and scientometrics (OpenRefine), geocoding (OpenStreetMaps) and to create (Table2Net), visualise and analyse co-authorship networks (Gephi). From 26,833 publications from around the world that addressed 11 diseases, we identified 1,187 (4.4%) with at least one German author affiliation, and we processed 972 publications for the five most published-about diseases. Of those, we extracted 4,007 individual authors and 863 research institutions to construct co-author networks. The majority of co-authors outside Germany were from high-income countries and Brazil. Collaborations with partners on the African continent remain scattered. NTD research within Germany was distributed among 220 research institutions. We identified strong performers on an individual level by using classic parameters (number of publications, h-index) and social network analysis parameters (betweenness centrality). The research network characteristics varied strongly between diseases. Conclusions/significance: The share of NTD publications with German affiliations is approximately half of its share in other fields of medical research. This finding underlines the need to identify barriers and expand Germany's otherwise strong research activities towards NTDs. A geospatial analysis of research collaborations with partners abroad can support decisions to strengthen research capacity, particularly in low- and middle-income countries, which were less involved in collaborations than high-income countries. Identifying knowledge hubs within individual researcher networks complements traditional scientometric indicators that are used to identify opportunities for collaboration. Using free tools to analyse research processes and output could facilitate data-driven health policies. Our findings contribute to the prioritisation of efforts in German NTD research at a time of impending local and global policy decisions.
