Dataset Information

Probing the statistical properties of unknown texts: application to the Voynich Manuscript.

ABSTRACT: While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
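The network-based comparison described in the abstract (real text versus its shuffled version, measured through degree and assortativity) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a simple word-adjacency network in which consecutive words are linked, and the function names (`word_adjacency_edges`, `degree_assortativity`, `shuffled`) and the sample text are hypothetical. The record does not describe the preprocessing or the exact network definition used in the paper.

```python
import random
from collections import defaultdict


def word_adjacency_edges(tokens):
    """Undirected edges linking distinct consecutive words (assumed network model)."""
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # skip self-loops from immediately repeated words
            edges.add((a, b) if a < b else (b, a))
    return edges


def degrees(edges):
    """Number of distinct neighbours of each word."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    deg = degrees(edges)
    xs, ys = [], []
    for a, b in edges:
        # Count every undirected edge in both directions so the measure is symmetric.
        xs += [deg[a], deg[b]]
        ys += [deg[b], deg[a]]
    n = len(xs)
    if n == 0:
        return float("nan")
    mean = sum(xs) / n  # xs and ys hold the same values, so they share one mean
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # undefined when all edge endpoints have equal degree
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var


def shuffled(tokens, seed=0):
    """Random permutation of the tokens: same word frequencies, word order destroyed."""
    return random.Random(seed).sample(tokens, len(tokens))


if __name__ == "__main__":
    # Hypothetical sample text; a real experiment would use a full book.
    text = ("the cat sat on the mat and the dog sat on the rug "
            "while the cat saw the dog and the dog saw the cat")
    tokens = text.split()
    for label, seq in (("original", tokens), ("shuffled", shuffled(tokens))):
        edges = word_adjacency_edges(seq)
        print(label, len(edges), degree_assortativity(edges))
```

Shuffling keeps the word frequencies fixed, so any difference between the two runs reflects word order rather than vocabulary. On a text this short the difference is not meaningful; the sketch only shows the mechanics of the comparison.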

SUBMITTER: Amancio DR

PROVIDER: S-EPMC3699599 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Probing the statistical properties of unknown texts: application to the Voynich Manuscript.

Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira, Luciano da F. Costa

PLoS ONE, 2013-07-02, issue 7

PMID: 23844002

Similar Datasets

Project description (e-journal linked abstract: http://www.current-oncology.com/index.php/oncology/article/view/840/): Pseudocirrhosis is a rare form of liver disease that causes clinical symptoms and shows radiographic signs of cirrhosis, but that has histologic features suggesting a distinct pathologic process. In the setting of cancer, hepatic metastases and systemic chemotherapy are suspected causes of pseudocirrhosis. We present the case of a 49-year-old woman with medullary thyroid carcinoma metastatic to the liver who developed pseudocirrhosis. The patient was initially enrolled in a phase I clinical trial of 5-fluorouracil, leucovorin, and oxaliplatin (FOLFOX) in combination with sunitinib (NCT00599924). After this patient’s liver metastases regressed measurably, she was switched to sunitinib maintenance. After 4 months of combination therapy with FOLFOX–sunitinib and 15 months of sunitinib maintenance, she developed abdominal bloating, early satiety, and right upper quadrant pain that increased with inspiration. Computed tomography of the abdomen revealed cirrhotic morphology changes in the liver, including the appearance of a nodular surface and capsular retraction. The patient had no risk factors for cirrhosis, and laboratory tests for causes of liver disease were normal or negative. Core-needle liver biopsy demonstrated sheets and nests of epithelioid and spindle cells resembling the primary tumor; septal fibrosis and regenerative nodules typical of cirrhosis were not observed. The background hepatic plate architecture was intact. Laboratory studies showed increased aminotransferases, alkaline phosphatase, and international normalized ratio, and decreased albumin. Portal hypertension, esophageal varices, portal hypertensive gastropathy, and hepatic hydrothorax developed as a result of advanced liver disease. Because of disease progression, sunitinib was discontinued, and the patient was managed with sorafenib.
Pseudocirrhosis has often been attributed to chemotherapeutic agents, particularly in the context of metastatic breast cancer. The toxicity profiles of FOLFOX and sunitinib include hepatic steatosis and other forms of hepatotoxicity, but cirrhotic-like disease has not been reported. Considering the transformation of discrete hepatic metastases into a diffuse carcinomatous infiltrate and the unrelated toxicities of FOLFOX and sunitinib, we diagnosed this patient with carcinomatous pseudocirrhosis secondary to metastatic medullary thyroid carcinoma. We discuss the diagnosis of pseudocirrhosis in this case and review the literature regarding pseudocirrhosis in cancer.

Project description: A variety of high-throughput techniques are now available for constructing comprehensive gene regulatory networks in systems biology. In this study, we report a new statistical approach for facilitating in silico inference of regulatory network structure. The new measure of association, coefficient of intrinsic dependence (CID), is model-free and can be applied to both continuous and categorical distributions. Given two variables X and Y, CID answers whether Y is dependent on X by examining the conditional distribution of Y given X. In this paper, we apply CID to analyze the regulatory relationships between transcription factors (TFs) (X) and their downstream genes (Y) based on clinical data. More specifically, we use estrogen receptor alpha (ERalpha) as the variable X, and the analyses are based on 48 clinical breast cancer gene expression arrays (48A). RESULTS: The analytical utility of CID was evaluated in comparison with four commonly used statistical methods: Galton-Pearson's correlation coefficient (GPCC), Student's t-test (STT), coefficient of determination (CoD), and mutual information (MI). Compared with GPCC, CoD, and MI, CID is better able to discover regulatory associations where the distribution of the mRNA expression levels of X and Y does not fit linear models. On the other hand, when CID is used to measure the association of a continuous variable (Y) with a discrete variable (X), it performs similarly to STT and appears to outperform CoD and MI. In addition, this study established a two-layer transcriptional regulatory network to exemplify the use of CID, in combination with GPCC, in deciphering gene networks based on gene expression profiles from patient arrays. CONCLUSION: CID is shown to provide useful information for identifying associations between genes and transcription factors of interest in patient arrays.
When coupled with the relationships detected by GPCC, the associations predicted by CID are applicable to the construction of transcriptional regulatory networks. This study shows how information from different data sources and learning algorithms can be integrated to investigate whether relevant regulatory mechanisms identified in cell models can also be partially re-identified in clinical samples of breast cancers. AVAILABILITY: The implementation of CID in R code can be freely downloaded from http://homepage.ntu.edu.tw/~lyliu/BC/.
