Dataset Information

Document vectorization method using network information of words.

ABSTRACT: We propose a new method for vectorizing a document using the relational characteristics of the words in the document. For the relational characteristics, we use two types of relational information of a word: 1) the centrality measures of the word and 2) the number of times that the word is used with other words in the document. We propose these methods mainly because information regarding the relations of a word to other words in the document is likely to represent the unique characteristics of the document better than frequency-based methods (e.g., term frequency and term frequency-inverse document frequency). In experiments using a corpus of 14 documents pertaining to four different topics, the results of clustering analysis using cosine similarities between vectors of relational information for words were comparable to, and in some cases more accurate than, those obtained using vectors from frequency-based methods. The clustering analysis using vectors of tie weights between words yielded the most accurate result. Although the results obtained for the small dataset used in this study can hardly be generalized, they suggest that, at least in some cases, vectorizing a document using the relational characteristics of its words can provide more accurate results than frequency-based vectors.
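
The two kinds of relational vectors described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation. It assumes that two words are "used together" when they appear in the same sentence, it uses weighted degree as the single centrality measure, and it skips stop-word removal. The function names and the three toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def tie_weights(document):
    """Count how often each unordered word pair shares a sentence.

    Sentence-level co-occurrence is an assumption made for this sketch.
    """
    ties = Counter()
    for sentence in re.split(r"[.!?]+", document):
        words = sorted(set(tokenize(sentence)))
        ties.update(combinations(words, 2))
    return ties


def degree_centrality(ties):
    """Weighted degree of each word: the sum of its tie weights."""
    degree = Counter()
    for (a, b), weight in ties.items():
        degree[a] += weight
        degree[b] += weight
    return degree


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical): two documents about cats, one about stocks.
docs = {
    "a": "The cat chased the mouse. The mouse hid from the cat.",
    "b": "A cat chased a mouse. The cat caught the mouse.",
    "c": "Stocks fell sharply. Investors sold stocks.",
}
tie_vecs = {name: tie_weights(text) for name, text in docs.items()}
deg_vecs = {name: degree_centrality(ties) for name, ties in tie_vecs.items()}

sim_ab = cosine(tie_vecs["a"], tie_vecs["b"])
sim_ac = cosine(tie_vecs["a"], tie_vecs["c"])
print(f"tie-weight similarity a-b: {sim_ab:.3f}, a-c: {sim_ac:.3f}")
```

In this sketch each document is a sparse vector keyed either by word pair (tie weights) or by word (centrality), so the pairwise cosine similarities can be passed to any standard clustering routine.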

SUBMITTER: Lee SY

PROVIDER: S-EPMC6638850 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

Document vectorization method using network information of words.

Lee Sang Yup SY

PLoS One, 2019-07-18, issue 7

PMID: 31318881

Similar Datasets

Project description: Recent advances in two-photon fluorescence microscopy (2PM) have allowed large scale imaging and analysis of blood vessel networks in living mice. However, extracting network graphs and vector representations for the dense capillary bed remains a bottleneck in many applications. Vascular vectorization is algorithmically difficult because blood vessels have many shapes and sizes, the samples are often unevenly illuminated, and large image volumes are required to achieve good statistical power. State-of-the-art, three-dimensional, vascular vectorization approaches often require a segmented (binary) image, relying on manual or supervised-machine annotation. Therefore, voxel-by-voxel image segmentation is biased by the human annotator or trainer. Furthermore, segmented images oftentimes require remedial morphological filtering before skeletonization or vectorization. To address these limitations, we present a vectorization method to extract vascular objects directly from unsegmented images without the need for machine learning or training. The Segmentation-Less, Automated, Vascular Vectorization (SLAVV) source code in MATLAB is openly available on GitHub. This novel method uses simple models of vascular anatomy, efficient linear filtering, and vector extraction algorithms to remove the image segmentation requirement, replacing it with manual or automated vector classification. Semi-automated SLAVV is demonstrated on three in vivo 2PM image volumes of microvascular networks (capillaries, arterioles and venules) in the mouse cortex. Vectorization performance is proven robust to the choice of plasma- or endothelial-labeled contrast, and processing costs are shown to scale with input image volume. Fully-automated SLAVV performance is evaluated on simulated 2PM images of varying quality all based on the large (1.4×0.9×0.6 mm³ and 1.6×10⁸ voxel) input image. Vascular statistics of interest (e.g., volume fraction, surface area density) calculated from automatically vectorized images show greater robustness to image quality than those calculated from intensity-thresholded images.

Project description: BACKGROUND: Automatically extracting relations between chemicals and diseases plays an important role in biomedical text mining. Chemical-disease relation (CDR) extraction aims at extracting complex semantic relationships between entities in documents, which contain intrasentence and intersentence relations. Most previous methods did not consider dependency syntactic information across sentences, which is very valuable for the relation extraction task, in particular for extracting intersentence relations accurately. OBJECTIVE: In this paper, we propose a novel end-to-end neural network based on the graph convolutional network (GCN) and multihead attention, which makes use of the dependency syntactic information across sentences to improve the CDR extraction task. METHODS: To improve the performance of intersentence relation extraction, we constructed a document-level dependency graph to capture the dependency syntactic information across sentences. GCN is applied to capture the feature representation of the document-level dependency graph. The multihead attention mechanism is employed to learn the relatively important context features from different semantic subspaces. To enhance the input representation, the deep context representation is used in our model instead of traditional word embedding. RESULTS: We evaluate our method on the CDR corpus. The experimental results show that our method achieves an F-measure of 63.5%, which is superior to other state-of-the-art methods. At the intrasentence level, our method achieves a precision, recall, and F-measure of 59.1%, 81.5%, and 68.5%, respectively. At the intersentence level, our method achieves a precision, recall, and F-measure of 47.8%, 52.2%, and 49.9%, respectively. CONCLUSIONS: The GCN model can effectively exploit the cross-sentence dependency information to improve the performance of intersentence CDR extraction. Both the deep context representation and multihead attention are helpful in the CDR extraction task.

Project description: Background: Network-based interventions against epidemic spread are most powerful when the full network structure is known. However, in practice, resource constraints require decisions to be made based on partial network information. We investigated how the accuracy of network data available at individual and village levels affected network-based vaccination effectiveness. Methods: We simulated a Susceptible-Infected-Recovered process on static empirical social networks from 75 rural Indian villages. First, we used regression analysis to predict the percentage of individuals ever infected (cumulative incidence) based on village-level network properties for simulated datasets from 10 representative villages. Second, we simulated vaccinating 10% of each of the 75 empirical village networks at baseline, selecting vaccinees through one of five network-based approaches: random individuals (Random); random contacts of random individuals (Nomination); random high-degree individuals (High Degree); highest degree individuals (Highest Degree); or most central individuals (Central). The first three approaches require only sample data; the latter two require full network data. We also simulated imposing a limit on how many contacts an individual can nominate (Fixed Choice Design, FCD), which reduces the data collection burden but generates only partially observed networks. Results: In regression analysis, we found mean and standard deviation of the degree distribution to strongly predict cumulative incidence. In simulations, the Nomination method reduced cumulative incidence by one-sixth compared to Random vaccination; full network methods reduced infection by two-thirds. The High Degree approach had intermediate effectiveness. Somewhat surprisingly, FCD truncating individuals' degrees at three was as effective as using complete networks. Conclusions: Using even partial network information to prioritize vaccines at either the village or individual level, i.e., determine the optimal order of communities or individuals within each village, substantially improved epidemic outcomes. Such approaches may be feasible and effective in outbreak settings, and full ascertainment of network structure may not be required.
