Dataset Information

Principled approach to the selection of the embedding dimension of networks.

ABSTRACT: Network embedding is a general-purpose machine learning technique that encodes network structure in vector spaces with tunable dimension. Choosing an appropriate embedding dimension - small enough to be efficient and large enough to be effective - is challenging but necessary to generate embeddings applicable to a multitude of tasks. Existing strategies for the selection of the embedding dimension rely on performance maximization in downstream tasks. Here, we propose a principled method such that all structural information of a network is parsimoniously encoded. The method is validated on various embedding algorithms and a large corpus of real-world networks. The embedding dimension selected by our method in real-world networks suggests that efficient encoding in low-dimensional spaces is usually possible.

SUBMITTER: Gu W

PROVIDER: S-EPMC8213704 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: For many research questions in perinatal epidemiology, gestational age is a mediator that lies on the causal pathway between exposure and outcome. A mediator is an intermediate variable between an exposure and outcome, which is influenced by the exposure on the causal pathway to the outcome. Therefore, conventional analyses that adjust, stratify, or match for gestational age or its proxy (eg, preterm vs term deliveries) are problematic. This practice, which is entrenched in perinatal research, induces an overadjustment bias. Depending on the causal question, it may be inappropriate to adjust (or condition) for a mediator, such as gestational age, by either design or statistical analysis, but its effect can be quantified through causal mediation analysis. In an exposition of such methods, we demonstrated the relationship between the exposure and outcome and provided a formal analytical framework to quantify the extent to which a causal effect is influenced by a mediator. We reviewed concepts of confounding and causal inference, introduced the concept of a mediator and illustrated the perils of adjusting for a mediator in an exposure-outcome paradigm for a given causal question, adopted causal methods that call for an evaluation of a mediator in a causal exposure effect on the outcome, and discussed unmeasured confounding assumptions in mediation analysis. Furthermore, we reviewed other developments in the causal mediation analysis literature, including decomposition of a total effect when the mediator interacts with the exposure (4-way decomposition), methods for multiple mediators, mediation methods for case-control studies, mediation methods for time-to-event outcomes, sample size and power analysis for mediation analysis, and available software to apply these methods.
To illustrate these methods, we provided a clinical example to estimate the risk of perinatal mortality (outcome) concerning placental abruption (exposure) and to determine the extent to which preterm delivery (mediator; a proxy for gestational age) plays a role in this causal effect. We hoped that the adoption of mediation methods described in this review will move research in perinatal epidemiology away from biased adjustments of mediators toward a more nuanced quantification of effects that pose unique challenges and provide unique insights in our field.

Project description: Biological phenotypes are products of complex evolutionary processes in which selective forces influence multiple biological trait measurements in unknown ways. Phylogenetic comparative methods seek to disentangle these relationships across the evolutionary history of a group of organisms. Unfortunately, most existing methods fail to accommodate high-dimensional data with dozens or even thousands of observations per taxon. Phylogenetic factor analysis offers a solution to the challenge of dimensionality. However, scientists seeking to employ this modeling framework confront numerous modeling and implementation decisions, the details of which pose computational and replicability challenges. We develop new inference techniques that increase both the computational efficiency and modeling flexibility of phylogenetic factor analysis. To facilitate adoption of these new methods, we present a practical analysis plan that guides researchers through the web of complex modeling decisions. We codify this analysis plan in an automated pipeline that distills the potentially overwhelming array of decisions into a small handful of (typically binary) choices. We demonstrate the utility of these methods and analysis plan in four real-world problems of varying scales. Specifically, we study floral phenotype and pollination in columbines, domestication in industrial yeast, life history in mammals, and brain morphology in New World monkeys. General and impactful community employment of these methods requires a data scientific analysis plan that balances flexibility, speed and ease of use, while minimizing model and algorithm tuning. Even in the presence of non-trivial phylogenetic model constraints, we show that one may analytically address latent factor uncertainty in a way that (a) aids model flexibility, (b) accelerates computation (by as much as 500-fold) and (c) decreases required tuning.
These efforts coalesce to create an accessible Bayesian approach to high-dimensional phylogenetic comparative methods on large trees.
