Project description:Data on the use of time in different exposures, behaviors, and work tasks are common in occupational research. Such data are most often expressed in hours, minutes, or percentage of work time. Thus, they are constrained or 'compositional', in that they add up to a finite sum (e.g. 8 h of work or 100% work time). Due to their properties, compositional data need to be processed and analyzed using specifically adapted methods. Compositional data analysis (CoDA) has become a well-established framework for handling such data in various scientific fields such as nutritional epidemiology, geology, and chemistry, but has only recently gained attention in public and occupational health sciences. In this paper, we introduce the reader to CoDA by explaining why CoDA should be used when dealing with compositional time-use data, showing how to perform CoDA, including a worked example, and pointing out some remaining challenges in CoDA. The paper concludes by emphasizing that CoDA in occupational research is still in its infancy, and stresses the need for further development and experience in the use of CoDA for time-based occupational exposures. We hope that the paper will encourage researchers to adopt and apply CoDA in studies of work exposures and health.
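As a minimal sketch of the log-ratio idea underlying CoDA (not taken from the paper; the behaviors and durations are assumed for illustration), the following closes a hypothetical 8-hour workday to a constant sum and applies a centred log-ratio (clr) transform so that conventional multivariate statistics can be applied:

```python
# Hedged illustration only: a 3-part time-use composition with assumed values.
import numpy as np

hours = np.array([4.5, 2.5, 1.0])         # e.g. sitting, standing, moving (assumed)
composition = hours / hours.sum()          # closure: parts now sum to 1

# Centred log-ratio (clr) transform maps the constrained composition into
# unconstrained real space, where standard statistical methods are appropriate.
clr = np.log(composition) - np.log(composition).mean()
print(composition, clr)
```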
Project description:This manuscript treats the fat compositions of the Manzanilla and Hojiblanca cultivars as compositional data (CoDa). The work therefore applies compositional data analysis (CoDA) to investigate the effect of processing and packaging on the fatty acid profiles of these cultivars. To this end, the fat components expressed as percentages were first examined with exploratory CoDA tools and then transformed into ilr (isometric log-ratio) coordinates in Euclidean space, where they were subjected to standard multivariate techniques. The results from the first approach (bar plots of geometric means, tetrahedral plots, compositional biplots, and balance dendrograms) showed that the effect of processing was limited, while most of the variability among the fatty acid (FA) profiles was due to cultivars. The application of standard multivariate methods (i.e., canonical variates analysis, linear discriminant analysis (LDA), ANOVA/MANOVA with bootstrapping and n = 1000, and nested general linear models (GLM)) to the ilr-transformed data, following Ward's clustering or a descending-order-of-variances criterion, showed effects similar to the exploratory analysis but also showed that Hojiblanca was more sensitive to fat modifications than Manzanilla. However, associating GLM changes in ilr coordinates with specific fatty acids was not straightforward because of the complex derivation of some coordinates. Therefore, according to the CoDA results, table olive fatty acid profiles are scarcely affected by Spanish-style processing compared with the differences between cultivars. This work demonstrates that CoDA can be successfully applied to study the fatty acid profiles of olive fat and olive oils and may serve as a model for the statistical analysis of other fats, with the advantage of applying appropriate statistical techniques and preventing misinterpretations.
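A minimal sketch of the ilr transformation mentioned above, using pivot coordinates as the orthonormal basis and hypothetical fatty-acid percentages rather than the study's data:

```python
# Hedged illustration: pivot-coordinate ilr transform of a D-part composition.
import numpy as np

def ilr_pivot(x):
    """Return the D-1 pivot-coordinate ilr coordinates of a composition."""
    x = np.asarray(x, dtype=float)
    D = x.size
    z = np.empty(D - 1)
    for i in range(D - 1):
        gmean_rest = np.exp(np.mean(np.log(x[i + 1:])))            # geometric mean of remaining parts
        z[i] = np.sqrt((D - i - 1) / (D - i)) * np.log(x[i] / gmean_rest)
    return z

# Hypothetical percentages of four fatty acids (oleic, palmitic, linoleic, stearic)
profile = [72.0, 14.0, 10.0, 4.0]
print(ilr_pivot(profile))  # 3 real-valued coordinates, ready for standard multivariate analysis
```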
Project description:The analysis of the combined mRNA and miRNA content of a biological sample can be of interest for answering several research questions, such as biomarker discovery or mRNA-miRNA interactions. However, the process is costly and time-consuming, as separate libraries need to be prepared and sequenced on different flowcells. Combo-Seq is a library prep kit that allows combined mRNA-miRNA libraries to be prepared starting from very low amounts of total RNA. To date, no dedicated bioinformatics method exists for the processing of Combo-Seq data. In this paper, we describe CODA (Combo-seq Data Analysis), a workflow specifically developed for the processing of Combo-Seq data that employs existing free-to-use tools. We compare CODA with exceRpt, the pipeline suggested by the kit manufacturer for this purpose. We also evaluate how Combo-Seq libraries analysed with CODA perform compared with conventional poly(A) and small RNA libraries prepared from the same samples. We show that CODA recovers more successfully trimmed reads than exceRpt, and that the difference is more dramatic with short sequencing reads. We demonstrate that Combo-Seq identifies as many genes as, but fewer miRNAs than, the standard libraries, and that miRNA validation favours conventional small RNA libraries over Combo-Seq. The CODA code is available at https://github.com/marta-nazzari/CODA.
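A minimal sketch of the kind of bookkeeping behind the trimming comparison above; the file names are hypothetical and the snippet simply counts how many reads survive adapter trimming in gzipped FASTQ files so that pipelines can be compared on reads retained:

```python
# Hedged illustration: count reads before and after trimming (hypothetical files).
import gzip

def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

raw = count_fastq_reads("sample_raw.fastq.gz")          # hypothetical raw input
trimmed = count_fastq_reads("sample_trimmed.fastq.gz")  # hypothetical trimmed output
print(f"Reads retained after trimming: {100 * trimmed / raw:.1f}%")
```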
Project description:Purpose: Privacy-protecting analytic and data-sharing methods that minimize the disclosure risk of sensitive information are increasingly important due to the growing interest in utilizing data across multiple sources. We conducted a simulation study to examine how avoiding sharing individual-level data in a distributed data network can affect analytic results. Methods: The base scenario had four sites of varying sizes with 5% outcome incidence, 50% treatment prevalence, and seven confounders. We varied treatment prevalence, outcome incidence, treatment effect, site size, number of sites, and covariate distribution. Confounding adjustment was conducted using propensity score or disease risk score. We compared analyses of three types of aggregate-level data requested from sites: risk-set, summary-table, or effect-estimate data (meta-analysis), with benchmark results from analysis of pooled individual-level data. We assessed bias and precision of hazard ratio estimates as well as the accuracy of standard error estimates. Results: All the aggregate-level data-sharing approaches, regardless of confounding adjustment methods, successfully approximated pooled individual-level data analysis in most simulation scenarios. Meta-analysis showed minor bias when using inverse probability of treatment weights (IPTW) in infrequent exposure (5%), rare outcome (0.01%), and small site (5,000 patients) settings. SE estimates became less accurate for the IPTW risk-set approach with less frequent exposure and for the propensity score-matching meta-analysis approach with rare outcomes. Conclusions: Overall, we found that we can avoid sharing individual-level data and obtain valid results in many settings, although care must be taken with the meta-analysis approach in infrequent exposure and rare outcome scenarios, particularly when confounding adjustment is performed with IPTW.
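A minimal sketch of the effect-estimate (meta-analysis) sharing approach described above, with hypothetical per-site estimates rather than the simulated data: each site shares only its log hazard ratio and standard error, and the coordinating centre pools them with a fixed-effect inverse-variance meta-analysis.

```python
# Hedged illustration: fixed-effect inverse-variance pooling of site-level estimates.
import numpy as np

# (log HR, SE) per site -- assumed values for illustration only
site_estimates = [(0.18, 0.10), (0.25, 0.14), (0.20, 0.08), (0.35, 0.22)]

log_hr = np.array([est for est, _ in site_estimates])
se = np.array([s for _, s in site_estimates])

weights = 1.0 / se**2
pooled_log_hr = np.sum(weights * log_hr) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"Pooled HR = {np.exp(pooled_log_hr):.2f} "
      f"(95% CI {np.exp(pooled_log_hr - 1.96 * pooled_se):.2f}-"
      f"{np.exp(pooled_log_hr + 1.96 * pooled_se):.2f})")
```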
Project description:Background: There is no gold standard for body composition measurement in pediatric patients with obesity. Therefore, the aim of this study was to investigate whether there are any differences between two bioelectrical impedance analysis techniques performed in children and adolescents with obesity. Methods: Data were collected at the Department of Pediatrics and Adolescent Medicine in Vienna from September 2015 to May 2017. Body composition measurement was performed with the TANITA scale and BIA-BIACORPUS. Results: In total, 38 children and adolescents (age: 10-18 years, BMI: 25-54 kg/m2) were included. Boys had significantly increased fat-free mass (TANITA p = 0.019, BIA p = 0.003), total body water (TANITA p = 0.020, BIA p = 0.005), and basal metabolic rate (TANITA p = 0.002, BIA p = 0.029). Girls had significantly increased body fat percentage with BIA (p = 0.001). No significant gender differences in core abdominal area were found. TANITA overestimated body fat percentage (p < 0.001), fat mass (p = 0.002), and basal metabolic rate (p < 0.001) compared to BIA. TANITA underestimated fat-free mass (p = 0.002) in comparison to BIA. The Bland-Altman plot demonstrated low agreement between the body composition methods. Conclusions: Low agreement between the TANITA scale and BIA-BIACORPUS was observed. Body composition measurement should always be performed with the same device to obtain comparable results. In clinical routine, bioelectrical impedance analysis is appropriate for pediatric patients with obesity due to its feasibility, safety, and efficiency. Trial registration: ClinicalTrials NCT02545764. Registered 10 September 2015.
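A minimal sketch of a Bland-Altman agreement analysis like the one reported above, using simulated body fat percentages rather than the study data; the assumed bias and spread are for illustration only.

```python
# Hedged illustration: bias and 95% limits of agreement between two devices.
import numpy as np

rng = np.random.default_rng(0)
bia = rng.uniform(25, 50, size=38)                # hypothetical BIA body fat %
tanita = bia + rng.normal(3.0, 2.5, size=38)      # hypothetical systematic overestimation

diff = tanita - bia
mean_diff = diff.mean()                           # bias
loa = 1.96 * diff.std(ddof=1)                     # half-width of limits of agreement

print(f"Bias: {mean_diff:.1f} %-points, "
      f"limits of agreement: {mean_diff - loa:.1f} to {mean_diff + loa:.1f}")
```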
Project description:Secondary analyses of survey data collected from large probability samples of persons or establishments further scientific progress in many fields. The complex design features of these samples improve data collection efficiency, but also require analysts to account for these features when conducting analysis. Unfortunately, many secondary analysts from fields outside of statistics, biostatistics, and survey methodology do not have adequate training in this area, and as a result may apply incorrect statistical methods when analyzing these survey data sets. This in turn could lead to the publication of incorrect inferences based on the survey data that effectively negate the resources dedicated to these surveys. In this article, we build on the results of a preliminary meta-analysis of 100 peer-reviewed journal articles presenting analyses of data from a variety of national health surveys, which suggested that analytic errors may be extremely prevalent in these types of investigations. We first perform a meta-analysis of a stratified random sample of 145 additional research products analyzing survey data from the Scientists and Engineers Statistical Data System (SESTAT), which describes features of the U.S. Science and Engineering workforce, and examine trends in the prevalence of analytic error across the decades used to stratify the sample. We once again find that analytic errors appear to be quite prevalent in these studies. Next, we present several example analyses of real SESTAT data, and demonstrate that a failure to perform these analyses correctly can result in substantially biased estimates with standard errors that do not adequately reflect complex sample design features. Collectively, the results of this investigation suggest that reviewers of this type of research need to pay much closer attention to the analytic methods employed by researchers attempting to publish or present secondary analyses of survey data.
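A minimal sketch of one of the analytic errors discussed above, ignoring survey weights when estimating a population mean, using simulated data rather than SESTAT:

```python
# Hedged illustration: unweighted vs. design-weighted estimation of a mean.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample in which a higher-salary subgroup is oversampled,
# so each record carries a survey weight (inverse of its selection probability).
salary = np.concatenate([rng.normal(60_000, 8_000, 400),    # oversampled group
                         rng.normal(45_000, 8_000, 100)])   # undersampled group
weight = np.concatenate([np.full(400, 1.0),                  # selected at a high rate
                         np.full(100, 8.0)])                 # selected at a low rate

unweighted_mean = salary.mean()
weighted_mean = np.average(salary, weights=weight)

# The unweighted estimate is biased toward the oversampled group; a design-based
# analysis must also use stratum/cluster information when computing standard errors.
print(f"unweighted: {unweighted_mean:,.0f}  weighted: {weighted_mean:,.0f}")
```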
Project description:BACKGROUND & AIMS:A diagnosis of cirrhosis can be made on the basis of findings from imaging studies, but these are subjective. Analytic morphomics uses computational image processing algorithms to provide precise and detailed measurements of organs and body tissues. We investigated whether morphomic parameters can be used to identify patients with cirrhosis. METHODS:In a retrospective study, we performed analytic morphomics on data collected from 357 patients evaluated at the University of Michigan from 2004 to 2012 who had a liver biopsy within 6 months of a computed tomography scan for any reason. We used logistic regression with elastic net regularization and cross-validation to develop predictive models for cirrhosis, within 80% randomly selected internal training set. The other 20% data were used as internal test set to ensure that model overfitting did not occur. In validation studies, we tested the performance of our models on an external cohort of patients from a different health system. RESULTS:Our predictive models, which were based on analytic morphomics and demographics (morphomics model) or analytic morphomics, demographics, and laboratory studies (full model), identified patients with cirrhosis with area under the receiver operating characteristic curve (AUROC) values of 0.91 and 0.90, respectively, compared with 0.69, 0.77, and 0.76 for aspartate aminotransferase-to-platelet ratio, Lok Score, and FIB-4, respectively, by using the same data set. In the validation set, our morphomics model identified patients who developed cirrhosis with AUROC value of 0.97, and the full model identified them with AUROC value of 0.90. CONCLUSIONS:We used analytic morphomics to demonstrate that cirrhosis can be objectively quantified by using medical imaging. In a retrospective analysis of multi-protocol scans, we found that it is possible to identify patients who have cirrhosis on the basis of analyses of preexisting scans, without significant additional risk or cost.
Project description:Recently, it has been shown that targeted mutagenesis using zinc-finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) can be used to generate knockout zebrafish lines for analysis of their function and/or for developing disease models. A number of different methods have been developed for the design and assembly of gene-specific ZFNs and TALENs, making them easily available to most zebrafish researchers. Regardless of the choice of targeting nuclease, the process of generating mutant fish is similar. It is a time-consuming, multi-step process that can benefit significantly from the development of efficient high-throughput methods. In this study, we used ZFNs assembled through either the CompoZr (Sigma-Aldrich) or the CoDA (context-dependent assembly) platform to generate mutant zebrafish for nine genes. We report our improved high-throughput methods for 1) evaluation of ZFN activity by somatic lesion analysis using colony PCR, eliminating the need for plasmid DNA extractions from a large number of clones, and 2) a sensitive founder screening strategy using fluorescent PCR with PIG-tailed primers that eliminates stutter bands and accurately identifies even single-nucleotide insertions and deletions. Using these protocols, we have generated multiple mutant alleles for seven genes, five of which were targeted with CompoZr ZFNs and two with CoDA ZFNs. Our data also revealed that at least a five-fold higher mRNA dose was required to achieve mutagenesis with CoDA ZFNs than with CompoZr ZFNs, and that their somatic lesion frequency was lower (<5%) than that of CompoZr ZFNs (9-98%). This work provides high-throughput protocols for the efficient generation of zebrafish mutants using ZFNs and TALENs.
Project description:Many important questions in biology are, fundamentally, comparative, and this extends to our analysis of a growing number of sequenced genomes. Existing genomic analysis tools are often organized around literal views of genomes as linear strings. Even when information is highly condensed, these views grow cumbersome as larger numbers of genomes are added. Data aggregation and summarization methods from the field of visual analytics can provide abstracted comparative views, suitable for sifting large multi-genome datasets to identify critical similarities and differences. We introduce a software system for visual analysis of comparative genomics data. The system automates the process of data integration, and provides the analysis platform to identify and explore features of interest within these large datasets. GenoSets borrows techniques from business intelligence and visual analytics to provide a rich interface of interactive visualizations supported by a multi-dimensional data warehouse. In GenoSets, visual analytic approaches are used to enable querying based on orthology, functional assignment, and taxonomic or user-defined groupings of genomes. GenoSets links this information together with coordinated, interactive visualizations for both detailed and high-level categorical analysis of summarized data. GenoSets has been designed to simplify the exploration of multiple genome datasets and to facilitate reasoning about genomic comparisons. Case examples are included showing the use of this system in the analysis of 12 Brucella genomes. GenoSets software and the case study dataset are freely available at http://genosets.uncc.edu. We demonstrate that the integration of genomic data using a coordinated multiple view approach can simplify the exploration of large comparative genomic data sets, and facilitate reasoning about comparisons and features of interest.
Project description:Background: Researchers applying compositional data analysis to time-use data (e.g., time spent in physical behaviors) often face the problem of zeros, that is, recordings of zero time spent in any of the studied behaviors. Zeros hinder the application of compositional data analysis because the analysis is based on log-ratios. One way to overcome this challenge is to replace the zeros with sensible small values. The aim of this study was to compare the performance of three existing replacement methods used within physical behavior time-use epidemiology: simple replacement, multiplicative replacement, and the log-ratio expectation-maximization (lrEM) algorithm. Moreover, we assessed the consequence of choosing replacement values higher than the lowest observed value for a given behavior. Method: Using a complete dataset based on accelerometer data from 1310 Danish adults as reference, multiple datasets were simulated across six scenarios of zeros (5-30% zeros in 5% increments). Moreover, four examples were produced based on real data, in which 10% and 20% zeros were imposed and replaced using a replacement value of 0.5 min, 65% of the observation threshold, or an estimated value below the observation threshold. For the simulation study and the examples, the zeros were replaced using the three replacement methods, and the degree of distortion introduced was assessed by comparison with the complete dataset. Results: The lrEM method outperformed the other replacement methods, as it had the smallest influence on the structure of relative variation of the datasets. Both the simple and multiplicative replacements introduced higher distortion, particularly in scenarios with more than 10% zeros, although the latter, like the lrEM, preserves the ratios between behaviors with no zeros. The examples revealed that replacing zeros with a value higher than the observation threshold severely affected the structure of relative variation. Conclusions: Given our findings, we encourage the use of replacement methods that preserve the relative structure of physical behavior data, as achieved by the multiplicative and lrEM replacements, and discourage the use of simple replacement. Moreover, we do not recommend replacing zeros with values higher than the lowest observed value for a behavior.
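A minimal sketch of the multiplicative replacement compared above, using toy minutes-per-day values; the 0.5 min replacement value mirrors one of the examples, but the behavior durations are assumed. Zeros are replaced by the small value and the non-zero parts are rescaled by a common factor, so the ratios among observed behaviors are preserved and the composition still sums to its original total.

```python
# Hedged illustration: multiplicative replacement of zeros in a composition.
import numpy as np

def multiplicative_replacement(x, delta):
    """Replace zeros in composition x with delta, rescaling non-zero parts."""
    x = np.asarray(x, dtype=float)
    total = x.sum()
    zeros = x == 0
    # Non-zero parts share one multiplicative factor, so their ratios are unchanged.
    return np.where(zeros, delta, x * (1 - zeros.sum() * delta / total))

# Hypothetical minutes/day in four behaviors, one unobserved (zero)
day = np.array([420.0, 300.0, 0.0, 720.0])
print(multiplicative_replacement(day, delta=0.5))   # e.g. 0.5 min as the replacement value
```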