Project description:Bacteriophages are the most abundant biological entity on the planet, but at the same time do not account for much of the genetic material isolated from most environments due to their small genome sizes. They also show great genetic diversity and have mosaic genomes, which makes them challenging to analyze and understand. Here we present MetaPhinder, a method to identify assembled genomic fragments (i.e., contigs) of phage origin in metagenomic data sets. The method is based on a comparison to a database of whole-genome bacteriophage sequences, integrating hits to multiple genomes to accommodate the mosaic genome structure of many bacteriophages. The method is demonstrated to outperform both BLAST methods based on single hits and methods based on k-mer comparisons. MetaPhinder is available as a web service at the Center for Genomic Epidemiology (https://cge.cbs.dtu.dk/services/MetaPhinder/), while the source code can be downloaded from https://bitbucket.org/genomicepidemiology/metaphinder or https://github.com/vanessajurtz/MetaPhinder.
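The idea of integrating hits to multiple genomes can be illustrated with a minimal sketch. It is not MetaPhinder's actual scoring: it only assumes that each hit is reported as a 1-based inclusive (start, end) interval on the query contig, merges overlapping hits regardless of which phage genome they came from, and reports the fraction of the contig that is covered. The function names and the example coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: a hit is a 1-based, inclusive (start, end) interval on the contig.
Interval = Tuple[int, int]


def merge_intervals(hits: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent hit intervals."""
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(hits: Iterable[Interval], contig_length: int) -> float:
    """Fraction of the contig covered by at least one hit."""
    covered = sum(end - start + 1 for start, end in merge_intervals(hits))
    return covered / contig_length


# Hypothetical hits from three different phage genomes on a 1000 bp contig.
example_hits = [(1, 300), (250, 600), (800, 1000)]
print(merge_intervals(example_hits))         # [(1, 600), (800, 1000)]
print(covered_fraction(example_hits, 1000))  # 0.801
```

A single-hit approach would see at most 351 covered bases here (the longest hit), whereas combining hits across genomes covers 801 of 1000 bases, which is the intuition behind accommodating mosaic genomes.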
Project description:Virophages, e.g., Sputnik, Mavirus, and Organic Lake virophage (OLV), are unusual parasites of giant double-stranded DNA (dsDNA) viruses, yet little is known about their diversity. Here, we describe the global distribution, abundance, and genetic diversity of virophages based on analyzing and mapping comprehensive metagenomic databases. The results reveal a distinct abundance and worldwide distribution of virophages, involving almost all geographical zones and a variety of unique environments. These environments ranged from deep ocean to inland, iced to hydrothermal lakes, and human gut- to animal-associated habitats. Four complete virophage genomic sequences (Yellowstone Lake virophages [YSLVs]) were obtained, as was one nearly complete sequence (Ace Lake Mavirus [ALM]). The genomes obtained were 27,849 bp long with 26 predicted open reading frames (ORFs) (YSLV1), 23,184 bp with 21 ORFs (YSLV2), 27,050 bp with 23 ORFs (YSLV3), 28,306 bp with 34 ORFs (YSLV4), and 17,767 bp with 22 ORFs (ALM). The homologous counterparts of five genes, including putative FtsK-HerA family DNA packaging ATPase and genes encoding DNA helicase/primase, cysteine protease, major capsid protein (MCP), and minor capsid protein (mCP), were present in all virophages studied thus far. They also shared a conserved gene cluster comprising the two core genes of MCP and mCP. Comparative genomic and phylogenetic analyses showed that YSLVs, having a closer relationship to each other than to the other virophages, were more closely related to OLV than to Sputnik but distantly related to Mavirus and ALM. These findings indicate that virophages appear to be widespread and genetically diverse, with at least 3 major lineages.
Project description:When we collect the growth curves of many individuals, orderly variation in the curves is often observed rather than a completely random mixture of various curves. Small individuals may exhibit similar growth curves, but the curves differ from those of large individuals, whereby the curves gradually vary from small to large individuals. It has been recognized that after standardization with the asymptotes, if all the growth curves are the same (anamorphic growth curve set), the growth curve sets can be estimated using nonchronological data; otherwise, that is, if the growth curves are not identical after standardization with the asymptotes (polymorphic growth curve set), this estimation is not feasible. However, because a given set of growth curves determines the variation in the observed data, it may be possible to estimate polymorphic growth curve sets using nonchronological data. In this study, we developed an estimation method by deriving the likelihood function for polymorphic growth curve sets. The method involves simple maximum likelihood estimation. The weighted nonlinear regression and least-squares method after the log-transform of the anamorphic growth curve sets were included as special cases. The growth curve sets of the height of cypress (Chamaecyparis obtusa) and larch (Larix kaempferi) trees were estimated. With the model selection process using the AIC and likelihood ratio test, the growth curve set for cypress was found to be polymorphic, whereas that for larch was found to be anamorphic. The improved fit of the polymorphic model for cypress is due to resolving underdispersion (less dispersion in the real data than the model predicts). The likelihood function for model estimation depends not only on the distribution type of the asymptotes but also on the definition of the growth curve set. Consideration of these factors may be necessary, even if environmental explanatory variables and random effects are introduced.
Project description:Background: De-identification is a common way to protect patient privacy when disclosing clinical data for secondary purposes, such as research. One type of attack that de-identification protects against is linking the disclosed patient data with public and semi-public registries. Uniqueness is a commonly used measure of re-identification risk under this attack. If uniqueness can be measured accurately, then the risk from this kind of attack can be managed. In practice, it is often not possible to measure uniqueness directly; therefore, it must be estimated. Methods: We evaluated the accuracy of uniqueness estimators on clinically relevant data sets. Four candidate estimators were identified because they were evaluated in the past and found to have good accuracy or because they were new and not evaluated comparatively before: the Zayatz estimator, slide negative binomial estimator, Pitman's estimator, and mu-argus. A Monte Carlo simulation was performed to evaluate the uniqueness estimators on six clinically relevant data sets. We varied the sampling fraction and the uniqueness in the population (the value being estimated). The median relative error and inter-quartile range of the uniqueness estimates were measured across 1000 runs. Results: There was no single estimator that performed well across all of the conditions. We developed a decision rule which selected between the Pitman, slide negative binomial and Zayatz estimators depending on the sampling fraction and the difference between estimates. This decision rule had the best consistent median relative error across multiple conditions and data sets. Conclusion: This study identified an accurate decision rule that can be used by health privacy researchers and disclosure control professionals to estimate uniqueness in clinical data sets. The decision rule provides a reliable way to measure re-identification risk.
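To make the quantity being estimated concrete, the sketch below computes uniqueness as the proportion of records whose combination of quasi-identifiers occurs exactly once, and repeats the computation on random samples drawn at a given sampling fraction. It does not implement the Zayatz, slide negative binomial, Pitman, or mu-argus estimators; the record layout, function names, and toy data are assumptions made for illustration only.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def uniqueness(records: Sequence[Hashable]) -> float:
    """Proportion of records whose quasi-identifier combination occurs once."""
    if not records:
        return 0.0
    counts = Counter(records)
    n_unique = sum(1 for c in counts.values() if c == 1)
    return n_unique / len(records)


def sample_uniqueness_runs(population: Sequence[Hashable],
                           sampling_fraction: float,
                           runs: int,
                           seed: int = 0) -> List[float]:
    """Uniqueness observed in repeated simple random samples (no replacement)."""
    rng = random.Random(seed)
    n = max(1, round(sampling_fraction * len(population)))
    return [uniqueness(rng.sample(list(population), n)) for _ in range(runs)]


# Toy population: (sex, year of birth, postal prefix) as hypothetical quasi-identifiers.
population = [
    ("F", 1980, "K1A"),
    ("F", 1980, "K1A"),
    ("M", 1975, "K2B"),
    ("F", 1990, "K1A"),
]
print(uniqueness(population))  # 0.5
```

Uniqueness observed in a sample generally differs from population uniqueness, because a record that is unique in the sample may share its quasi-identifiers with unsampled records. That gap is why dedicated estimators such as those compared in the abstract are needed.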
Project description:Although plasmids are important for bacterial survival and adaptation, plasmid detection and assembly from genomic, let alone metagenomic, samples remain challenging. The recently developed plasmidSPAdes assembler addressed some of these challenges in the case of isolate genomes but stopped short of detecting plasmids in metagenomic assemblies, an untapped source of yet to be discovered plasmids. We present the metaplasmidSPAdes tool for plasmid assembly in metagenomic data sets that reduced the false positive rate of plasmid detection compared with the state-of-the-art approaches. We assembled plasmids in diverse data sets and have shown that thousands of plasmids remained below the radar in already completed genomic and metagenomic studies. Our analysis revealed the extreme variability of plasmids and has led to the discovery of many novel plasmids (including many plasmids carrying antibiotic-resistance genes) without significant similarities to currently known ones.
Project description:While epigenetics continues to be a burgeoning research area in neuroscience, unaddressed issues related to data reproducibility across laboratories remain. Indeed, separating meaningful experimental changes from background variability is a challenge in epigenomic studies. Genome-wide DNA methylation analysis of hippocampal tissues from wild-type rats across three independent laboratories revealed that seemingly minor protocol differences resulted in significant epigenome profile changes, even in the absence of experimental intervention. Difficult-to-match factors such as animal vendors and a subset of husbandry and tissue extraction procedures produced quantifiable variations between wild-type animals across the three laboratories. To enhance scientific rigor, we conclude that strict adherence to protocols is necessary for the execution and interpretation of epigenetic studies and that protocol-sensitive epigenetic changes, amongst naive animals, may confound experimental results.
Project description:Background: Microhaplotypes have the potential to be more cost-effective than SNPs for applications that require genetic panels of highly variable loci. However, development of microhaplotype panels is hindered by a lack of methods for estimating microhaplotype allele frequency from low-coverage whole genome sequencing or pooled sequencing (pool-seq) data. Results: We developed new methods for estimating microhaplotype allele frequency from low-coverage whole genome sequence and pool-seq data. We validated these methods using datasets from three non-model organisms. These methods allowed estimation of allele frequency and expected heterozygosity at depths routinely achieved from pooled sequencing. Conclusions: These new methods will allow microhaplotype panels to be designed using low-coverage WGS and pool-seq data to discover and evaluate candidate loci. The python script implementing the two methods and documentation are available at https://www.github.com/delomast/mhFromLowDepSeq.
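For reference, expected heterozygosity at a locus is conventionally computed from allele frequencies as one minus the sum of squared frequencies. The sketch below applies that standard formula to hypothetical microhaplotype allele counts. It is not the low-coverage or pool-seq estimation method from the linked script, and it assumes the allele counts are already known; the function names and counts are illustrative.

```python
from typing import Dict


def allele_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert allele counts at one locus into frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alleles observed at this locus")
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs: Dict[str, float]) -> float:
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


# Hypothetical counts of three microhaplotype alleles at one locus.
counts = {"ACG": 6, "ATG": 3, "GTG": 1}
freqs = allele_frequencies(counts)
print(round(expected_heterozygosity(freqs), 2))  # 0.54
```

A locus with several alleles at intermediate frequencies scores higher than a biallelic SNP can (maximum 0.5), which is the sense in which microhaplotypes are highly variable loci.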
Project description:Exome and whole-genome sequencing studies are becoming increasingly common, but little is known about the accuracy of the genotype calls made by the commonly used platforms. Here we use replicate high-coverage sequencing of blood and saliva DNA samples from four European-American individuals to estimate lower bounds on the error rates of Complete Genomics and Illumina HiSeq whole-genome and whole-exome sequencing. Error rates for nonreference genotype calls range from 0.1% to 0.6%, depending on the platform and the depth of coverage. Additionally, we found (1) no difference in the error profiles or rates between blood and saliva samples; (2) Complete Genomics sequences had substantially higher error rates than Illumina sequences had; (3) error rates were higher (up to 6%) for rare or unique variants; (4) error rates generally declined with genotype quality (GQ) score, but in a nonlinear fashion for the Illumina data, likely due to loss of specificity of GQ scores greater than 60; and (5) error rates increased with increasing depth of coverage for the Illumina data. These findings, especially (3)-(5), suggest that caution should be taken in interpreting the results of next-generation sequencing-based association studies, and even more so in clinical application of this technology in the absence of validation by other more robust sequencing or genotyping methods.