Project description:Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable, and sometimes very limited in their capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; the values of 10 attributes are semantically enriched using the most suitable available ontologies. The user of GenoSurf provides the search terms as input, sets the desired level of ontological enrichment, and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing the resulting files are updated in real time as search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata entries from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified by their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
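The interplay between attribute-based and keyword-based search can be sketched in a few lines. This is a conceptual illustration only: the records, field names, and values below are invented and do not reflect GenoSurf's actual data model or API.

```python
# Toy metadata records (invented for illustration, not GenoSurf data).
records = [
    {"source": "TCGA", "assay": "WGS", "raw": "breast carcinoma stage ii"},
    {"source": "ENCODE", "assay": "ChIP-seq", "raw": "H3K27ac HepG2"},
    {"source": "TCGA", "assay": "RNA-seq", "raw": "breast invasive carcinoma"},
]

def search(records, attributes=None, keyword=None):
    """Filter on consolidated attributes, then on raw-metadata keywords."""
    hits = records
    if attributes:
        hits = [r for r in hits
                if all(r.get(k) == v for k, v in attributes.items())]
    if keyword:
        hits = [r for r in hits if keyword.lower() in r["raw"].lower()]
    return hits

def counts_by(records, field):
    """Aggregate counts, recomputed as search terms are added."""
    out = {}
    for r in records:
        out[r[field]] = out.get(r[field], 0) + 1
    return out

hits = search(records, attributes={"source": "TCGA"}, keyword="breast")
```

Each added attribute or keyword narrows `hits`, and re-running `counts_by` on the narrowed set mirrors the real-time aggregate counts described above.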
Project description:Single-cell RNA sequencing (scRNA-seq) provides an unprecedented view of the cellular diversity of biological systems. However, across the thousands of publications and datasets generated using this technology, we estimate that only a minority (<25%) of studies provide cell-level metadata identifying the cell types and related findings of the published dataset. Metadata omission hinders reproduction, exploration, validation, and knowledge transfer, and is a common problem across journals, data repositories, and publication dates. We encourage investigators, reviewers, journals, and data repositories to improve their standards and ensure proper documentation of these valuable datasets.
Project description:Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies. Large genetic datasets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD up to 2 orders of magnitude faster than current tools. Availability and implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code and documentation are freely available at http://github.com/statgen/emeraLD. Supplementary information: Supplementary data are available at Bioinformatics online.
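The kind of sparsity emeraLD exploits can be illustrated with a toy calculation: if carriers of each variant are stored as index sets, the genotype products needed for r² reduce to set intersections instead of full scans over all haplotypes. This is a conceptual sketch, not emeraLD's actual C++ implementation.

```python
def r2_sparse(carriers_a, carriers_b, n):
    """r^2 between two 0/1 haplotype vectors, given carrier index sets
    and the total number of haplotypes n."""
    pa = len(carriers_a) / n
    pb = len(carriers_b) / n
    pab = len(carriers_a & carriers_b) / n  # set intersection, not a full scan
    d = pab - pa * pb                        # classical D statistic
    denom = pa * (1 - pa) * pb * (1 - pb)
    return d * d / denom if denom > 0 else 0.0

# Two variants on 8 haplotypes with identical carriers are in perfect LD:
r2 = r2_sparse({0, 3, 5}, {0, 3, 5}, 8)
```

For rare variants the carrier sets are tiny, so the cost per variant pair scales with the number of carriers rather than with sample size, which is the intuition behind the speedups reported above.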
Project description:We have created a statistically grounded tool for determining the correlation of genome-wide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets rather than to provide immediate answers. The software enables several biologically motivated approaches to these data, and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and, instead of running analyses with predefined metrics, computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor.
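One statistic in this spirit can be sketched as a permutation test asking whether query intervals overlap a reference set more often than randomly placed intervals of the same lengths do. This is a deliberately simplified illustration in Python; GenometriCorr itself is an R package and its actual tests (e.g. relative-distance and projection tests) differ.

```python
import random

def overlaps(iv, ref):
    """True if half-open interval iv overlaps any interval in ref."""
    return any(iv[0] < e and s < iv[1] for s, e in ref)

def overlap_pvalue(query, ref, chrom_len, n_perm=1000, seed=0):
    """Empirical p-value for overlap enrichment of query against ref,
    using random repositioning of the query intervals as the null."""
    rng = random.Random(seed)
    observed = sum(overlaps(iv, ref) for iv in query)
    hits = 0
    for _ in range(n_perm):
        shifted = []
        for s, e in query:
            off = rng.randrange(chrom_len - (e - s))
            shifted.append((off, off + (e - s)))
        if sum(overlaps(iv, ref) for iv in shifted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Two query intervals that both overlap the reference set:
p = overlap_pvalue([(100, 200), (400, 500)], [(150, 250), (420, 480)],
                   chrom_len=10000)
```

A small p suggests the two interval sets are spatially associated; the direction (attraction vs. avoidance) follows from comparing observed and permuted overlap counts.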
Project description:Background: Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analyses such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are growing larger, and combining multiple experiments from sequence repositories can result in datasets with thousands of samples. Processing hundreds to thousands of RNA-seq datasets raises challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. Results: GEMmaker is an nf-core-compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker ensures results are highly reproducible through the use of versioned, containerized software that can be executed on a single workstation, an institutional compute cluster, a Kubernetes platform, or the cloud. GEMmaker supports popular alignment and quantification tools, providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of locally or remotely stored samples without exceeding available data storage. Conclusions: Workflows that quantify gene expression are not new, and many already address issues of portability, reusability, and scale in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale despite limited data storage infrastructure, allowing users to process hundreds to thousands of RNA-seq samples even when storage resources are constrained. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.
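GEMmaker reports expression in both raw and normalized formats; transcripts per million (TPM) is one common normalization. The sketch below shows the arithmetic only, with illustrative gene lengths, and is not GEMmaker's actual implementation (which delegates quantification to the underlying alignment tools).

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from raw read counts and gene lengths (kb):
    divide counts by length, then rescale so the column sums to 1e6."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

# Three hypothetical genes; longer genes need more reads for the same TPM.
vals = tpm([100, 200, 300], [1.0, 2.0, 3.0])
assert abs(sum(vals) - 1e6) < 1e-3  # TPM columns always sum to one million
```

Because every sample's TPM column sums to the same total, values are comparable across the thousands of samples a combined dataset may contain.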
Project description:Large amounts of data from microbiome-related studies have been (and are currently being) deposited in international public databases. These datasets represent a valuable resource for the microbiome research community and could serve future researchers interested in integrating multiple datasets into powerful meta-analyses. However, this huge amount of data lacks harmonization and is far from being exploited to its full potential as a foundation that places microbiome research at the nexus of many subdisciplines within and beyond biology. There is thus an urgent need for data accessibility and reusability, according to the findable, accessible, interoperable and reusable (FAIR) principles, as supported by the National Microbiome Data Collaborative and FAIR Microbiome. To tackle the challenge of accelerating discovery and advances in skin microbiome research, we collected, integrated and organized existing microbiome data resources from human skin 16S rRNA amplicon-sequencing experiments. We generated a comprehensive collection of datasets, enriched in metadata, and organized this information into data frames ready to be integrated into microbiome research projects and advanced post-processing analyses, such as data science applications (e.g. machine learning). Furthermore, we created a three-stage data retrieval and curation framework to maximize the retrieval of datasets and their associated metadata. Lastly, we highlighted some caveats regarding metadata retrieval and suggested ways to improve future metadata submissions. Overall, our work resulted in a curated skin microbiome datasets collection accompanied by a state-of-the-art analysis of the last 10 years of the skin microbiome field. Database URL: https://github.com/giuliaago/SKIOMEMetadataRetrieval.
Project description:DamID is a powerful technique for identifying regions of the genome bound by a DNA-binding (or DNA-associated) protein. Currently no method exists for automatically processing next-generation sequencing DamID (DamID-seq) data, and normalisation of DamID-seq datasets based on read counts alone can lead to high background and the loss of bound signal. DamID-seq thus presents novel challenges in terms of normalisation and background minimisation. We describe here damidseq_pipeline, a software pipeline that performs automatic normalisation and background reduction on multiple DamID-seq FASTQ or BAM datasets. We illustrate the pipeline with single-replicate profiling of RNA polymerase II occupancy in third-instar larval neuroblasts of Drosophila.
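The core DamID-seq output is a per-fragment log2(Dam-fusion/Dam) ratio. The toy sketch below shows the read-count (sequencing-depth) scaling step that the text argues is insufficient on its own; the counts, the pseudocount, and the simple depth factor are illustrative assumptions, not damidseq_pipeline's actual normalisation model.

```python
import math

def log2_ratios(fusion, dam, pseudocount=1.0):
    """Per-fragment log2(fusion/Dam) ratios after depth scaling.
    fusion, dam: read counts per GATC fragment for the Dam-fusion
    and Dam-only samples. Depth scaling alone leaves background
    variation uncorrected, which is why the pipeline fits an
    additional normalisation factor."""
    depth = sum(fusion) / sum(dam)  # read-count scaling between samples
    ratios = []
    for f, d in zip(fusion, dam):
        ratios.append(math.log2((f + pseudocount) / (depth * d + pseudocount)))
    return ratios

# Three hypothetical fragments: one enriched, two background.
r = log2_ratios([40, 4, 4], [10, 4, 10])
```

Positive ratios indicate fragments enriched in the fusion sample (putative binding), negative ratios indicate background; the pipeline's extra background-reduction step shifts and scales these ratios beyond what depth scaling achieves.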
Project description:Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. There is a burgeoning literature on approaches for analyzing large spatial datasets. In this article, we propose a divide-and-conquer strategy within the Bayesian paradigm. We partition the data into subsets, analyze each subset using a Bayesian spatial process model and then obtain approximate posterior inference for the entire dataset by combining the individual posterior distributions from each subset. Importantly, as often desired in spatial analysis, we offer full posterior predictive inference at arbitrary locations for the outcome as well as the residual spatial surface after accounting for spatially oriented predictors. We call this approach "Spatial Meta-Kriging" (SMK). We do not need to store the entire data in one processor, and this leads to superior scalability. We demonstrate SMK with various spatial regression models including Gaussian processes and tapered Gaussian processes. The approach is intuitive, easy to implement, and is supported by theoretical results presented in the supplementary material available online. Empirical illustrations are provided using different simulation experiments and a geostatistical analysis of Pacific Ocean sea surface temperature data.
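The divide-and-conquer idea can be sketched in miniature: fit each subset separately, then combine the subset posteriors. Here each subset "posterior" is summarised by just a mean and a variance and combined by precision weighting; this is a drastic simplification of Spatial Meta-Kriging, which combines full posterior distributions from Bayesian spatial process models, not two-number summaries.

```python
def combine_posteriors(means, variances):
    """Precision-weighted combination of independent Gaussian
    subset-posterior summaries into a single approximate posterior."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total
    return mean, 1.0 / total  # combined mean and variance

# Two equally precise subsets pull the combined mean to their average,
# and pooling them halves the variance:
m, v = combine_posteriors([1.0, 3.0], [2.0, 2.0])
```

The payoff mirrors the one described above: no single processor ever needs the full dataset, since each subset is analysed independently before combination.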
Project description:Background: High-throughput single-cell transcriptomic technology produces massive high-dimensional data, enabling high-resolution cell type definition and identification. To uncover the expression patterns within these data, a transcriptional landscape searching algorithm at the single-cell level is desirable. Results: We explored the feasibility of using the DenseFly algorithm for cell searching on scRNA-seq data. DenseFly is a locality-sensitive hashing algorithm inspired by the fruit fly olfactory system. Our experiments indicate that DenseFly outperforms the baseline methods FlyHash and SimHash in classification tasks, and that its performance is robust to dropout events and batch effects. Conclusion: We developed a method for mapping cells across scRNA-seq datasets based on the DenseFly algorithm. It can serve as an efficient tool for cell atlas searching.
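The fly-olfaction-style hashing family that FlyHash and DenseFly belong to can be sketched as follows: expand the expression vector with a random binary projection, then keep only the top-k activations as a sparse binary hash. The dimensions, k, and projection scheme below are illustrative assumptions, not DenseFly's actual parameters or algorithm.

```python
import random

def fly_hash(x, n_out=64, k=4, seed=0):
    """Fly-inspired LSH sketch: random binary expansion of x to n_out
    units, followed by winner-take-all selection of the top-k units.
    The fixed seed makes the projection reproducible across calls."""
    rng = random.Random(seed)
    proj = [[rng.choice((0, 1)) for _ in x] for _ in range(n_out)]
    acts = [sum(w * xi for w, xi in zip(row, x)) for row in proj]
    top = sorted(range(n_out), key=lambda i: acts[i], reverse=True)[:k]
    return frozenset(top)  # indices of the winning units form the hash

# Hypothetical 4-gene expression profiles; similar cells tend to share
# more winning units than dissimilar ones:
a = fly_hash([5, 4, 0, 0])
b = fly_hash([4, 5, 0, 0])
c = fly_hash([0, 0, 5, 4])
```

Because cells with similar expression tend to collide on the same winning units, searching an atlas reduces to comparing small index sets instead of full expression vectors, which is what makes LSH-based cell search fast.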