Project description:Freely-available satellite data streams and the ability to process these data on cloud-computing platforms such as Google Earth Engine have made frequent, large-scale landcover mapping at high resolution a real possibility. In this paper we apply these technologies, along with machine learning, to the mapping of peatlands-a landcover class that is critical for preserving biodiversity, helping to address climate change impacts, and providing ecosystem services, e.g., carbon storage-in the Boreal Forest Natural Region of Alberta, Canada. We outline a data-driven, scientific framework that: compiles large amounts of Earth observation data sets (radar, optical, and LiDAR); examines the extracted variables for suitability in peatland modelling; optimizes model parameterization; and finally, predicts peatland occurrence across a large boreal area (397, 958 km2) of Alberta at 10 m spatial resolution (equalling 3.9 billion pixels across Alberta). The resulting peatland occurrence model shows an accuracy of 87% and a kappa statistic of 0.57 when compared to our validation data set. Differentiating peatlands from mineral wetlands achieved an accuracy of 69% and kappa statistic of 0.37. This data-driven approach is applicable at large geopolitical scales (e.g., provincial, national) for wetland and landcover inventories that support long-term, responsible resource management.
Project description:We describe the Open Diffusion Data Derivatives (O3D) repository: an integrated collection of preserved brain data derivatives and processing pipelines, published together using a single digital-object-identifier. The data derivatives were generated using modern diffusion-weighted magnetic resonance imaging data (dMRI) with diverse properties of resolution and signal-to-noise ratio. In addition to the data, we publish all processing pipelines (also referred to as open cloud services). The pipelines utilize modern methods for neuroimaging data processing (diffusion-signal modelling, fiber tracking, tractography evaluation, white matter segmentation, and structural connectome construction). The O3D open services can allow cognitive and clinical neuroscientists to run the connectome mapping algorithms on new, user-uploaded, data. Open source code implementing all O3D services is also provided to allow computational and computer scientists to reuse and extend the processing methods. Publishing both data-derivatives and integrated processing pipeline promotes practices for scientific reproducibility and data upcycling by providing open access to the research assets for utilization by multiple scientific communities.
Project description:Motivation:The increasing amount of next-generation sequencing data poses a fundamental challenge on large scale genomic analytics. Existing tools use different distributed computational platforms to scale-out bioinformatics workloads. However, the scalability of these tools is not efficient. Moreover, they have heavy run time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. Results:Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92-157 times faster than MetaSpark on metagenomic fragment recruitment and 18-32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, which includes the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 h, presenting an approach to easily associate large amounts of public datasets with reference data. Availability and implementation:Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/. Contact:asczyrba@cebitec.uni-bielefeld.de. Supplementary information:Supplementary data are available at Bioinformatics online.
Project description:The WRF-ACI model configuration is used to investigate the scale dependency of aerosol-cloud interactions (ACI) across the "grey zone" scales for grid and subgrid-scale clouds. The impacts of ACI on weather are examined across regions in the eastern and western U. S. at 36, 12, 4, and 1 km grid spacing for short-term periods during the summer of 2006. ACI impacts are determined by comparing simulations with current climatological aerosol levels to simulations with aerosol levels reduced by 90%. The aerosol-cloud lifetime effect is found to be the dominant process leading to suppressed precipitation in regions of the eastern U.S., while regions in the western U. S. experience offsetting impacts on precipitation from the cloud lifetime effect and other effects that enhance precipitation. Generally, the cloud lifetime effect weakens with decreasing grid spacing due to a decrease in relative importance of autoconversion compared to accretion. Subgrid-scale ACI are dominant at 36 km, while grid-scale ACI are dominant at 4 and 1 km. At 12 km grid spacing, grid-scale and subgrid-scale ACI processes are comparable in magnitude and spatial coverage, but random perturbations in grid-scale-ACI impacts make the overall grid-scale-ACI impact appear muted. This competing behavior of grid and subgrid-scale clouds complicate the understanding of ACI at 12 km within the current WRF modeling framework. The work implies including subgrid-scale-cloud microphysics and ice/mixed phase cloud ACI processes may be necessary in weather and climate models to study ACI effectively.
Project description:BackgroundA lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive and intuitive for both computational novices and experts alike.Aim of reviewTo encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.Key scientific concepts of reviewThis tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.
Project description:Contact mapping experiments such as Hi-C explore how genomes fold in 3D. Here, we introduce Juicebox.js, a cloud-based web application for exploring the resulting datasets. Like the original Juicebox application, Juicebox.js allows users to zoom in and out of such datasets using an interface similar to Google Earth. Juicebox.js also has many features designed to facilitate data reproducibility and sharing. Furthermore, Juicebox.js encodes the exact state of the browser in a shareable URL. Creating a public browser for a new Hi-C dataset does not require coding and can be accomplished in under a minute. The web app also makes it possible to create interactive figures online that can complement or replace ordinary journal figures. When combined with Juicer, this makes the entire process of data analysis transparent, insofar as every step from raw reads to published figure is publicly available as open source code.
Project description:Rapid growth and storage of biomedical data enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other side analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud based system that integrates metalearning framework for ranking and selection of best predictive algorithms for data at hand and open source big data technologies for analysis of biomedical data.
Project description:In the field of genomic medical research, the amount of large-scale information continues to increase due to advances in measurement technologies, such as high-performance sequencing and spatial omics, as well as the progress made in genomic cohort studies involving more than one million individuals. Therefore, researchers require more computational resources to analyze this information. Here, we introduce a hybrid cloud system consisting of an on-premise supercomputer, science cloud, and public cloud at the Kyoto University Center for Genomic Medicine in Japan as a solution. This system can flexibly handle various heterogeneous computational resource-demanding bioinformatics tools while scaling the computational capacity. In the hybrid cloud system, we demonstrate the way to properly perform joint genotyping of whole-genome sequencing data for a large population of 11,238, which can be a bottleneck in sequencing data analysis. This system can be one of the reference implementations when dealing with large amounts of genomic medical data in research centers and organizations.
Project description:The biomedical data landscape is fragmented with several isolated, heterogeneous data and knowledge sources, which use varying formats, syntaxes, schemas, and entity notations, existing on the Web. Biomedical researchers face severe logistical and technical challenges to query, integrate, analyze, and visualize data from multiple diverse sources in the context of available biomedical knowledge. Semantic Web technologies and Linked Data principles may aid toward Web-scale semantic processing and data integration in biomedicine. The biomedical research community has been one of the earliest adopters of these technologies and principles to publish data and knowledge on the Web as linked graphs and ontologies, hence creating the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we provide our perspective on some opportunities proffered by the use of LSLOD to integrate biomedical data and knowledge in three domains: (1) pharmacology, (2) cancer research, and (3) infectious diseases. We will discuss some of the major challenges that hinder the wide-spread use and consumption of LSLOD by the biomedical research community. Finally, we provide a few technical solutions and insights that can address these challenges. Eventually, LSLOD can enable the development of scalable, intelligent infrastructures that support artificial intelligence methods for augmenting human intelligence to achieve better clinical outcomes for patients, to enhance the quality of biomedical research, and to improve our understanding of living systems.
Project description:A key aspect of neuroscience research is the development of powerful, general-purpose data analyses that process large datasets. Unfortunately, modern data analyses have a hidden dependence upon complex computing infrastructure (e.g., software and hardware), which acts as an unaddressed deterrent to analysis users. Although existing analyses are increasingly shared as open-source software, the infrastructure and knowledge needed to deploy these analyses efficiently still pose significant barriers to use. In this work, we develop Neuroscience Cloud Analysis As a Service (NeuroCAAS): a fully automated open-source analysis platform offering automatic infrastructure reproducibility for any data analysis. We show how NeuroCAAS supports the design of simpler, more powerful data analyses and that many popular data analysis tools offered through NeuroCAAS outperform counterparts on typical infrastructure. Pairing rigorous infrastructure management with cloud resources, NeuroCAAS dramatically accelerates the dissemination and use of new data analyses for neuroscientific discovery.