Project description:Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images; however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10⁶ unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation task and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
Project description:Motivation: The inherent low contrast of electron microscopy (EM) datasets presents a significant challenge for rapid segmentation of cellular ultrastructures from EM data. This challenge is particularly prominent when working with the high-resolution, large datasets now acquired using electron tomography and serial block-face imaging techniques. Deep learning (DL) methods offer an exciting opportunity to automate the segmentation process by learning from manual annotations of a small sample of EM data. While many DL methods are being rapidly adopted to segment EM data, no benchmark analysis of these methods has been conducted to date. Results: We present EM-stellar, a platform hosted on Google Colab that can be used to benchmark the performance of a range of state-of-the-art DL methods on user-provided datasets. Using EM-stellar, we show that the performance of any DL method depends on the properties of the images being segmented. It also follows that no single DL method performs consistently across all performance evaluation metrics. Availability and implementation: EM-stellar (code and data) is written in Python and is freely available under the MIT license on GitHub (https://github.com/cellsmb/em-stellar). Supplementary information: Supplementary data are available at Bioinformatics online.
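To make the point about metric disagreement concrete, here is a minimal sketch (not EM-stellar's actual implementation) of two common segmentation scores, Dice and IoU, computed on flattened binary masks; the same prediction scores differently under each metric:

```python
def dice(pred, truth):
    """Dice coefficient: 2|A∩B| / (|A| + |B|) for binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2.0 * inter / total if total else 1.0

def iou(pred, truth):
    """Intersection over union (Jaccard index): |A∩B| / |A∪B|."""
    inter = sum(p * t for p, t in zip(pred, truth))
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0

# A toy 6-pixel mask pair: Dice = 2*2/(3+3) ≈ 0.667, IoU = 2/4 = 0.5
pred  = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
```

Because the two scores weight overlap differently, rankings of segmentation methods can flip depending on which metric is reported, which is why benchmarking across several metrics matters.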
Project description:This paper contains datasets related to “Efficient Deep Learning Models for Categorizing Chenopodiaceae in the Wild” (Heidary-Sharifabad et al., 2021). About 1,500 species of Chenopodiaceae are spread worldwide and are often ecologically important. Biodiversity conservation of these species is critical due to the destructive effects of human activities on them. For this purpose, identification and surveillance of Chenopodiaceae species in their natural habitat are necessary and can be facilitated by deep learning. The feasibility of applying deep learning algorithms to identify Chenopodiaceae species depends on access to an appropriate dataset. The ACHENY dataset was therefore collected from different bushes of Chenopodiaceae species under real-world conditions in desert and semi-desert areas of the Yazd province of Iran. This imbalanced dataset comprises 27,030 RGB color images from 30 Chenopodiaceae species, with 300–1,461 images per species. Images were captured from multiple bushes per species, with varying camera-to-target distances, viewpoints, and angles, under natural sunlight in November and December. The collected images are not pre-processed; they are only resized to 224 × 224 pixels, which suits several successful deep learning models, and grouped into their respective classes. The images in each class are split 10% for testing, 18% for validation, and 72% for training. Test images were often selected manually from plant bushes different from those in the training set; training and validation images were then randomly separated from the remaining images in each category. Small 64 × 64 versions of the images are also included in ACHENY for use with other deep models.
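The per-class 72%/18%/10% split described above can be sketched as follows. This is a minimal illustration using a hypothetical file list and a purely random split; as noted, ACHENY's actual test images were often hand-picked from separate bushes rather than drawn at random:

```python
import random

def split_class(images, seed=0):
    """Split one class's image list into train/val/test at roughly 72/18/10 percent."""
    imgs = list(images)
    random.Random(seed).shuffle(imgs)  # fixed seed for a reproducible split
    n = len(imgs)
    n_test = round(0.10 * n)
    n_val = round(0.18 * n)
    test = imgs[:n_test]
    val = imgs[n_test:n_test + n_val]
    train = imgs[n_test + n_val:]
    return train, val, test

# For a 300-image class this yields 216 training, 54 validation, 30 test images.
train, val, test = split_class([f"img_{i:04d}.jpg" for i in range(300)])
```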
Project description:Pre-clinical studies have reported the immunogenic or immunomodulatory effects of traditional cancer therapies. However, the publicly available well-curated and harmonized breast cancer datasets, such as TCGA, METABRIC, and MetaGxBreast, lack careful curation of treatment regimens. Hence, the impact of therapies on the prognostic/predictive value of breast cancer biomarkers has been explored only to a limited extent. Herein, we describe a pooled and treatment-curated gene-expression dataset to investigate the impact of treatments on the prognostic/predictive value of biomarkers. We searched the Gene Expression Omnibus database to identify potential human breast cancer gene-expression datasets with anthracycline/taxane treatment. Published datasets with detailed treatment regimen, clinical endpoint, clinical-pathological, and gene-expression data were extracted and harmonized. The dataset described herein will help researchers explore the interaction between gene-expression biomarkers and immunogenic/immunomodulatory treatments in breast cancer.
Project description:This data article provides details for the RDD2020 dataset, comprising 26,336 road images from India, Japan, and the Czech Republic with more than 31,000 instances of road damage. The dataset captures four types of road damage: longitudinal cracks, transverse cracks, alligator cracks, and potholes; it is intended for developing deep learning-based methods to detect and classify road damage automatically. The images in RDD2020 were captured using vehicle-mounted smartphones, making the dataset useful for municipalities and road agencies developing methods for low-cost monitoring of road pavement surface conditions. Further, machine learning researchers can use the dataset to benchmark the performance of different algorithms on other problems of the same type (image classification, object detection, etc.). RDD2020 is freely available at [1]. The latest updates and the corresponding articles related to the dataset can be accessed at [2].
Project description:The recent surge in machine learning augmented turbulence modelling is a promising approach for addressing the limitations of Reynolds-averaged Navier-Stokes (RANS) models. This work presents the development of the first open-source dataset curated and structured for immediate use in machine learning augmented corrective turbulence closure modelling. The dataset features a variety of RANS simulations with matching direct numerical simulation (DNS) and large-eddy simulation (LES) data. Four turbulence models are selected to form the initial dataset: k-ε, k-ε-ϕt-f, k-ω, and k-ω SST. The dataset consists of 29 cases per turbulence model, spanning several parametrically swept reference DNS/LES cases: periodic hills, square duct, parametric bumps, converging-diverging channel, and a curved backward-facing step. At each of the 895,640 points, various RANS features with DNS/LES labels are available. The feature set includes quantities used in current state-of-the-art models, as well as additional fields that enable the generation of new feature sets. The dataset reduces the effort required to train, test, and benchmark new corrective RANS models. The dataset is available at https://doi.org/10.34740/kaggle/dsv/2637500 .
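A common label in corrective closure modelling of this kind is the normalized Reynolds-stress anisotropy tensor derived from DNS/LES data. The following is a minimal illustrative sketch of that standard quantity, not code tied to this dataset's actual field names or file layout:

```python
def anisotropy(tau):
    """Normalized anisotropy b_ij = tau_ij / (2k) - delta_ij / 3,
    where tau is a 3x3 Reynolds-stress tensor (nested lists) and
    k = tau_ii / 2 is the turbulent kinetic energy."""
    k = 0.5 * (tau[0][0] + tau[1][1] + tau[2][2])
    return [[tau[i][j] / (2.0 * k) - (1.0 / 3.0 if i == j else 0.0)
             for j in range(3)] for i in range(3)]

# Isotropic turbulence (equal normal stresses, no shear) gives zero anisotropy;
# b is traceless by construction for any input.
b = anisotropy([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])
```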
Project description:The 26S proteasome is a 2.5 MDa molecular machine that executes the degradation of substrates of the ubiquitin-proteasome pathway. The molecular architecture of the 26S proteasome was recently established by cryo-EM approaches. For a detailed understanding of the sequence of events from the initial binding of polyubiquitylated substrates to the translocation into the proteolytic core complex, it is necessary to move beyond static structures and characterize the conformational landscape of the 26S proteasome. To this end, we have subjected a large cryo-EM dataset acquired in the presence of ATP and ATP-γS to a deep classification procedure, which deconvolutes coexisting conformational states. Highly variable regions, such as the density assigned to the largest subunit, Rpn1, are now well resolved and rendered interpretable. Our analysis reveals the existence of three major conformations: in addition to the previously described ATP-hydrolyzing (ATPh) and ATP-γS conformations, an intermediate state has been found. Its AAA-ATPase module adopts essentially the same topology that is observed in the ATPh conformation, whereas the lid is more similar to the ATP-γS-bound state. Based on the conformational ensemble of the 26S proteasome in solution, we propose a mechanistic model for substrate recognition, commitment, deubiquitylation, and translocation into the core particle.
Project description:Deep neural networks provide the current best models of visual information processing in the primate brain. Drawing on work from computer vision, the most commonly used networks are pretrained on data from the ImageNet Large Scale Visual Recognition Challenge. This dataset comprises images from 1,000 categories, selected to provide a challenging testbed for automated visual object recognition systems. Moving beyond this common practice, we here introduce ecoset, a collection of >1.5 million images from 565 basic-level categories selected to better capture the distribution of objects relevant to humans. Ecoset categories were chosen to be both frequent in linguistic usage and concrete, thereby mirroring important physical objects in the world. We test the effects of training on this ecologically more valid dataset using multiple instances of two neural network architectures: AlexNet and vNet, a novel architecture designed to mimic the progressive increase in receptive field sizes along the human ventral stream. We show that training on ecoset leads to significant improvements in predicting representations in human higher-level visual cortex and perceptual judgments, surpassing the previous state of the art. Significant and highly consistent benefits are demonstrated for both architectures on two separate functional magnetic resonance imaging (fMRI) datasets and behavioral data, jointly covering responses to 1,292 visual stimuli from a wide variety of object categories. These results suggest that computational visual neuroscience may take better advantage of the deep learning framework by using image sets that reflect the human perceptual and cognitive experience. Ecoset and trained network models are openly available to the research community.