Project description: Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of biological macromolecular complexes. Picking single-protein particles from cryo-EM micrographs is a crucial step in reconstructing protein structures. However, the widely used template-based particle-picking process is labor-intensive and time-consuming. Although machine learning and artificial intelligence (AI) based particle picking can potentially automate the process, its development is hindered by a lack of large, high-quality labelled training data. To address this bottleneck, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for protein particle picking and analysis. It consists of labelled cryo-EM micrographs (images) from 34 representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). The dataset is 2.6 terabytes and includes 9,893 high-resolution micrographs with labelled protein particle coordinates. The labelling process was rigorously validated through both 2D particle class validation and 3D density map validation against the gold standard. The dataset is expected to greatly facilitate the development of both AI and classical methods for automated cryo-EM protein particle picking.
Project description: Motivation: The inherent low contrast of electron microscopy (EM) datasets presents a significant challenge for rapid segmentation of cellular ultrastructures from EM data. This challenge is particularly prominent for the large, high-resolution datasets now acquired using electron tomography and serial block-face imaging techniques. Deep learning (DL) methods offer an exciting opportunity to automate the segmentation process by learning from manual annotations of a small sample of EM data. While many DL methods are being rapidly adopted to segment EM data, no benchmark analysis of these methods has been conducted to date. Results: We present EM-stellar, a platform hosted on Google Colab that can be used to benchmark the performance of a range of state-of-the-art DL methods on user-provided datasets. Using EM-stellar, we show that the performance of any DL method depends on the properties of the images being segmented, and that no single DL method performs consistently across all performance evaluation metrics. Availability and implementation: EM-stellar (code and data) is written in Python and is freely available under the MIT license on GitHub (https://github.com/cellsmb/em-stellar). Supplementary information: Supplementary data are available at Bioinformatics online.
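To illustrate why no single method wins on every metric (this is a standalone sketch, not EM-stellar's own code), the following Python snippet computes four common binary-segmentation scores with NumPy. The toy prediction over-segments the ground truth, so it scores well on recall but poorly on precision; which method "wins" therefore depends on the metric chosen.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute common binary-segmentation metrics.

    pred, gt: boolean arrays of the same shape (True = foreground).
    """
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-9  # guards against division by zero on empty masks
    return {
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
    }

# Toy example: an over-segmenting prediction gets near-perfect recall
# but mediocre precision, so metric choice changes method rankings.
gt = np.zeros((64, 64), dtype=bool)
gt[16:48, 16:48] = True
pred = np.zeros_like(gt)
pred[8:56, 8:56] = True
print(segmentation_metrics(pred, gt))
```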
Project description: This paper contains datasets related to "Efficient Deep Learning Models for Categorizing Chenopodiaceae in the Wild" (Heidary-Sharifabad et al., 2021). There are about 1,500 species of Chenopodiaceae spread worldwide, and they are often ecologically important. Biodiversity conservation of these species is critical because of the destructive effects of human activities on them. For this purpose, identification and surveillance of Chenopodiaceae species in their natural habitat are necessary and can be facilitated by deep learning. The feasibility of applying deep learning algorithms to identify Chenopodiaceae species depends on access to an appropriate dataset. The ACHENY dataset was therefore collected from different bushes of Chenopodiaceae species in their natural habitats, under real-world conditions in desert and semi-desert areas of the Yazd province of Iran. This imbalanced dataset comprises 27,030 RGB color images of 30 Chenopodiaceae species, with 300-1,461 images per species. For each species, imaging was performed on multiple bushes, with varying camera-to-target distances, viewpoints, angles, and natural sunlight, in November and December. The collected images are not pre-processed except for resizing to 224 × 224 pixels, a size accepted by several successful deep learning models, and were then grouped into their respective classes. The images in each class are split into 10% for testing, 18% for validation, and 72% for training, as sketched in the code below. Test images were mostly selected manually from plant bushes different from those in the training set; training and validation images were then randomly separated from the remaining images in each category. Small 64 × 64 images are also included in ACHENY for use with other deep models.
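A minimal sketch of the described preparation, assuming one folder per species (the paths and folder layout below are hypothetical). Note that the authors selected test images manually from separate bushes, whereas this sketch splits randomly for simplicity.

```python
import random
from pathlib import Path
from PIL import Image

SRC = Path("acheny_raw")    # assumed layout: acheny_raw/<species_name>/*.jpg
DST = Path("acheny_split")  # output: acheny_split/{train,val,test}/<species>/

random.seed(0)
for species_dir in sorted(SRC.iterdir()):
    images = sorted(species_dir.glob("*.jpg"))
    random.shuffle(images)  # the paper picks test bushes manually; random here
    n = len(images)
    n_train = int(0.72 * n)  # 72% training, 18% validation, 10% testing
    n_val = int(0.18 * n)
    for i, img_path in enumerate(images):
        split = "train" if i < n_train else "val" if i < n_train + n_val else "test"
        out_dir = DST / split / species_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        # Resize to the 224 x 224 input size used by many ImageNet-style models.
        Image.open(img_path).convert("RGB").resize((224, 224)).save(out_dir / img_path.name)
```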
Project description: Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle-picking process is labor-intensive and time-consuming. Though emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by a lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis, to address this bottleneck. It consists of manually labelled cryo-EM micrographs from 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation against the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.
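A hedged sketch of how such labelled coordinates might be consumed, assuming micrographs in MRC format (readable with the mrcfile package) and per-micrograph CSV files with X/Y particle-centre columns; the column names and box size here are assumptions, so check the actual schema in the GitHub repository above.

```python
import csv
import numpy as np
import mrcfile  # standard reader for .mrc micrographs (pip install mrcfile)

def load_particle_coords(coord_csv: str) -> list:
    """Read particle centres from a CSV; the 'X'/'Y' headers are hypothetical."""
    with open(coord_csv, newline="") as f:
        return [(float(row["X"]), float(row["Y"])) for row in csv.DictReader(f)]

def crop_particles(micrograph_path: str, coords: list, box: int = 256) -> np.ndarray:
    """Cut a (box x box) patch around each labelled particle centre."""
    with mrcfile.open(micrograph_path, permissive=True) as mrc:
        img = np.asarray(mrc.data, dtype=np.float32)
    half = box // 2
    patches = []
    for x, y in coords:
        r, c = int(round(y)), int(round(x))
        # Keep only particles whose full box lies inside the micrograph.
        if half <= r < img.shape[0] - half and half <= c < img.shape[1] - half:
            patches.append(img[r - half:r + half, c - half:c + half])
    return np.stack(patches) if patches else np.empty((0, box, box), dtype=np.float32)
```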
Project description: This data article provides details for the RDD2020 dataset, comprising 26,336 road images from India, Japan, and the Czech Republic with more than 31,000 instances of road damage. The dataset captures four types of road damage: longitudinal cracks, transverse cracks, alligator cracks, and potholes; it is intended for developing deep learning-based methods to detect and classify road damage automatically. The images in RDD2020 were captured using vehicle-mounted smartphones, making the dataset useful for municipalities and road agencies developing low-cost methods for monitoring road pavement surface conditions. Further, machine learning researchers can use the dataset to benchmark the performance of different algorithms on problems of the same type (image classification, object detection, etc.). RDD2020 is freely available at [1]. The latest updates and the corresponding articles related to the dataset can be accessed at [2].
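Assuming the annotations follow the PASCAL VOC XML layout used in the road-damage-detection challenges, a minimal parser might look like the sketch below. The class-code mapping is an assumption to verify against the dataset's documentation.

```python
import xml.etree.ElementTree as ET

# Assumed class codes for the four damage types described above:
# D00 longitudinal crack, D10 transverse crack, D20 alligator crack, D40 pothole.
DAMAGE_TYPES = {
    "D00": "longitudinal crack",
    "D10": "transverse crack",
    "D20": "alligator crack",
    "D40": "pothole",
}

def parse_annotation(xml_path: str) -> list:
    """Parse one PASCAL VOC-style XML file into (label, bbox) pairs."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        box = tuple(int(float(bb.findtext(k))) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((DAMAGE_TYPES.get(label, label), box))
    return boxes
```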
Project description: Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images; however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10⁶ unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation task and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
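A minimal transfer-learning sketch in PyTorch, assuming a released ResNet-50 checkpoint (the file name and two-class head below are hypothetical): load the pre-trained weights, freeze the encoder, and fine-tune only a small task head on a new annotated set.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Hypothetical checkpoint name; the released pre-trained weights and their
# exact key layout should be taken from the links above.
ckpt = torch.load("cem500k_pretrained_resnet50.pth", map_location="cpu")

encoder = resnet50(weights=None)
encoder.load_state_dict(ckpt, strict=False)  # tolerate missing head keys

# Freeze the pre-trained features; one common recipe when the downstream
# annotated EM set is small.
for p in encoder.parameters():
    p.requires_grad = False

# Replace the classifier head for the new task (2 classes as an example).
encoder.fc = nn.Linear(encoder.fc.in_features, 2)
optimizer = torch.optim.Adam(encoder.fc.parameters(), lr=1e-3)
```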
Project description: Deep neural networks provide the current best models of visual information processing in the primate brain. Drawing on work from computer vision, the most commonly used networks are pretrained on data from the ImageNet Large Scale Visual Recognition Challenge. This dataset comprises images from 1,000 categories, selected to provide a challenging testbed for automated visual object recognition systems. Moving beyond this common practice, here we introduce ecoset, a collection of >1.5 million images from 565 basic-level categories selected to better capture the distribution of objects relevant to humans. Ecoset categories were chosen to be both frequent in linguistic usage and concrete, thereby mirroring important physical objects in the world. We test the effects of training on this ecologically more valid dataset using multiple instances of two neural network architectures: AlexNet and vNet, a novel architecture designed to mimic the progressive increase in receptive field sizes along the human ventral stream. We show that training on ecoset leads to significant improvements in predicting representations in human higher-level visual cortex and perceptual judgments, surpassing the previous state of the art. Significant and highly consistent benefits are demonstrated for both architectures on two separate functional magnetic resonance imaging (fMRI) datasets and behavioral data, jointly covering responses to 1,292 visual stimuli from a wide variety of object categories. These results suggest that computational visual neuroscience may take better advantage of the deep learning framework by using image sets that reflect the human perceptual and cognitive experience. Ecoset and trained network models are openly available to the research community.
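A minimal sketch of supervised training on ecoset with torchvision, assuming the images are arranged one folder per category (the directory path and hyperparameters are assumptions). AlexNet is one of the two architectures studied; vNet is a custom architecture not available in torchvision, so it is omitted here.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: ecoset/train/<category>/*.jpg (565 basic-level categories).
tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("ecoset/train", transform=tfm)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

model = models.alexnet(num_classes=565)  # 565 ecoset categories
model.train()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for images, labels in loader:  # one pass; real training runs many epochs
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```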
Project description: Background and purpose: COVID-19 is a new viral strain that has brought life to a standstill worldwide. At this time, the new coronavirus COVID-19 is spreading rapidly across the world and poses a threat to people's health. Experimental medical tests and analyses have shown lung infection in almost all COVID-19 patients. Although computed tomography (CT) of the chest is a useful imaging method for diagnosing lung-related diseases, chest X-ray (CXR) is more widely available, mainly due to its lower cost and faster results. Deep learning (DL), one of the most popular artificial intelligence techniques, is an effective way to help doctors analyze the large numbers of CXR images that are crucial to diagnostic performance. Materials and methods: In this article, we propose a novel perceptual two-layer image fusion method using DL to obtain more informative CXR images for a COVID-19 dataset. To assess the proposed algorithm's performance, we used a dataset of 87 CXR images acquired from 25 cases, all confirmed with COVID-19. Dataset preprocessing is needed to facilitate the role of convolutional neural networks (CNN); thus, a hybrid decomposition and fusion of the Nonsubsampled Contourlet Transform (NSCT) with CNN_VGG19 as feature extractor was used. Results: Our experimental results show that the algorithm established here can reliably generate images for imbalanced COVID-19 datasets. Compared to the original COVID-19 dataset, the fused images contain more features and characteristics. Six metrics are applied to evaluate the quality of the medical image fusion (MIF): QAB/F, QMI, PSNR, SSIM, SF, and STD. The proposed NSCT + CNN_VGG19 algorithm achieves the highest QMI, PSNR, and SSIM scores, and its fused images retain the most features and characteristics. We deduce that the proposed fusion algorithm is efficient enough to generate CXR COVID-19 images that are more useful for the examiner to explore patient status. Conclusions: A novel image fusion algorithm using DL for an imbalanced COVID-19 dataset is the key contribution of this work. Extensive experimental results show that the proposed NSCT + CNN_VGG19 algorithm outperforms competing image fusion algorithms.
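A hedged sketch of the reference-based portion of such an evaluation, using scikit-image for PSNR and SSIM plus the standard spatial-frequency (SF) formula; QAB/F and QMI require fusion-specific implementations and are omitted here. The synthetic arrays at the end stand in for a source CXR and its fused output.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def spatial_frequency(img: np.ndarray) -> float:
    """SF: root of the mean squared row- and column-direction gradients."""
    rf = np.diff(img, axis=1)  # row-direction first differences
    cf = np.diff(img, axis=0)  # column-direction first differences
    return float(np.sqrt(np.mean(rf ** 2) + np.mean(cf ** 2)))

def fusion_report(reference: np.ndarray, fused: np.ndarray) -> dict:
    """Score a fused image against a reference (both float arrays in [0, 1])."""
    return {
        "PSNR": peak_signal_noise_ratio(reference, fused, data_range=1.0),
        "SSIM": structural_similarity(reference, fused, data_range=1.0),
        "SF": spatial_frequency(fused),
        "STD": float(np.std(fused)),
    }

# Toy check with synthetic data; in practice pass a source CXR and its fusion.
rng = np.random.default_rng(0)
ref = rng.random((128, 128))
fused = np.clip(ref + 0.05 * rng.standard_normal(ref.shape), 0.0, 1.0)
print(fusion_report(ref, fused))
```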