Project description: Background: Multi-cellular segmentation of bright-field microscopy images is an essential computational step when quantifying collective cell migration in vitro. Despite the availability of various tools and algorithms, no publicly available benchmark has been proposed for evaluating and comparing the different alternatives. Description: A uniform framework is presented for benchmarking multi-cellular segmentation algorithms on bright-field microscopy images. A freely available set of 171 manually segmented images from diverse origins was partitioned into 8 datasets and evaluated with three leading dedicated tools. Conclusions: The presented benchmark resource for evaluating segmentation algorithms on bright-field images is the first public annotated dataset for this purpose. This annotated dataset of diverse examples allows fair evaluation and comparison of future segmentation methods. Scientists are encouraged to assess new algorithms on this benchmark and to contribute additional annotated datasets.
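A minimal sketch of how such a benchmark comparison could be scored: each algorithm's binary mask is compared with the manual annotation using the Jaccard index (intersection over union) and averaged over the dataset. The file-naming convention here is an assumption, not the benchmark's actual layout.

```python
import numpy as np
from pathlib import Path
from imageio.v3 import imread

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:          # both masks empty: treat as perfect agreement
        return 1.0
    return float(np.logical_and(pred, truth).sum() / union)

def benchmark(pred_dir: str, truth_dir: str) -> float:
    """Mean Jaccard over all images present in both directories
    (assumes predictions share file names with the manual masks)."""
    scores = []
    for truth_path in Path(truth_dir).glob("*.png"):
        pred = imread(Path(pred_dir) / truth_path.name) > 0
        truth = imread(truth_path) > 0
        scores.append(jaccard(pred, truth))
    return float(np.mean(scores))
```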
Project description: This paper presents a large publicly available multi-center lumbar spine magnetic resonance imaging (MRI) dataset with reference segmentations of vertebrae, intervertebral discs (IVDs), and spinal canal. The dataset includes 447 sagittal T1 and T2 MRI series from 218 patients with a history of low back pain and was collected from four different hospitals. An iterative data annotation approach was used by training a segmentation algorithm on a small part of the dataset, enabling semi-automatic segmentation of the remaining images. The algorithm provided an initial segmentation, which was subsequently reviewed, manually corrected, and added to the training data. We provide reference performance values for this baseline algorithm and nnU-Net, which performed comparably. Performance values were computed on a sequestered set of 39 studies with 97 series, which were additionally used to set up a continuous segmentation challenge that allows for a fair comparison of different segmentation algorithms. This study may encourage wider collaboration in the field of spine segmentation and improve the diagnostic value of lumbar spine MRI.
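The iterative annotation approach described above follows a generic model-in-the-loop pattern; a schematic sketch is given below. The `train` and `review` callables stand in for the authors' segmentation algorithm and expert correction step, which are not specified here, so this is an illustration of the loop structure only.

```python
def iterative_annotation(unlabelled, seed_labelled, train, review,
                         rounds=5, batch_size=50):
    """train: callable fitting a segmentation model on (image, mask) pairs;
    review: callable returning an expert-corrected mask for (image, draft)."""
    labelled = list(seed_labelled)
    model = train(labelled)                  # bootstrap on a small labelled set
    for _ in range(rounds):
        batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
        for image in batch:
            draft = model.predict(image)     # initial automatic segmentation
            labelled.append((image, review(image, draft)))  # manual correction
        model = train(labelled)              # retrain on the enlarged set
    return model, labelled
```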
Project description: Purpose: To generate the first open dataset of retinal parafoveal optical coherence tomography angiography (OCTA) images with associated ground-truth manual segmentations, and to establish a standard for OCTA image segmentation by surveying a broad range of state-of-the-art vessel enhancement and binarization procedures. Methods: Handcrafted filters and neural network architectures were used to perform vessel enhancement. Thresholding methods and machine learning approaches were applied to obtain the final binarization. Evaluation was performed using pixelwise metrics and newly proposed topological metrics. Finally, we compared the error in the computation of clinically relevant vascular network metrics (e.g., foveal avascular zone area and vessel density) across segmentation methods. Results: Our results show that, for the set of images considered, deep learning architectures (U-Net and CS-Net) achieve the best performance (Dice = 0.89). For applications where manually segmented data are not available to retrain these approaches, our findings suggest that optimally oriented flux (OOF) is the best handcrafted filter (Dice = 0.86). Moreover, our results show up to 25% differences in vessel density accuracy depending on the segmentation method used. Conclusions: In this study, we derive and validate the first open dataset of retinal parafoveal OCTA images with associated ground-truth manual segmentations. Our findings should be taken into account when comparing the results of clinical studies and performing meta-analyses. Finally, we release our data and source code to support standardization efforts in OCTA image segmentation. Translational relevance: This work establishes a standard for OCTA retinal image segmentation and underscores the importance of evaluating segmentation performance in terms of clinically relevant metrics.
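The Dice scores quoted above compare a binary vessel segmentation with the manual ground truth; a minimal NumPy version of this pixelwise metric, assuming binary masks of equal shape:

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient: 2|A intersect B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:           # both masks empty: treat as perfect agreement
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

# Vessel density, one of the clinical metrics mentioned, is simply the
# fraction of pixels classified as vessel: pred.astype(bool).mean()
```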
Project description: We present a new approach to segment and classify bacterial spore layers in Transmission Electron Microscopy (TEM) images using a hybrid Convolutional Neural Network (CNN) and Random Forest (RF) classifier algorithm. This approach utilizes deep learning, with the CNN extracting features from the images and the RF classifier using those features for classification. The proposed model achieved 73% accuracy, 64% precision, 46% sensitivity, and a 47% F1-score on test data. Compared with other classifiers such as AdaBoost, XGBoost, and SVM, our proposed model demonstrates greater robustness and higher generalization ability for non-linear segmentation. Our model can also identify spores with a damaged core, as verified on TEM images of chemically exposed spores. The proposed method will therefore be valuable for identifying and characterizing spore features in TEM images, reducing labor-intensive work as well as human bias.
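A minimal sketch of the hybrid pattern described above: a CNN serves purely as a feature extractor, and a Random Forest is fitted on the extracted vectors. The ResNet-18 backbone and input size here are assumptions for illustration, not the authors' network.

```python
import numpy as np
import torch
from torchvision import models
from sklearn.ensemble import RandomForestClassifier

# Pretrained backbone with its classification head removed (assumption:
# ResNet-18; the paper's actual CNN architecture is not specified here).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224) float tensor -> (N, 512) feature matrix."""
    return backbone(images).numpy()

# Usage, given tensors of TEM crops and integer layer labels:
# clf = RandomForestClassifier(n_estimators=200)
# clf.fit(extract_features(X_train), y_train)
# predictions = clf.predict(extract_features(X_test))
```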
Project description: Images and GPR files were examined using a novel saturation reduction method to determine whether accuracy could be improved by extending the dynamic range of saturated pixels. Three immunosignatures from human Valley Fever (Coccidioides) patients and three immunosignatures from human influenza vaccine recipients were examined to test an algorithm that extends the apparent dynamic range of a fluorescence image. These images had several saturated spots at a PMT setting of 70 and 100% laser power. The program compared the discrimination of Valley Fever from influenza under standard image processing versus segmentation with intensity estimation.
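The saturation reduction algorithm itself is not detailed in this description. As a hedged illustration of the general idea, the true peak of a saturated spot can be extrapolated by fitting a spot model (a 2-D Gaussian in this sketch) to the unsaturated pixels only; the function names, spot model, and 16-bit ceiling are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

SATURATION = 65535  # assumed 16-bit scanner ceiling

def gauss2d(xy, amp, x0, y0, sigma):
    """Isotropic 2-D Gaussian spot model."""
    x, y = xy
    return amp * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))

def estimate_peak(spot: np.ndarray) -> float:
    """Fit only pixels below the ceiling; the fitted amplitude may exceed
    the ceiling, extending the apparent dynamic range of the spot."""
    ys, xs = np.indices(spot.shape)
    ok = spot < SATURATION
    p0 = (float(spot[ok].max()), spot.shape[1] / 2, spot.shape[0] / 2, 2.0)
    popt, _ = curve_fit(gauss2d, (xs[ok], ys[ok]),
                        spot[ok].astype(float), p0=p0)
    return popt[0]
```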
Project description: Materials discovery via machine learning has become an increasingly popular approach because of its ability to predict materials properties rapidly and at low cost. However, one limitation in this field is the lack of benchmark datasets, particularly datasets that span the range of sizes, tasks, material systems, and data modalities present in the materials informatics literature. This makes it difficult to identify optimal machine learning choices, including algorithm, model architecture, data splitting, and data featurization, for a given task. Here, we attempt to address this gap by assembling a unique repository of 50 different datasets for materials properties. The repository contains both experimental and computational data, data suited for regression as well as classification, dataset sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. Data were extracted from 16 publications. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test portions. For datasets with more than 100 entries, train-validation-test splits were created using 5-fold or 10-fold cross-validation, matching the protocol of each source publication. Datasets with fewer than 100 entries received train-test splits created using leave-one-out cross-validation. These benchmark data can serve as the basis for a more diverse benchmark dataset in the future, further improving the comparison of machine learning models.
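A sketch of the size-dependent splitting rule described above. The exact fold count per dataset follows each source publication, so the threshold logic here is illustrative only.

```python
from sklearn.model_selection import KFold, LeaveOneOut

def make_splits(n_samples: int, n_folds: int = 5):
    """K-fold CV for datasets with more than 100 samples,
    leave-one-out CV otherwise. Returns (train, test) index pairs."""
    X = list(range(n_samples))
    if n_samples > 100:
        splitter = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    else:
        splitter = LeaveOneOut()
    return list(splitter.split(X))

# e.g. make_splits(6354) -> 5 train/test index pairs
#      make_splits(12)   -> 12 pairs, one held-out sample each
```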
Project description: Background: Benchmark datasets are essential for both method development and performance assessment. These datasets have numerous requirements, representativeness being one. In the case of variant tolerance/pathogenicity prediction, representativeness means that the dataset covers the space of variations and their effects. Results: We performed the first analysis of the representativeness of variation benchmark datasets. We used statistical approaches to investigate how representative the proteins in the benchmark datasets were of the entire human protein universe. We investigated the distributions of variants across chromosomes, protein structures, CATH domains and classes, Pfam protein families, Enzyme Commission (EC) classifications, and Gene Ontology annotations in 24 datasets that have been used for training and testing variant tolerance prediction methods. All the datasets are available in the VariBench or VariSNP databases. We also tested whether the pathogenic variant datasets contained neutral variants, defined as those with a high minor allele frequency in the ExAC database. The distributions of variants over chromosomes and proteins varied greatly between the datasets. Conclusions: None of the datasets was found to be well representative, although many had quite good coverage of the different protein characteristics. Dataset size correlates with representativeness but only weakly with the performance of methods trained on them. The results imply that dataset representativeness is an important factor that should be taken into account in predictor development and testing.
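A sketch of the neutrality check described above: variants labelled pathogenic in a benchmark are flagged if their minor allele frequency in a population database (ExAC in the study) is too high to be plausibly disease-causing. The 1% cut-off and the data structures here are assumptions for illustration.

```python
def flag_common_variants(pathogenic_ids, maf_by_id, maf_cutoff=0.01):
    """pathogenic_ids: iterable of variant IDs labelled pathogenic in a
    benchmark dataset; maf_by_id: dict mapping variant ID -> minor allele
    frequency from a population database (e.g. ExAC). Returns the IDs
    that look neutral by frequency despite the pathogenic label."""
    return [v for v in pathogenic_ids
            if maf_by_id.get(v, 0.0) > maf_cutoff]
```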
Project description: Applying deep learning to images of cropping systems provides new knowledge and insights for research and commercial applications. Semantic segmentation, or pixel-wise classification, of ground-level RGB images into vegetation and background is a critical step in the estimation of several canopy traits. Current state-of-the-art methodologies based on convolutional neural networks (CNNs) are trained on datasets acquired under controlled or indoor conditions. These models are unable to generalize to real-world images and hence need to be fine-tuned on new labelled datasets. This motivated the creation of the VegAnn (Vegetation Annotation) dataset, a collection of 3775 multi-crop RGB images acquired at different phenological stages using different systems and platforms under diverse illumination conditions. We anticipate that VegAnn will help improve segmentation algorithm performance, facilitate benchmarking, and promote large-scale research on crop vegetation segmentation.
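A dataset like this also allows benchmarking against simple handcrafted baselines. One classic baseline for vegetation/background segmentation, shown here as a minimal sketch rather than the paper's CNNs, is thresholding the excess-green (ExG) index; the 0.1 threshold is an assumption.

```python
import numpy as np

def exg_vegetation_mask(rgb: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """rgb: (H, W, 3) array with values in [0, 255].
    Returns a boolean mask where True marks vegetation pixels."""
    norm = rgb.astype(float) / 255.0
    r, g, b = norm[..., 0], norm[..., 1], norm[..., 2]
    exg = 2.0 * g - r - b            # excess-green index
    return exg > threshold
```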
Project description: Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled through an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole-genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatics tools deliver the major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable identification and classification. Additionally, the pandemic has required public health laboratories to rapidly reach high-throughput proficiency in sequencing library preparation and downstream data analysis. However, both processes can be limited by the lack of a standardized sequence dataset. Methods: We identified six SARS-CoV-2 sequence datasets from recent publications, public databases, and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we used a previously published dataset format, which captures accession information and whole-dataset information. Additionally, a script from the same publication was enhanced to download and verify all data from this study. Results: The benchmark datasets focus on the two most widely used sequencing platforms: long-read sequencing data from the Oxford Nanopore Technologies platform and short-read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived by mining public databases to answer common questions not covered by published datasets; and one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script, and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2. Discussion: The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.
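The repository ships its own enhanced download-and-verify script, which is not reproduced here. As a generic, hypothetical sketch of the same workflow, run accessions listed in a dataset summary table could be fetched with the SRA Toolkit; the tab-separated layout and the "SRR" column name are assumptions about the table format.

```python
import csv
import subprocess

def fetch_runs(table_path: str, outdir: str = "reads"):
    """Download every SRA run listed in a dataset summary TSV using the
    SRA Toolkit (prefetch caches the run; fasterq-dump extracts FASTQs)."""
    with open(table_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            accession = row["SRR"]   # assumed column name
            subprocess.run(["prefetch", accession], check=True)
            subprocess.run(["fasterq-dump", accession, "-O", outdir],
                           check=True)
```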