Dataset Information

Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990-2023).


ABSTRACT: This paper details the acquisition, structure and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset containing data from 75,382 open access PubMed Central articles published between 1990 and 2023. The dataset includes 96,428 clinical cases and 135,596 images, together with their corresponding labels and captions. Data extraction was performed using several APIs and packages: Biopython, requests, BeautifulSoup, the BioC API for PMC and the Europe PMC RESTful API. Image labels were derived from the contents of the corresponding captions, using Spark NLP for Healthcare and manual annotations. Images were preprocessed with OpenCV to remove borders and to split figures containing multiple images. The data were then analyzed and described, and a randomly selected subset was used for quality assessment. The dataset's structure allows different types of data to be integrated, making it a valuable resource for training or fine-tuning medical language, computer vision or multimodal models.
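
As an illustration of the kind of request the extraction step might involve, the sketch below builds a search URL for the public Europe PMC RESTful API. It is not the authors' code: the function name, the query string (publication type, open access flag, year range) and the paging values are assumptions chosen to mirror the abstract's description, and the actual search used for MultiCaRe may differ.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_case_report_query(first_year=1990, last_year=2023,
                            page_size=100, cursor_mark="*"):
    """Return a search URL for open access case reports (hypothetical helper).

    The query terms below are an assumption for illustration only;
    they are not taken from the MultiCaRe paper.
    """
    query = (
        'PUB_TYPE:"case reports" AND OPEN_ACCESS:y '
        f"AND PUB_YEAR:[{first_year} TO {last_year}]"
    )
    params = {
        "query": query,
        "format": "json",            # ask for a JSON response
        "resultType": "lite",        # compact records
        "pageSize": page_size,       # results per page
        "cursorMark": cursor_mark,   # "*" requests the first page
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_case_report_query())
```

The resulting URL could be fetched with requests (one of the packages named in the abstract), passing each response's next cursor mark back in as `cursor_mark` to page through the results.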

SUBMITTER: Nievas Offidani MA 

PROVIDER: S-EPMC10792687 | biostudies-literature | 2024 Feb

REPOSITORIES: biostudies-literature

Publications

Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990-2023).

Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto

Data in Brief, 2023-12-23


Similar Datasets

S-EPMC7148228 | biostudies-literature
EMPIAR-12287 | biostudies-other
S-EPMC10951928 | biostudies-literature
S-EPMC7033320 | biostudies-literature
S-EPMC7063128 | biostudies-literature
S-EPMC10370840 | biostudies-literature
S-EPMC6309178 | biostudies-literature
S-EPMC10139909 | biostudies-literature
S-EPMC9508436 | biostudies-literature
S-EPMC6906728 | biostudies-literature