Dataset Information

High-Resolution Swin Transformer for Automatic Medical Image Segmentation.

ABSTRACT: The resolution of feature maps is a critical factor for accurate medical image segmentation. Most existing Transformer-based networks for medical image segmentation adopt a U-Net-like architecture: an encoder converts the high-resolution input image into low-resolution feature maps using a sequence of Transformer blocks, and a decoder gradually generates high-resolution representations from those low-resolution feature maps. However, recovering high-resolution representations from low-resolution ones may harm the spatial precision of the generated segmentation masks. Unlike previous studies, we adopted the high-resolution network (HRNet) design and replaced its convolutional layers with Transformer blocks, which continuously exchange information between the feature maps of different resolutions that they generate. The proposed Transformer-based network is named the high-resolution Swin Transformer network (HRSTNet). Extensive experiments demonstrated that HRSTNet can achieve performance comparable with that of the state-of-the-art Transformer-based U-Net-like architecture on the 2021 Brain Tumor Segmentation dataset, the Medical Segmentation Decathlon's liver dataset, and the BTCV multi-organ segmentation dataset.
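
The exchange of information between resolutions described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it works on 1-D lists instead of image feature maps, and the function names, the pair-averaging downsampling, the nearest-neighbour upsampling, and the summation fusion are all assumptions chosen for illustration (summation follows the general HRNet idea). In HRSTNet the streams would be produced by Swin Transformer blocks.

```python
from typing import List, Tuple


def downsample(x: List[float]) -> List[float]:
    """Halve the resolution by averaging adjacent pairs (assumes even length)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def upsample(x: List[float]) -> List[float]:
    """Double the resolution by repeating each value (nearest neighbour)."""
    return [v for v in x for _ in range(2)]


def exchange(high: List[float], low: List[float]) -> Tuple[List[float], List[float]]:
    """Fuse two parallel streams; `low` must have half the length of `high`.

    Each stream keeps its own resolution and receives the other stream's
    features after resampling. Summation is an assumed fusion rule.
    """
    new_high = [h + u for h, u in zip(high, upsample(low))]
    new_low = [l + d for l, d in zip(low, downsample(high))]
    return new_high, new_low


high = [1.0, 3.0, 5.0, 7.0]   # hypothetical high-resolution stream
low = [10.0, 20.0]            # hypothetical low-resolution stream
new_high, new_low = exchange(high, low)
```

The point of the sketch is that the high-resolution stream is never discarded and rebuilt; it is kept throughout and only enriched by the coarser stream.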

SUBMITTER: Wei C

PROVIDER: S-EPMC10099222 | biostudies-literature | 2023 Mar

REPOSITORIES: biostudies-literature

Publications

High-Resolution Swin Transformer for Automatic Medical Image Segmentation.

Wei Chen, Ren Shenghan, Guo Kaitai, Hu Haihong, Liang Jimin

Sensors (Basel, Switzerland), 2023-03-24, issue 7

PMID: 37050479

Similar Datasets

Project description:

Background: Medical image segmentation is crucial for improving healthcare outcomes. Convolutional neural networks (CNNs) have been widely applied in medical image analysis; however, their inherent inductive biases limit their ability to capture global contextual information. Vision transformer (ViT) architectures address this limitation by leveraging attention mechanisms to model global relationships; however, they typically require large-scale datasets for effective training, which is challenging in the field of medical imaging due to limited data availability. This study aimed to integrate the advantages of CNN and ViT architectures to improve segmentation performance on small-scale medical image datasets.

Methods: In this study, we established a U-shaped network architecture based on a Transformer-assisted convolutional neural network (TAC-UNet). The TAC-UNet is primarily composed of a hybrid structure integrating CNN and Transformer components. Specifically, the hybrid architecture follows a dual-path design in which the Transformer branch continuously conveys global contextual information to the CNN backbone. This allows the CNN backbone to enhance its global perception while building on the local features it extracts, thereby improving its ability to comprehend complex image structures. A channel cross-attention (CCA) module is also incorporated as a bridge between the encoder and decoder to better reconcile the semantic discrepancies between them.

Results: Detailed experiments on three public datasets were conducted. Specifically, our model was trained on 30 images from the Multi-organ Nucleus Segmentation (MoNuSeg) training dataset, 85 images from the Gland Segmentation (GlaS) training dataset, and 551 images from the Computer Vision Center Colorectal Cancer-Clinic Database (CVC-ClinicDB) dataset. We evaluated the performance of our model on the corresponding test sets. Our TAC-UNet achieved the best Dice scores (80.36%, 90.70%, and 91.81% on the MoNuSeg, GlaS, and CVC-ClinicDB datasets, respectively) of all the models. Compared to other CNN-based, Transformer-based, and hybrid methods, the TAC-UNet demonstrated significantly superior segmentation performance.

Conclusions: Our TAC-UNet model showed advanced segmentation performance on small-scale medical image datasets. The detailed experimental results showed the effectiveness of the method. Our model's code is available at: https://github.com/hejlhello/TAC-UNet.
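
For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of how it is commonly computed for a pair of binary masks, Dice = 2|P ∩ T| / (|P| + |T|). The function name, the flattened 0/1 list representation, and the handling of two empty masks are assumptions made for illustration; the evaluation code used in the study may differ.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Elementwise product counts positions where both masks are 1.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]     # hypothetical predicted mask
target = [1, 0, 0, 0, 1, 1]   # hypothetical ground-truth mask
score = dice_score(pred, target)
```

A value of 1.0 means the two masks coincide exactly and 0.0 means they do not overlap; the percentages in the description correspond to this value multiplied by 100.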

Project description: Transformers have demonstrated significant promise for computer vision tasks. Particularly noteworthy is SwinUNETR, a model based on vision transformers that has markedly advanced medical image segmentation. Nevertheless, the training efficiency of SwinUNETR has been constrained by an extended training duration, a limitation primarily attributable to the attention mechanism within the architecture. In this article, to address this limitation, we introduce a novel framework called the MetaSwin model. Drawing inspiration from the MetaFormer concept, which uses other token-mixing operations, we replace the attention-based components of SwinUNETR with a straightforward yet impactful spatial pooling operation. Additionally, we incorporate Squeeze-and-Excitation (SE) blocks after each MetaSwin block of the encoder and into the decoder, with the aim of improving segmentation performance. We evaluate the proposed MetaSwin model on two distinct medical datasets, namely BraTS 2023 and MICCAI 2015 BTCV, and conduct a comprehensive comparison with two baselines, i.e., the SwinUNETR and SwinUNETR+SE models. Our results emphasize the effectiveness of MetaSwin, showcasing its competitive edge against the baselines while using a simple pooling operation and efficient SE blocks. MetaSwin's consistent and superior performance on the BTCV dataset, in comparison to SwinUNETR, is particularly significant. For instance, with a model size of 24, MetaSwin outperforms SwinUNETR's 76.58% Dice score using fewer parameters (15,407,384 vs 15,703,304) and a substantially reduced training time (300 vs 467 min), achieving an improved Dice score of 79.12%. This research highlights the contribution of a simplified transformer framework that incorporates basic elements such as pooling and SE blocks, and emphasizes their potential to guide the progression of medical segmentation models without relying on complex attention-based mechanisms.
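
The idea of replacing attention with a pooling operation as the token mixer can be sketched as follows. This is a toy illustration under stated assumptions, not the MetaSwin code: tokens are scalars in a 1-D sequence rather than feature vectors on a spatial grid, and the function name, the window size of 3, and the choice to average only over in-bounds neighbours are hypothetical.

```python
from typing import List


def pool_mix(tokens: List[float], window: int = 3) -> List[float]:
    """Mix each token with its neighbours by local average pooling.

    Unlike attention, the mixing weights are fixed and data-independent,
    so the operation has no learnable parameters.
    """
    half = window // 2
    mixed = []
    for i in range(len(tokens)):
        # Clip the window at the sequence borders (no zero padding).
        lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
        neighbourhood = tokens[lo:hi]
        mixed.append(sum(neighbourhood) / len(neighbourhood))
    return mixed


tokens = [0.0, 3.0, 6.0, 9.0]   # hypothetical token values
mixed = pool_mix(tokens)
```

Because each output depends only on a fixed local window, the cost grows linearly with the number of tokens, which is consistent with the shorter training time reported in the description.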
