2025 Proffered Presentations
S291: DEEP LEARNING CONSENSUS-BASED FRAMEWORK FOR THE ANNOTATION OF A ROUTINE CLINICAL VESTIBULAR SCHWANNOMA MRI DATASET
Navodini Wijethilake1; Marina Ivory1; Oscar MacCormac1; Siddhant Kumar2; Steve Connor3; Soumya Singdha Kundu1; Theodore Barfoot1; Aaron Kujawa1; Tom Vercauteren1; Jonathan Shapey1; 1King's College London; 2The Walton Centre NHS Foundation Trust, Liverpool, United Kingdom; 3King's College Hospital, London, United Kingdom
Introduction: Data annotation is critical for developing machine learning models in medical imaging, where annotation accuracy directly affects model performance. However, obtaining high-quality annotations is costly and requires clinical expertise. Delineating Vestibular Schwannoma (VS) in Magnetic Resonance Imaging (MRI) is particularly challenging due to variability in tumor size, patient anatomy, and the heterogeneity of retrospective data, especially when VS coexists with other pathologies such as meningioma. Accurate labeling is essential to avoid confounding factors that could hinder model performance.
Methodology: Previously, we used a labor-intensive and costly iterative pipeline to manually annotate heterogeneous scans from multiple institutions, referred to as the multi-center routine clinical (MC-RC) VS dataset (UCLH-MC-RC). In this study, using the UCLH-MC-RC and two additional single-center gamma knife (SC-GK) datasets (LDN-SC-GK, ETZ-SC-GK), we annotated a new MC-RC dataset (KCH-MC-RC). To achieve this, we introduced an iterative pipeline with deep learning-based segmentation to reduce both the annotators' workload and inter-rater variability (Figure 1).
Figure 1. Iterative deep learning consensus-based framework.
We utilised the default 3D full-resolution U-Net configuration of nnU-Net for segmentation. The initial training dataset, comprising expert-annotated images from three datasets (UCLH-MC-RC, LDN-SC-GK, ETZ-SC-GK), was used to train the model (Table 1). With each round, the model was bootstrapped by incorporating additional accepted cases from the KCH-MC-RC dataset.
Table 1. Distribution of data between training, validation and testing sets used in each round.
In Round 1 of model training, 427 scans were processed and quality assessed by three independent experts, as shown in Figure 2. A consensus meeting involving a consultant neurosurgeon (J.S.) was subsequently convened to review complex scans.
Figure 2. Data annotation pipeline.
After the consensus meeting, accepted KCH-MC-RC cases were combined with the initial training data to enhance the segmentation model through bootstrapping (Table 1). Rejected sessions were then reprocessed using this bootstrapped model. An expert radiologist manually assessed the Round 2 annotations, accepting or correcting them using the ITK-SNAP annotation tool.
In Round 3, accepted and corrected cases from Round 2 were added to the previously accepted cases from Round 1 and combined with the initial training dataset to further refine the model through bootstrapping.
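The three rounds described above amount to a loop in which newly accepted cases grow the training pool and the model is retrained before reprocessing the remaining rejected cases. The sketch below captures only that loop structure; `train_model` and `expert_review` are hypothetical stand-ins for nnU-Net training and the expert/consensus quality-assessment steps, which this abstract does not specify in code form.

```python
# Sketch of the iterative bootstrapping loop described in the text.
# `train_model` and `expert_review` are hypothetical stand-ins for
# nnU-Net training and expert quality assessment, respectively.

def bootstrap(initial_training, unannotated, n_rounds, train_model, expert_review):
    """Grow the training pool over several rounds by adding accepted
    model-generated annotations and reprocessing rejected cases."""
    training_pool = list(initial_training)
    pending = list(unannotated)
    for _ in range(n_rounds):
        model = train_model(training_pool)              # retrain on enlarged pool
        predictions = [(case, model(case)) for case in pending]
        accepted, rejected = expert_review(predictions) # expert QA / consensus
        training_pool.extend(accepted)                  # accepted cases join the pool
        pending = rejected                              # rejected cases are reprocessed
    return training_pool, pending
```

Each pass through the loop corresponds to one round in Table 1: the pool of accepted KCH-MC-RC cases grows, and only the still-rejected sessions are re-segmented by the next bootstrapped model.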
Two independent unseen test datasets were used to evaluate model performance of the bootstrapped models: 1) 50 cases drawn from the UCLH-MC-RC, ETZ-SC-GK, LDN-SC-GK datasets; and 2) 30 cases drawn from the KCH-MC-RC dataset.
Results: The bootstrapped models did not improve segmentation performance on the independent test set, but performance on the KCH-MC-RC validation set improved with each round (Figure 3).
Figure 3. Dice score of the model in each round on the test set and the KCH-MC-RC validation set.
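The performance in Figure 3 is reported as the Dice similarity coefficient, 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal NumPy implementation (the convention of returning 1.0 when both masks are empty is an assumption, though it is common in segmentation evaluation):

```python
import numpy as np

def dice_score(pred, ref):
    """Dice similarity coefficient between two binary segmentation masks."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0                     # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, ref).sum() / denom

pred = np.array([[1, 1, 0], [0, 1, 0]])
ref  = np.array([[1, 0, 0], [0, 1, 1]])
# overlap = 2 voxels, |pred| = 3, |ref| = 3  ->  Dice = 2*2 / 6 ≈ 0.667
```

A Dice score of 1 indicates perfect overlap with the reference annotation, and 0 indicates no overlap.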
Conclusion: This work demonstrated that iterative bootstrapping was effective in refining the model for the specific characteristics of the KCH-MC-RC dataset. This approach could improve a deep learning segmentation model’s accuracy and adaptability when dealing with complex, heterogeneous medical data.