Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer

Abstract

Multimodal deep learning (MDL) has emerged as a transformative approach in computational pathology. By integrating complementary information from multiple data sources, MDL models have demonstrated superior predictive performance across diverse clinical tasks compared to unimodal models. However, the assumption that combining modalities inherently improves performance remains largely unexamined. We hypothesise that multimodal gains depend critically on the predictive quality of individual modalities, and that integrating weak modalities may introduce noise rather than complementary information. We test this hypothesis on a prostate cancer dataset with histopathology, radiology, and clinical data to predict time-to-biochemical recurrence. Our results confirm that combining high-performing modalities yields superior performance compared to unimodal approaches. However, integrating a poor-performing modality with higher-performing modalities degrades predictive accuracy. These findings demonstrate that multimodal benefit requires selective, performance-guided integration rather than indiscriminate modality combination, with implications for MDL design across computational pathology and medical imaging.

1 Introduction

Multimodal deep learning (MDL) integrates heterogeneous data sources, such as whole-slide histopathology images, molecular profiles, imaging data, and clinical variables, for clinical tasks. MDL models capture complementary patterns across modalities that improve prognostic accuracy across various tasks [MDL_summary]. This has led to mean increases in area under the receiver operating characteristic curve (AUROC) of 6.4% over their single-modality counterparts [MDL_AUC], with certain MDL models reaching clinical-grade performance on par with human experts [MDL_human_experts]. This multimodal integration has enabled advances across diverse tasks, including disease subtyping, survival prediction, and biomarker discovery [MDL_diverse_tasks]. These advances position MDL as a critical tool for a variety of clinical tasks.

Contemporary fusion methodologies in medical imaging include early, intermediate, and late fusion [MDL_methods]. These strategies operate at distinct stages of the analysis pipeline: early fusion combines raw input data before feature extraction, intermediate fusion integrates features extracted from individual modalities, and late fusion combines predictions from separate unimodal models [fusion_strategies]. A recent systematic review identified intermediate fusion as the most promising approach in computational pathology due to its ability to effectively combine modality-specific features during training [Intermediate_Fusion_Benefits].

Intermediate fusion can be implemented through two primary approaches: marginal intermediate fusion and joint intermediate fusion [MDL_human_experts]. Marginal intermediate fusion combines features from different modalities without cross-modal interaction, through mechanisms such as element-wise operations (e.g. summation, multiplication), concatenation, or more complex tensor interactions such as bilinear fusion [IF_bilinear_fusion]. Joint intermediate fusion allows features to interact across modalities during combination, potentially capturing cross-modal dependencies and revealing more nuanced relationships between modalities that may be missed by simple combination methods. Joint intermediate fusion can be implemented in various ways, including architectures with self-attention and cross-attention layers [IF_CA_SA], and mechanisms that dynamically weight modality interactions [IF_weighted_modality].

Despite the success of multimodal fusion in computational pathology, the assumption that integrating additional modalities universally improves performance warrants critical examination. Recent studies in other fields have revealed that multimodal models may, in some cases, perform worse than their unimodal counterparts, particularly when modalities possess limited discriminative information and substantial noise [fusion_challenges]. To explore this, we systematically evaluated how integrating modalities with varying predictive capabilities affects model performance, hypothesising that high-performing modalities enhance performance while poor-performing ones degrade it.

Our main contributions are summarised as follows:

1. We investigate the hypothesis that integrating a low-performing modality with a high-performing one can degrade predictive accuracy for predicting time-to-biochemical recurrence (BCR) in prostate cancer.
2. We show that fusing high-performing modalities yields a synergistic benefit in prognostic performance, while indiscriminate fusion with weak modalities can be detrimental.
3. We advocate a performance-guided, selective approach to multimodal fusion in computational pathology, challenging the assumption that more modalities always help.
4. We implement a joint intermediate fusion architecture with cross-attention and self-attention to capture inter- and intra-modal dependencies, achieving superior accuracy.

2 Materials & Methods

2.1 Study Data

We utilised data from Task 1 of the CHIMERA Challenge [CHIMERA], which integrates histopathology, radiology, and clinical data to predict time-to-BCR in prostate cancer patients following radical prostatectomy. The dataset comprises 95 patient cases, each including:

1. Histopathology: H&E-stained Whole Slide Images (WSIs) of prostatectomy specimens (up to 12 slides per patient), scanned using a 3DHISTECH PANNORAMIC 1000 (0.5 µm/pixel).
2. Radiology: Preoperative imaging from Siemens 3T MRI scanners, including:
   - T2-Weighted (T2W) MRI
   - Apparent Diffusion Coefficient (ADC) maps
   - High B-Value (HBV) Diffusion-Weighted Imaging (DWI)
   - Prostate gland segmentation masks
3. Clinical Metadata: Patient-specific clinical features and outcome annotations (JSON format):
   - Demographics: Age at prostatectomy
   - Grading: Primary/Secondary/Tertiary Gleason, ISUP grade
   - Biomarkers: PSA (pre-surgery, at recurrence), BCR status
   - Pathology: pT-staging, invasion markers (LN, CP, SV, LV), surgical margins
   - History: Earlier therapy, time to last follow-up/BCR

This task involves modelling time-to-BCR as a survival analysis problem. The dataset exhibits right-censoring, where not all patients experienced recurrence within the follow-up period. Specifically, 27 patients (28.4%) had confirmed BCR events, while 68 patients (71.6%) were censored without observed recurrence.

2.2 Approach

Our analysis proceeded in three stages. First, we independently assessed each of the three modalities to quantify their baseline predictive performance for the clinical task. Second, we systematically evaluated all three possible pairwise combinations. Finally, we assessed the performance of a model combining all three modalities.
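The three stages above amount to evaluating every non-empty subset of the three modalities. A minimal sketch of that experiment grid follows; the modality labels and the function name `experiment_grid` are illustrative and are not taken from our codebase.

```python
from itertools import combinations

# Illustrative labels for the three modalities in the study.
MODALITIES = ("clinical", "histopathology", "radiology")

def experiment_grid(modalities):
    """Group every non-empty modality subset by the number of modalities it contains."""
    return {
        size: list(combinations(modalities, size))
        for size in range(1, len(modalities) + 1)
    }

grid = experiment_grid(MODALITIES)
for size, subsets in grid.items():
    print(size, subsets)
```

This gives three unimodal configurations, three pairwise configurations, and one configuration with all three modalities.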

2.2.1 Model Architecture

All models were implemented as multilayer perceptrons (MLPs) with three hidden layers, ReLU activations, and a normalisation layer. Hyperparameters (layer sizes, dropout rates) were tuned for each configuration. Models were trained using the Adam optimiser and a Cox proportional hazards loss function, and early stopping was applied to prevent overfitting. Performance was evaluated using the concordance index (C-Index) [C_Index] on hold-out test sets comprising 20% of the data within each fold of a 5-fold cross-validation scheme, repeated 10 times (with different random seeds) to ensure robust estimates.
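For readers unfamiliar with the Cox proportional hazards loss, the following is a minimal sketch of the negative partial log-likelihood in plain Python. The handling of tied event times (every patient with follow-up at least as long as the event time stays in the risk set) and the averaging over observed events are assumptions made for illustration; the function name `cox_ph_loss` and the toy values are hypothetical.

```python
import math

def cox_ph_loss(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-hazard per patient (higher means earlier recurrence)
    time  : follow-up time per patient
    event : 1 if BCR was observed, 0 if the patient was censored
    """
    total, n_events = 0.0, 0
    for i in range(len(risk)):
        if not event[i]:
            continue  # censored patients only contribute through risk sets
        # Risk set: everyone still under observation at patient i's event time.
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        shift = max(at_risk)  # log-sum-exp shift for numerical stability
        log_sum = shift + math.log(sum(math.exp(r - shift) for r in at_risk))
        total += risk[i] - log_sum
        n_events += 1
    return -total / n_events if n_events else 0.0

# Toy cohort: three patients, the second one censored.
time = [2.0, 4.0, 6.0]
event = [1, 0, 1]
good = cox_ph_loss([2.0, 0.0, -2.0], time, event)  # risk ordering matches outcomes
bad = cox_ph_loss([-2.0, 0.0, 2.0], time, event)   # risk ordering reversed
print(good < bad)
```

A correctly ordered risk prediction receives a lower loss than a reversed one, which is the property the optimiser exploits during training.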

2.2.2 Unimodal Models

We conducted a comprehensive evaluation of pretrained foundation models to determine the most effective feature extraction methods for histopathology and radiology data. For histopathology images, we tested different slide-level foundation models (TITAN [TITAN], CHIEF [CHIEF], PRISM [PRISM], MADELEINE [MADELEINE] implemented using the TRIDENT repository [TRIDENT_1][TRIDENT_2]), with random selection employed when multiple slides were available per patient. For radiology images, we leveraged a general-purpose 3D medical imaging encoder, specifically MedicalNet [MedicalNet], applying it to T2-weighted sequences. T2-weighted imaging offers superior contrast for lesion detection and is less affected by motion or diffusion artifacts compared to HBV and ADC sequences. For further comparison against a more interpretable pipeline, we also extracted handcrafted radiomic features (n=120), based on shape, intensity and texture [Radiomics]. For clinical data, numerical and categorical variables were manually processed into feature vectors. Categorical variables were one-hot encoded, with each feature vector component representing either a numerical field or a specific category.
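The clinical encoding described above can be sketched as follows. The field names, category levels, and example values are hypothetical placeholders rather than the actual keys of the CHIMERA JSON files.

```python
def encode_clinical(record, numeric_fields, categorical_levels):
    """Build a feature vector: numeric fields first, then one-hot categorical blocks."""
    vector = [float(record[field]) for field in numeric_fields]
    for field, levels in categorical_levels.items():
        # One component per category level; exactly one is set to 1.0.
        vector.extend(1.0 if record[field] == level else 0.0 for level in levels)
    return vector

# Hypothetical patient record and schema, for illustration only.
record = {"age": 63, "psa": 7.2, "isup": 3, "pt_stage": "3a"}
numeric_fields = ["age", "psa"]
categorical_levels = {"isup": [1, 2, 3, 4, 5], "pt_stage": ["2", "3a", "3b"]}

features = encode_clinical(record, numeric_fields, categorical_levels)
print(features)
```

Each component of the resulting vector corresponds either to a numerical field or to one specific category.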

Fig. 1: Overview of multimodal fusion strategies for predicting biochemical recurrence-free survival in prostate cancer. (a) Marginal intermediate fusion (dotted line): modality-specific features are concatenated without cross-modal interaction before prediction. (b) Joint intermediate fusion: features interact through cross-attention and self-attention layers to capture inter- and intra-modal dependencies prior to prediction.

We adopted intermediate fusion for our multimodal architecture based on theoretical considerations. Intermediate fusion addresses key limitations of alternative approaches for the clinical task of predicting time-to-BCR: early fusion would compromise the feature extraction capabilities of modality-specific foundation models by combining raw images before leveraging their domain-specific training, while late fusion would miss critical cross-modal interactions that may be essential for accurate BCR prediction. Thus, we tested two different fusion approaches: marginal and joint intermediate fusion. In both approaches, features from each modality were first standardised to the same dimension via linear projection to ensure equal representation and prevent any single modality from dominating the fusion.

2.2.3 Marginal Intermediate Fusion

For marginal intermediate fusion, features from each modality were encoded independently, concatenated, and passed to an MLP for prediction (see Fig. 1).
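A minimal NumPy sketch of this step is given below. It assumes a batch of four patients, hypothetical input dimensions (10 clinical features, 768 histopathology features), and a shared dimension of 32. Random matrices stand in for the learned linear projections, and the downstream MLP is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, out_dim, rng):
    """Linear projection to a shared dimension (random weights stand in for learned ones)."""
    weights = rng.normal(scale=x.shape[1] ** -0.5, size=(x.shape[1], out_dim))
    return x @ weights

def marginal_fusion(features, out_dim, rng):
    """Project each modality independently, then concatenate along the feature axis."""
    projected = [project(x, out_dim, rng) for x in features.values()]
    return np.concatenate(projected, axis=1)

# Hypothetical feature matrices of shape (patients, features).
features = {
    "clinical": rng.normal(size=(4, 10)),
    "histopathology": rng.normal(size=(4, 768)),
}
fused = marginal_fusion(features, out_dim=32, rng=rng)
print(fused.shape)
```

Because each modality is projected separately and the results are only concatenated, no component of the fused vector depends on more than one modality until the predictive MLP is applied.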

2.2.4 Joint Intermediate Fusion

In our joint intermediate fusion pipeline, encoded features from each modality were processed through successive cross-attention and self-attention layers before prediction (see Fig. 1). Cross-attention facilitates interaction between modalities, enabling features to query each other and establish correspondences before fusion. For example, histopathology features related to cellular proliferation may attend to radiology features capturing enhancement patterns indicative of similar biological processes.

After applying cross-attention across all modalities, self-attention was applied within each modality to model internal dependencies among feature components. This step helps create contextual representations where related elements can influence each other. For instance, histopathology features representing cell density may attend to those representing cell size, reflecting their joint role in cancer progression. Finally, the updated features were concatenated and passed through a predictive MLP.
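The sequence of cross-attention, self-attention, and concatenation can be sketched with single-head scaled dot-product attention. This is an illustrative simplification: it omits the learned query, key, and value projections, assumes each modality is represented as a small set of token vectors of equal width, and uses hypothetical shapes (four tokens of width eight per modality).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Single-head scaled dot-product attention without learned projections."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def joint_fusion(tokens):
    """tokens: dict mapping modality name to an (n_tokens, dim) array for one patient."""
    crossed = {}
    for name, x in tokens.items():
        # Cross-attention: this modality queries the tokens of all other modalities.
        others = np.concatenate([t for n, t in tokens.items() if n != name], axis=0)
        crossed[name] = attention(x, others, others)
    # Self-attention: tokens within each modality attend to one another.
    refined = {name: attention(x, x, x) for name, x in crossed.items()}
    # Concatenate the updated features into one vector for the predictive MLP.
    return np.concatenate([x.ravel() for x in refined.values()])

rng = np.random.default_rng(0)
tokens = {
    "histopathology": rng.normal(size=(4, 8)),
    "clinical": rng.normal(size=(4, 8)),
}
fused = joint_fusion(tokens)
print(fused.shape)
```

Unlike marginal fusion, every updated token here is already a mixture of information from the other modalities before the final concatenation.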

3 Results and Discussion

3.1 Unimodal Model Performance

Magnification and Patch Size	Model	C-Index
10 $\times$ 256	PRISM [PRISM]	0.7759 ± 0.1088
10 $\times$ 256	CHIEF [CHIEF]	0.7720 ± 0.0944
10 $\times$ 256	TITAN [TITAN]	0.6916 ± 0.1049
10 $\times$ 256	MADELEINE [MADELEINE]	0.7307 ± 0.1038
10 $\times$ 512	PRISM [PRISM]	0.7392 ± 0.1095
10 $\times$ 512	CHIEF [CHIEF]	0.7380 ± 0.1121
10 $\times$ 512	TITAN [TITAN]	0.6906 ± 0.1125
10 $\times$ 512	MADELEINE [MADELEINE]	0.6498 ± 0.1275
20 $\times$ 256	PRISM [PRISM]	0.7939 ± 0.1106
20 $\times$ 256	CHIEF [CHIEF]	0.6717 ± 0.1240
20 $\times$ 256	TITAN [TITAN]	0.7418 ± 0.1101
20 $\times$ 256	MADELEINE [MADELEINE]	0.7065 ± 0.1035
20 $\times$ 512	PRISM [PRISM]	0.7594 ± 0.1100
20 $\times$ 512	CHIEF [CHIEF]	0.6741 ± 0.1140
20 $\times$ 512	TITAN [TITAN]	0.7172 ± 0.1056
20 $\times$ 512	MADELEINE [MADELEINE]	0.7159 ± 0.1007

Table 1: Performance of histopathology foundation models across patch sizes and magnifications, reported as mean C-index ± standard deviation over repeated (10 repeats) 5-fold cross-validation experiments.

Model	C-Index
MedicalNet [MedicalNet]	0.5584 ± 0.1250
Radiomics [Radiomics]	0.5508 ± 0.1318

Table 2: Performance of radiology models over repeated (10 repeats) 5-fold cross-validation experiments. We present mean C-index ± standard deviation.

Modality	C-Index
Clinical	0.8037 ± 0.1034
Histopathology	0.7939 ± 0.1106
Radiology	0.5584 ± 0.1250

Table 3: Best results for each modality over repeated (10 repeats) 5-fold cross-validation experiments. We present mean C-index ± standard deviation.

To identify the most effective feature representations for predicting time-to-BCR, we evaluated a range of histopathology and radiology foundation models. For histopathology, we tested two patch sizes (256 $\times$ 256 and 512 $\times$ 512 pixels) at two magnifications (10 $\times$ and 20 $\times$). As shown in Table 1, PRISM (20 $\times$ magnification, 256 $\times$ 256 pixels) achieved the highest performance (C-Index = 0.7939). For radiology, MedicalNet yielded the best results (C-Index = 0.5584), though performance remained close to random (Table 2).

Table 3 summarises the best performing models per modality. Clinical data demonstrated the strongest predictive performance (C-Index = 0.8037), followed closely by histopathology (C-Index = 0.7939). Radiology, however, demonstrated limited predictive value (C-Index = 0.5584), with performance approaching the 0.5 baseline expected from uninformative features. Based on these results, we categorised clinical and histopathology as high-performing modalities for this task, and radiology as low-performing.
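To make the 0.5 baseline concrete, the following is a minimal sketch of Harrell's concordance index for right-censored data in plain Python. Pairs are comparable only when the patient with the shorter follow-up had an observed event, and tied risk scores count as half-concordant. The function name and the toy cohort are illustrative, and pairs with tied follow-up times are ignored for simplicity.

```python
def concordance_index(risk, time, event):
    """Harrell's C-index: fraction of comparable pairs ranked in the correct order."""
    concordant, comparable = 0.0, 0
    for i in range(len(risk)):
        if not event[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(risk)):
            if time[j] > time[i]:  # patient j was recurrence-free for longer than i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Toy cohort: four patients, the second one censored.
time = [2.0, 4.0, 6.0, 8.0]
event = [1, 0, 1, 1]

perfect = concordance_index([4.0, 3.0, 2.0, 1.0], time, event)
reversed_order = concordance_index([1.0, 2.0, 3.0, 4.0], time, event)
uninformative = concordance_index([0.0, 0.0, 0.0, 0.0], time, event)
print(perfect, reversed_order, uninformative)
```

A model that assigns the same risk to every patient scores exactly 0.5, which is the reference point against which the radiology result should be read.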

3.2 Multimodal Model Performance

Table 4 presents the results of multimodal fusion using both marginal and joint fusion strategies. Fusing the two high-performing modalities (histopathology and clinical data) yielded a synergistic improvement (C-Index = 0.8348 with joint fusion), outperforming either modality in isolation. Conversely, any fusion including the low-performing radiology modality consistently degraded performance. For example, combining clinical and radiology resulted in a lower C-Index (0.7593 with joint fusion) than clinical data alone (0.8037). Even when radiology was added to the high-performing pair (i.e. clinical, histopathology and radiology), performance dropped relative to that pair (best C-Index = 0.8025, with marginal fusion).

These findings support our hypothesis that multimodal fusion is beneficial when including informative modalities. In contrast, the addition of weaker modalities introduces noise that undermines predictive accuracy. The results highlight the importance of unimodal evaluation prior to fusion, to guide the selective inclusion of modalities.
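One simple way to operationalise this selective inclusion is to keep only those modalities whose unimodal C-index exceeds a chosen cut-off. The sketch below uses the mean values from Table 3. The cut-off of 0.65 and the function name `select_modalities` are hypothetical choices for illustration; we did not define or tune such a threshold in this study.

```python
# Mean unimodal C-indices from Table 3.
UNIMODAL_C_INDEX = {
    "clinical": 0.8037,
    "histopathology": 0.7939,
    "radiology": 0.5584,
}

def select_modalities(scores, min_c_index):
    """Keep modalities at or above the cut-off, ordered from strongest to weakest."""
    kept = [name for name, score in scores.items() if score >= min_c_index]
    return sorted(kept, key=scores.get, reverse=True)

# 0.65 is an assumed cut-off, chosen only to illustrate the idea.
selected = select_modalities(UNIMODAL_C_INDEX, min_c_index=0.65)
print(selected)
```

With these values, clinical data and histopathology are retained and radiology is excluded, which matches the best-performing combination in Table 4.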

Modalities	Marginal Fusion	Joint Fusion
Histopathology & Clinical	0.8210 ± 0.0980	0.8348 ± 0.0967
Histopathology & Radiology	0.7501 ± 0.1166	0.7502 ± 0.1389
Clinical & Radiology	0.7567 ± 0.1119	0.7593 ± 0.1079
Clinical & Histopathology & Radiology	0.8025 ± 0.0985	0.7996 ± 0.1062

Table 4: Multimodal fusion results comparing marginal and joint strategies over repeated (10 repeats) 5-fold cross-validation experiments. We present mean C-index ± standard deviation.

4 Conclusion and Future Directions

In this study, we investigated the assumption that combining modalities inherently improves performance for computational pathology MDL architectures. We found that integrating high-performing modalities (e.g. clinical data and histopathology) enhances prediction of time-to-BCR, while incorporating low-performing modalities such as radiology degrades performance. This is likely due to the limited ability of preoperative MRI to capture microscopic disease and cellular-level aggressiveness, which histopathology reveals and which ultimately drives recurrence.

We hypothesised that multimodal gains depend on the predictive quality of individual modalities, and our results support this. Weak modalities introduce noise rather than complementary information, undermining the predictive power of stronger ones. The success of multimodal fusion is therefore contingent on the discriminative quality of the modalities selected. While our findings are promising, the small dataset size (n=95) presents a limitation. This restricts statistical power and increases the risk of overfitting, which may affect generalisability to broader patient populations. Future work should focus on curating larger, more diverse datasets and validating this framework across other clinical prediction tasks to confirm its robustness and applicability.

Despite the dataset size limitation, these findings have direct implications for designing multimodal systems in computational pathology. The assumption that ‘more data is always better’ does not hold. Instead, a preceding unimodal analysis is essential to guide the selective inclusion of informative modalities and prevent low-performing ones from introducing noise. We believe this principle to be generalisable across diverse clinical applications and other multimodal domains.

5 Compliance with Ethical Standards

This research study was conducted retrospectively using open access human subject data from the CHIMERA Challenge. Ethical approval was not required.

6 Acknowledgments

This work was supported by the Undergraduate Research Support Scheme and the Department of Computer Science at the University of Warwick.