Mixed Patch Infrared-Visible Modality Agnostic Object Detection
Heitor R. Medeiros*
David Latortue*
Eric Granger
Marco Pedersoli
WACV 2025
[GitHub]
[Paper]




Abstract

In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we tackle a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting has a lower memory footprint and is more suitable for applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when training a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. To this end, we introduce a novel training technique that Mixes Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality-agnostic module, to learn a common representation of both modalities. Our experiments show that MiPa can learn a representation that reaches competitive results on traditional RGB/IR benchmarks while requiring only a single modality during inference.
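For intuition, here is a minimal sketch of the patch-mixing idea described above, written in PyTorch. The function name `mix_patches`, the way the mixing mask is sampled, and the tensor shapes are our illustrative assumptions, not the official implementation (see the GitHub repository for that).

```python
import torch

def mix_patches(rgb_patches, ir_patches, rho):
    """Mix patch embeddings from two registered modalities of the same scene.

    rgb_patches, ir_patches: (B, N, D) patch embeddings.
    rho: fraction of patches drawn from the IR modality.
    Returns the mixed patches (B, N, D) and the boolean mask used, so the
    same mask can supervise a patch-wise modality classifier.
    """
    B, N, _ = rgb_patches.shape
    n_ir = int(round(rho * N))
    scores = torch.rand(B, N, device=rgb_patches.device)
    # Double argsort converts random scores into per-position ranks; the
    # n_ir lowest-ranked positions take the IR patch (a uniform random subset).
    ranks = scores.argsort(dim=1).argsort(dim=1)
    take_ir = ranks < n_ir                                 # (B, N) bool
    mixed = torch.where(take_ir.unsqueeze(-1), ir_patches, rgb_patches)
    return mixed, take_ir
```

In this sketch, exactly round(ρ·N) patch positions per sample are replaced by their IR counterpart, and the returned mask records which modality each patch came from.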


Try our code



Mixed Patches (MiPa) with Modality Agnostic (MA) module. In yellow is the patchify function. In purple is the MiPa module, followed by the feature extractor (encoder). In green is the modality classifier, and in pink is the detection head.
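Below is a hedged sketch of how these blocks could fit together in PyTorch, including a gradient reversal layer (GRL) in front of the patch-wise modality classifier so the encoder is pushed toward modality-invariant features. All module names (`patchify`, `encoder`, `modality_head`, `det_head`) and the specific use of a GRL are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips the gradient sign on the way
    back, so the encoder is trained to fool the modality classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def mipa_forward(rgb, ir, patchify, encoder, modality_head, det_head,
                 rho, lam=1.0):
    # Yellow block: tokenize each modality into patch embeddings.
    p_rgb, p_ir = patchify(rgb), patchify(ir)
    # Purple block: mix patches from the two modalities (see sketch above).
    mixed, ir_mask = mix_patches(p_rgb, p_ir, rho)
    # Shared feature extractor (encoder).
    feats = encoder(mixed)
    # Green block: patch-wise modality classifier behind the GRL.
    modality_logits = modality_head(GradReverse.apply(feats, lam))
    # Pink block: detection head on the shared features.
    return det_head(feats), modality_logits, ir_mask
```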


Paper and Supplementary Material

Heitor R. Medeiros, David Latortue, Eric Granger, Marco Pedersoli

Mixed Patch Infrared-Visible Modality Agnostic Object Detection.
In WACV, 2025.

(hosted on WACV2025)

[Bibtex]

Experiments and Results

- Towards the optimal ratio (ρ).



Table 2: Comparison of different ratio (ρ) sampling methods on LLVIP, using DINO with a SWIN backbone.
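As an illustration of what different ρ sampling methods could look like, here is a small sketch; the strategy names (`fixed`, `uniform`, `curriculum`) are ours and do not necessarily match the variants compared in Table 2.

```python
import random

def sample_rho(strategy="uniform", step=None, total_steps=None):
    """Illustrative per-batch schedules for the IR patch ratio rho."""
    if strategy == "fixed":        # constant 50/50 mix every batch
        return 0.5
    if strategy == "uniform":      # fresh ratio drawn uniformly each batch
        return random.random()
    if strategy == "curriculum":   # anneal from RGB-only toward IR-heavy mixes
        return min(1.0, step / total_steps)
    raise ValueError(f"unknown strategy: {strategy}")
```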

- Patch-wise Modality Agnostic Training.



Table 3: Comparison of the detection performance of different baselines and MiPa for DINO and Deformable DETR with a SWIN backbone. The evaluation is done for RGB, IR, and the average of the two modalities.

- Ablation on MA.



Table 4: Ablation of MiPa on γ and comparison with different baselines for DINO with a SWIN backbone. The evaluation is done for RGB, IR, and the average of the two modalities in terms of AP50.
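For reference, a minimal sketch of how γ could weight the modality-agnostic term against the detection loss; the function and argument names are illustrative assumptions, reusing the mask and logits from the sketches above.

```python
import torch.nn.functional as F

def mipa_loss(det_loss, modality_logits, ir_mask, gamma):
    """Total objective: detection loss plus a gamma-weighted patch-wise
    modality classification loss (made adversarial by the GRL upstream)."""
    ma_loss = F.binary_cross_entropy_with_logits(
        modality_logits.squeeze(-1),  # (B, N) per-patch logits
        ir_mask.float(),              # (B, N) 1 = IR patch, 0 = RGB patch
    )
    return det_loss + gamma * ma_loss
```

With γ = 0 this reduces to plain training on mixed patches; larger γ puts more pressure on the encoder to hide which modality each patch came from.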

- Comparison with different RGB/IR Competitors.



Table 5: Comparison with different multimodal works on RGB/IR benchmarks.

- Qualitative results.



Figure 3: Detections of different methods at two times of day (day and night) and for two modalities (RGB and IR). Detectors trained on RGB work better in the daytime; detectors trained on IR work better at nighttime. Detectors naively trained on both modalities only work well on the dominant modality. Our MiPa manages to work well in all conditions.






Acknowledgements

This work was supported in part by Distech Controls Inc., the Natural Sciences and Engineering Research Council of Canada, the Digital Research Alliance of Canada, and MITACS.