Visual Modality Prompt for Adapting Vision-Language Object Detectors
Heitor R. Medeiros
Atif Belal
Srikanth Muralidharan
Eric Granger
Marco Pedersoli
Under Review
[GitHub]
[Paper]



Strategies to adapt object detectors to new modalities: (a) Full Fine-tuning: Both the backbone (the part of the model responsible for feature extraction) and the head (responsible for the final output, i.e., the detections) are updated with new training data. (b) Head Fine-tuning: Only the head is fine-tuned while the backbone remains frozen. (c) Visual Prompt: A visual prompt is added to the input. The backbone and head remain unchanged, but the visual prompt guides the model to better interpret the new modality. (d) Our Modality Prompt: As with a visual prompt, a prompt is added to the input image. The main difference is that here the prompt is not static; it is a transformation of the input image.
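To make the contrast concrete, the sketch below shows (c) a static visual prompt versus (d) an input-conditioned modality prompt in PyTorch. It is a minimal illustration under our own assumptions: the module names, the prompt resolution, and the tiny encoder-decoder are placeholders, not the exact architecture used in ModPrompt.

```python
# Minimal sketch: (c) a static visual prompt vs. (d) a modality prompt that is
# a transformation of the input image. Architecture details are illustrative.
import torch
import torch.nn as nn


class StaticVisualPrompt(nn.Module):
    """(c) A fixed learnable pattern added to every image."""

    def __init__(self, channels: int = 3, size: int = 640):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, channels, size, size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.prompt  # same prompt regardless of the input


class ModalityPrompt(nn.Module):
    """(d) An input-conditioned prompt computed from the image itself."""

    def __init__(self, channels: int = 3, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, stride=2, padding=1), nn.ReLU()
        )
        self.decoder = nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prompt = self.decoder(self.encoder(x))  # prompt depends on the input
        return x + prompt  # detector backbone and head stay frozen


# Only the prompt module is trained; the frozen detector sees the prompted image.
image = torch.randn(1, 3, 640, 640)
prompted = ModalityPrompt()(image)
```

In both cases the detector's backbone and head are left untouched; only the prompt parameters receive gradients during adaptation.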



Abstract

The zero-shot (ZS) performance of object detectors (ODs) degrades when tested on different modalities, such as infrared (IR) and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language (VL) detectors, such as YOLO-World and Grounding DINO, have shown promising ZS capabilities; however, they have not yet been adapted to other visual modalities. Traditional fine-tuning (FT) approaches tend to compromise the ZS capabilities of ODs. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt VL detectors to new modalities without degrading ZS performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of inference-friendly task residuals, facilitating more robust adaptation. Empirically, we benchmark our method for modality adaptation on two VL detectors, YOLO-World and Grounding DINO, and on challenging IR (LLVIP, FLIR) and depth (NYUv2) datasets, achieving performance comparable to FT while preserving the model's ZS capability. Our code is available at: https://github.com/heitorrapela/ModPrompt


Try our code



Our proposed strategy for text-prompt tuning: an inference-friendly and knowledge-preserving method. An offline embedding is generated for each object category; then, trainable residual parameters and the ModPrompt are integrated into the detector to adapt it to new modalities.
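As a rough illustration of the task-residual idea in the figure above, here is a minimal PyTorch sketch. It assumes the class text embeddings were already computed offline by the detector's text encoder and cached; the embedding dimension, class list, and module interface are hypothetical, not taken from the released code.

```python
# Minimal sketch of inference-friendly text-embedding adaptation: frozen offline
# class embeddings plus small trainable residuals. Shapes are illustrative.
import torch
import torch.nn as nn


class TextEmbeddingResidual(nn.Module):
    def __init__(self, class_embeddings: torch.Tensor):
        super().__init__()
        # Offline embeddings (num_classes, dim), computed once by the text
        # encoder and kept frozen so the original knowledge is preserved.
        self.register_buffer("base", class_embeddings)
        # Small trainable residuals adapt the embeddings to the new modality.
        self.residual = nn.Parameter(torch.zeros_like(class_embeddings))

    def forward(self) -> torch.Tensor:
        # The detector head consumes base + residual; at inference the sum can
        # be precomputed, so no text encoder forward pass is needed.
        return self.base + self.residual


# Hypothetical cached embeddings for two categories, e.g. ["person", "car"].
cached = torch.randn(2, 512)
adapted_text_embeddings = TextEmbeddingResidual(cached)()
```

Because only the residual tensor is trained, the text encoder's weights and its zero-shot knowledge remain intact.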


Paper and Supplementary Material

Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli.

ModPrompt: Visual Modality Prompt for Adapting Vision-Language Object Detectors.
In arXiv, 2024.

(hosted on arXiv, 2024)

[Bibtex]

Experiments and Results

- Visual Modality Adaptation.



Table 1. Detection performance (APs) for YOLO-World and Grounding DINO on the three main datasets evaluated: LLVIP-IR, FLIR-IR, and NYUv2-Depth. The different visual prompt adaptation techniques are compared with our ModPrompt; zero-shot (ZS), head fine-tuning (HFT), and full fine-tuning (FT) results are also reported, with full fine-tuning serving as the upper bound.

- Ablation of Visual Prompts.



Table 2. Detection performance (APs) for YOLO-World on the three main datasets evaluated: LLVIP-IR, FLIR-IR, and NYUv2-Depth. We compare the main visual prompt strategies: fixed, random, padding, and ModPrompt. The variations cover the number of prompt pixels (ps = 30, 200, or 300) and, for ModPrompt, the backbone: MobileNet (MB) or ResNet (RES).

- Comparison with Modality Translators for OD.



Table 3. Detection performance of different modality translators for OD in terms of APs.

- Inference-friendly Text-embedding Adaptation with Knowledge Preservation.



Table 4. Detection performance (APs) for YOLO-World and Grounding DINO on the FLIR-IR and NYUv2-Depth datasets. Each visual prompt adaptation strategy is compared with the learnable task residuals (results in parentheses are the difference relative to not using the task residuals), which update the new task embeddings without changing the text encoder's knowledge.

- Qualitative Results.



Figure 4. Detections from YOLO-World for the different approaches: the first two rows show LLVIP (infrared), and the last two rows show NYUv2 (depth). Each column corresponds to a different approach: (a) GT (Ground Truth): Shows the ground-truth bounding boxes in yellow. (b) Zero-Shot: Displays detections (in red) from the zero-shot model. (c) Visual Prompt: Illustrates detections with a visual prompt added to the image. (d) ModPrompt (Ours): Detections from our proposed model.






Acknowledgements

This work was supported in part by Distech Controls Inc., the Natural Sciences and Engineering Research Council of Canada, the Digital Research Alliance of Canada, and MITACS.