DetPO: In-Context Learning with Multi-Modal LLMs
for Few-Shot Object Detection

Matvei Popov2
Shruti Jain1
John Galeotti1
1Carnegie Mellon University    2Roboflow    *Equal Contribution
DetPO Teaser Figure

Detection Prompt Optimization. A frozen multi-modal LLM (MLLM) is presented with a class name, a textual description, and a few visual examples (left), similar to instructions given to a human annotator. Rather than presenting visual examples directly, we find it far more effective to use them to optimize a better prompt (right) via black-box prompt optimization; we use another MLLM to discover prompt instructions that perform better on the few-shot training dataset.

Abstract

Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like ODinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks, and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection.

Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%.


Why Does Multi-Modal ICL Hurt Detection?

We benchmark several state-of-the-art MLLMs and find that naively adding few-shot visual examples to the prompt consistently hurts detection accuracy. We posit this arises from rigid prompt structures used during post-training, making it difficult to exploit additional multi-modal contextual information during zero-shot inference.

Method               Class Names  Instructions  Images  mAP
Qwen2.5-VL (7B)      ✓            -             -       4.6
Qwen2.5-VL (7B)      ✓            ✓             -       6.2
Qwen2.5-VL (7B)      ✓            ✓             ✓       1.8
Qwen2.5-VL (72B)     ✓            -             -       7.1
Qwen2.5-VL (72B)     ✓            ✓             -       10.4
Qwen2.5-VL (72B)     ✓            ✓             ✓       10.1
Qwen3-VL (8B)        ✓            -             -       10.4
Qwen3-VL (8B)        ✓            ✓             -       11.4
Qwen3-VL (8B)        ✓            ✓             ✓       7.0
Qwen3-VL (30B-A3B)   ✓            -             -       10.7
Qwen3-VL (30B-A3B)   ✓            ✓             -       11.9
Qwen3-VL (30B-A3B)   ✓            ✓             ✓       9.8
Gemini 3 Pro         ✓            -             -       21.9
Gemini 3 Pro         ✓            ✓             -       23.0
Gemini 3 Pro         ✓            ✓             ✓       23.9

Method: Detection Prompt Optimization (DetPO)

DetPO Overview

DetPO Overview. DetPO generates initial class descriptions from ground-truth bounding boxes and iteratively refines them based on model predictions through contrastive prompt refinement. At each step, the model's true positives, false positives, and false negatives on the few-shot training set guide targeted improvements to the class description.
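The refinement loop described above can be sketched as a simple gradient-free search. This is a minimal illustration under stated assumptions, not the authors' implementation: `evaluate` and `refine` are hypothetical callables standing in for "run the frozen MLLM on the few-shot training set and summarize its errors" and "ask a second MLLM for an improved description", and the greedy keep-the-best acceptance rule is an assumption. The toy functions at the bottom only exist to make the sketch runnable without a model.

```python
from typing import Callable, List, Tuple


def optimize_prompt(
    initial_prompt: str,
    evaluate: Callable[[str], Tuple[float, str]],
    refine: Callable[[str, str], str],
    num_iters: int = 6,
) -> Tuple[str, List[float]]:
    """Greedy black-box prompt search (hypothetical sketch).

    evaluate(prompt) -> (score, error_summary): stand-in for scoring the
        frozen MLLM on the few-shot training set.
    refine(prompt, error_summary) -> new prompt: stand-in for the
        optimizer MLLM that rewrites the class description.
    """
    best_prompt = initial_prompt
    best_score, errors = evaluate(best_prompt)
    history = [best_score]
    for _ in range(num_iters):
        candidate = refine(best_prompt, errors)
        score, candidate_errors = evaluate(candidate)
        # Assumption: keep a candidate only if it improves the training score.
        if score > best_score:
            best_prompt, best_score, errors = candidate, score, candidate_errors
        history.append(best_score)
    return best_prompt, history


# Toy stand-ins so the loop can run without any model.
def toy_evaluate(prompt: str) -> Tuple[float, str]:
    keywords = ["thin", "flexible", "translucent"]
    missing = [k for k in keywords if k not in prompt]
    return 1 - len(missing) / len(keywords), ",".join(missing)


def toy_refine(prompt: str, errors: str) -> str:
    # Append the first missing keyword, if any.
    return prompt + " " + errors.split(",")[0] if errors else prompt


if __name__ == "__main__":
    best, history = optimize_prompt("Soft plastic: thin", toy_evaluate, toy_refine, num_iters=3)
    print(best, history)
```

With a real MLLM, `refine` would be stochastic, so a rejected candidate does not stall the search the way a deterministic toy function could.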

Contrastive Prompt Refinement

Traditional detectors are trained primarily on true positive examples. In contrast, we find that MLLMs achieve higher detection accuracy when given corner cases and negative examples, much as human annotators learn from examples of what not to annotate. DetPO iteratively refines each class description by identifying the worst false positive (the highest-confidence incorrect detection) and the worst false negative (the lowest-IoU missed detection), then uses both to instruct the model to explicitly exclude the incorrect examples and include the missed instances.
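One way to select these two examples for a single image and class is sketched below. The details are assumptions made for illustration: predictions are matched greedily to ground truth in order of confidence at an IoU threshold of 0.5, and "lowest-IoU missed detection" is read as the unmatched ground-truth box whose best overlap with any prediction is smallest. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def worst_errors(
    preds: List[Tuple[Box, float]], gts: List[Box], thr: float = 0.5
) -> Tuple[Optional[Tuple[Box, float]], Optional[Tuple[Box, float]]]:
    """Return (worst_fp, worst_fn).

    worst_fp: (box, confidence) of the highest-confidence unmatched prediction.
    worst_fn: (box, best_iou) of the missed ground-truth box with the lowest
        best overlap against any prediction.
    """
    matched = set()
    false_positives = []
    # Greedy matching, most confident predictions first (an assumption).
    for box, conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = None, thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            false_positives.append((box, conf))
        else:
            matched.add(best_j)
    worst_fp = max(false_positives, key=lambda p: p[1], default=None)
    missed = [
        (gt, max((iou(gt, box) for box, _ in preds), default=0.0))
        for j, gt in enumerate(gts)
        if j not in matched
    ]
    worst_fn = min(missed, key=lambda m: m[1], default=None)
    return worst_fp, worst_fn
```

The two selected boxes would then be shown to the optimizer MLLM alongside the current class description, so the refinement targets the most damaging errors first.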

Confidence Score Estimation via VQA Score

Current MLLMs often overpredict bounding boxes and lack per-box confidence scores. We find that prompting models to self-report per-box scores improves detection accuracy. Optionally, we post-process detections with VQA Score: for each predicted bounding box overlaid on the test image, we ask "Is this bounding box an instance of class {CLS}?" and use the normalized probability of the Yes token as the final confidence score. This significantly down-weights false positives and improves mAP.
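The confidence computation can be sketched as follows. This is a hedged illustration only: it assumes the MLLM API exposes log-probabilities for the first answer token, and that "normalized" means dividing the Yes probability by the sum of the Yes and No probabilities. `build_question` and `vqa_score` are hypothetical helper names.

```python
import math
from typing import Dict


def build_question(cls: str) -> str:
    """Fill the class name into the verification question quoted above."""
    return f"Is this bounding box an instance of class {cls}?"


def vqa_score(token_logprobs: Dict[str, float], yes: str = "Yes", no: str = "No") -> float:
    """Normalize P(Yes) against P(No) for the first answer token.

    token_logprobs maps candidate tokens to log-probabilities, as returned
    by whichever model API is used (assumed to be available).
    """
    p_yes = math.exp(token_logprobs.get(yes, float("-inf")))
    p_no = math.exp(token_logprobs.get(no, float("-inf")))
    total = p_yes + p_no
    # Fall back to 0.0 when neither token received probability mass.
    return p_yes / total if total > 0 else 0.0
```

For example, log-probabilities corresponding to P(Yes) = 0.6 and P(No) = 0.2 give a score of 0.75, which would replace the box's self-reported confidence before computing mAP.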

Contrastive Prompt Refinement Example

Contrastive Prompt Refinement Reduces Class Confusion. At each iteration, we use the current class description to query the MLLM on the training set, identify true positives, false positives, and false negatives, and refine the prompt to correct errors. The highlighted text shows newly added details that differentiate Serve from Attack in a volleyball dataset.


Results

Roboflow20-VL Benchmark

We evaluate on Roboflow20-VL (RF20-VL), a few-shot object detection benchmark spanning 20 diverse domains including aerial imagery, X-rays, medical imaging, and wildlife monitoring. Each dataset provides 10-shot training examples and rich annotation instructions per class.

Method                            Aerial  Document  Flora & Fauna  Industrial  Medical  Sports  Other  All
Specialist Models
GroundingDINO (C)                 28.5    5.1       33.7           12.8        0.4      5.1     16.9   16.8
LLMDet (C)                        32.3    4.4       33.6           12.6        0.7      6.7     16.7   17.2
SAM3 (C)                          32.3    15.3      17.1           13.8        2.0      14.2    17.8   16.3
MQ-GLIP (C)                       30.1    2.5       32.8           5.5         0.5      6.4     10.8   14.0
YOLO-E (C)                        10.2    1.6       16.4           8.1         0.3      7.8     10.9   9.2
Generalist Models
Qwen3-VL (30B-A3B) (C + I)        9.0     7.8       23.5           9.6         0.7      14.4    10.1   11.9
  + GEPA                          9.3     12.4      23.6           10.8        1.3      15.1    11.3   13.0
  + MIPROv2                       8.7     5.6       18.6           10.3        0.0      15.1    9.9    10.7
  + DetPO (Ours) + VQA Score      16.1    25.2      36.5           20.1        0.2      25.7    18.4   21.6
Gemini 3 Pro (C + I + V)          27.0    26.7      31.3           26.2        2.6      26.9    13.3   23.8
  + GEPA                          19.2    30.6      32.1           32.7        2.0      28.2    20.8   25.6
  + MIPROv2                       22.9    27.0      31.6           26.4        3.0      29.7    19.8   25.0
  + DetPO (Ours) + VQA Score      26.2    35.7      35.4           23.3        3.9      28.2    20.4   26.3

DetPO Transfers Across Generalist Models

Method                            Aerial  Document  Flora & Fauna  Industrial  Medical  Sports  Other  All
Qwen2.5-VL (7B) (C + I)           4.9     5.1       13.5           3.9         0.1      7.3     5.0    6.2
  + DetPO                         6.2     12.1      19.3           4.2         0.0      9.1     7.5    9.1
  + DetPO + VQA Score             9.4     17.3      23.4           7.2         0.0      12.8    8.8    11.9
Qwen2.5-VL (72B) (C + I)          6.3     10.7      19.0           7.5         0.4      14.4    9.1    10.4
  + DetPO                         11.1    23.0      26.1           12.4        0.5      14.8    14.9   15.7
  + DetPO + VQA Score             10.8    26.3      26.7           13.0        0.5      16.7    15.0   16.5
Qwen3-VL (8B) (C + I)             7.1     7.3       24.8           9.3         0.2      10.2    10.9   11.4
  + DetPO                         8.3     19.1      30.3           13.7        0.1      14.2    12.2   15.3
  + DetPO + VQA Score             12.3    24.2      32.3           13.5        0.2      17.2    14.3   17.5
Qwen3-VL (30B-A3B) (C + I)        9.0     7.8       23.5           9.6         0.7      14.4    10.1   11.9
  + DetPO                         13.8    18.6      34.6           19.7        0.1      21.8    16.4   19.4
  + DetPO + VQA Score             16.1    25.2      36.5           20.1        0.2      25.7    18.4   21.6

Instruction Type Comparison

Instruction Type Comparison. DetPO-optimized prompts improve detection accuracy in nearly all categories compared to both the baseline prompts and DetPO's initial prompts.

Iterative Accuracy Improvement

Iterative Accuracy Improvement. Most domains show strong initial gains that plateau around iteration 6, with Flora & Fauna and Aerial showing the largest overall improvements (+2.8 and +2.5, respectively).


Qualitative Results

Qualitative Detection Results

Qualitative Results. The baseline Qwen3-VL model suffers from high false positive rates (dense, overlapping boxes) and poor recall in complex environments. Our proposed method mitigates both issues, significantly reducing erroneous predictions while recovering missed objects (such as wheat heads and fish). Best viewed zoomed in.

Detection Confusion Matrix

Detection Confusion Matrix. We compare Qwen3-VL (30B-A3B), Qwen3-VL with DetPO, and with VQA Score across the Actions, Wb-Prova, and Defect Detection datasets. DetPO and VQA Score consistently resolve baseline class imbalances, improving true positive rates for underrepresented subjects (Juvenile, Piglet) and nuanced actions (Defense, Serve).


Prompt Refinement Example

We show how DetPO progressively refines class descriptions. The highlighted text indicates newly added discriminative details:

Original Prompt: Soft plastic is often transparent or semi-transparent, featuring a flexible, wrinkled appearance, and have diverse visual appearances.
DetPO Initial Prompt: 'Soft plastic': Small, flexible, and thin sheets or bags made of translucent plastic material. Characterized by a smooth, shiny surface that reflects light, often with crinkled or folded textures.
DetPO Optimized Prompt: 'Soft plastic': Thin, flexible, and translucent plastic material that appears in crumpled, or folded forms, often as sheets, bags, or loose fragments. Characterized by a smooth, shiny surface that reflects light, with crinkled or wrinkled textures. It may hold contents but lacks rigid structures, doesn't maintain fixed shape, and conforms to surfaces.

Citation

If you find our paper and code useful, please cite us:

@article{gare2026detpo,
  title={DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection},
  author={Gare, Gautam Rajendrakumar and Peri, Neehar and Popov, Matvei and Jain, Shruti and Galeotti, John and Ramanan, Deva},
  journal={arXiv preprint},
  year={2026}
}