
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu¹,²,⋆, Zhaoyang Zeng², Tianhe Ren², Feng Li²,³, Hao Zhang²,³, Jie Yang²,⁴, Qing Jiang²,⁶, Chunyuan Li⁵, Jianwei Yang⁵, Hang Su¹, Jun Zhu¹,⋆⋆, Lei Zhang²,⋆⋆
¹ Dept. of Comp. Sci. and Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys., Institute for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University
² International Digital Economy Academy (IDEA)
³ The Hong Kong University of Science and Technology
⁴ The Chinese University of Hong Kong (Shenzhen)
⁵ Microsoft Research, Redmond
⁶ South China University of Technology
[Figure 1 panels: (a) Closed-Set Object Detection: standard object detection over COCO pre-defined categories (e.g., "person", "bench"). (b) Open-Set Object Detection: zero-shot transfer to human-input novel categories (e.g., "ear, lion, bench") and referring object detection (referring expression comprehension) from human-input reference sentences (e.g., "The left lion", "The bottom man with his head up"). (c) Application: Image Editing, in collaboration with Stable Diffusion; example prompts: "All people around the world cheer with a worldcup." (modify background) and "Dog" (modify detected objects).]
Fig. 1: (a) Closed-set object detection requires models to detect objects of pre-defined
categories. (b) We evaluate models on novel objects and on standard referring expression
comprehension (REC) benchmarks to assess generalization to novel objects with
attributes. (c) We present an image editing application that combines Grounding DINO
and Stable Diffusion [41]. Best viewed in color.
Abstract.
In this paper, we develop an open-set object detector, called
Grounding DINO, by marrying the Transformer-based detector DINO with
grounded pre-training. The resulting model can detect arbitrary objects given human
inputs such as category names or referring expressions. The key to
open-set object detection is introducing language into a closed-set detector
for open-set concept generalization. To effectively fuse the language and vision
modalities, we conceptually divide a closed-set detector into three phases
and propose a tight fusion solution, which includes a feature enhancer, a
language-guided query selection, and a cross-modality decoder for cross-modality
fusion. We first pre-train Grounding DINO on large-scale datasets,
⋆ This work was done when Shilong Liu, Feng Li, Hao Zhang, Jie Yang, and Qing Jiang were interns at IDEA.
⋆⋆ Corresponding authors.
arXiv:2303.05499v5 [cs.CV] 19 Jul 2024