Source code and pre-trained models for Grounding DINO, its successor Grounding DINO 1.5, SAM, and SAM 2, for readers who cannot access GitHub

7 files in total: 4 zip, 2 pdf, 1 pth

Rating: 1 star. Points required: 5. Views: 9. Uploaded: 2024-08-13 22:49:10. Size: 851.96 MB (ZIP).


Grounding DINO 1.5 SAM SAM2 paper.zip (7 files)

groundingdino_swint_ogc.pth 661.85MB

2401.14159v1.pdf 3.75MB

2303.05499v5.pdf 4.13MB

Grounded-Segment-Anything-main.zip 79.03MB

Grounded-SAM-2-main.zip 102.75MB

Grounding-DINO-1.5-API-master.zip 38.97MB

GroundingDINO-main.zip 10.66MB

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu (1, 2, ⋆), Zhaoyang Zeng, Tianhe Ren, Feng Li (2, 3), Hao Zhang (2, 3), Jie Yang (2, 4), Qing Jiang (2, 6), Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu (1, ⋆⋆), Lei Zhang (2, ⋆⋆)

1. Dept. of Comp. Sci. and Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys., Institute for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University

2. International Digital Economy Academy (IDEA)

3. The Hong Kong University of Science and Technology

4. The Chinese University of Hong Kong (Shenzhen)

5. Microsoft Research, Redmond

6. South China University of Technology


[Figure 1. Panels: (a) Closed-Set Object Detection: standard object detection over COCO pre-defined categories. (b) Open-Set Object Detection: zero-shot transfer to human-input novel categories (e.g. "ear, lion, bench") and referring object detection (Referring Expression Comprehension) from human-input reference sentences (e.g. "The left lion", "The bottom man with his head up"). (c) Collaboration with Stable Diffusion: editing the background or the detected objects from a text prompt.]

Fig. 1: (a) Closed-set object detection requires models to detect objects of pre-defined categories. (b) We evaluate models on novel objects and standard Referring expression comprehension (REC) benchmarks for model generalizations on novel objects with attributes. (c) We present an image editing application by combining Grounding DINO and Stable Diffusion [41]. Best viewed in colors.

Abstract. In this paper, we develop an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for modality fusion. We first pre-train Grounding DINO on large-scale datasets, including object detection data, grounding data, and caption data, and evaluate the model on both open-set object detection and referring object detection benchmarks. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO zero-shot detection benchmark. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. We release some checkpoints and inference codes at https://github.com/IDEA-Research/GroundingDINO.

Keywords: Object Detection · Image Grounding · Multi-modal learning

⋆ This work was done when Shilong Liu, Feng Li, Hao Zhang, Jie Yang, and Qing Jiang were interns at IDEA.

⋆⋆ Corresponding authors.

arXiv:2303.05499v5 [cs.CV] 19 Jul 2024

1 Introduction

A key indicator of an Artificial General Intelligence (AGI) system's capability is its proficiency in handling open-world scenarios. In this paper, we aim to develop a strong system to detect arbitrary objects specified by human language inputs, a task commonly referred to as open-set object detection. The task has wide applications for its great potential as a generic object detector. For example, we can cooperate with generative models for image editing (as shown in Fig. 1 (b)).

In pursuit of this goal, we design the strong open-set object detector Grounding DINO by following two principles: tight modality fusion based on DINO [57] and large-scale grounded pre-training for concept generalization.

Tight modality fusion based on DINO. The key to open-set detection is introducing language for unseen object generalization [1, 7, 25]. Most existing open-set detectors are developed by extending closed-set detectors to open-set scenarios with language information. As shown in Fig. 2, a closed-set detector typically has three important modules: a backbone for feature extraction, a neck for feature enhancement, and a head for region refinement (or box prediction). A closed-set detector can be generalized to detect novel objects by learning language-aware region embeddings so that each region can be classified into novel categories in a language-aware semantic space. The key to achieving this goal is using a contrastive loss between region outputs and language features at the neck and/or head outputs.
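The idea of classifying regions in a language-aware space can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the embeddings are hand-written 3-dimensional vectors, the function names `classify_regions` and `contrastive_loss` are hypothetical, and a plain softmax cross-entropy over cosine similarities stands in for whatever contrastive loss a real detector uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_regions(region_embs, text_embs, names):
    """Label each region with the category whose text embedding is most similar."""
    labels = []
    for region in region_embs:
        scores = [cosine(region, text) for text in text_embs]
        labels.append(names[scores.index(max(scores))])
    return labels


def contrastive_loss(region_embs, text_embs, targets, temperature=0.1):
    """Mean cross-entropy over region-to-text similarity logits (assumed form)."""
    total = 0.0
    for region, target in zip(region_embs, targets):
        logits = [cosine(region, text) / temperature for text in text_embs]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[target]
    return total / len(region_embs)


# Toy data (assumption): one text embedding per category name.
names = ["cat", "table"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

print(classify_regions(region_embs, text_embs, names))
print(contrastive_loss(region_embs, text_embs, targets=[0, 1]))
```

Because the category set enters only through `text_embs`, adding a new category means adding one more text embedding rather than retraining a fixed classifier layer, which is what makes the space "language-aware".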

[Figure 2. Diagram labels. Closed-set detector: Backbone (ResNet, Swin, …), image features, Neck (DyHead, Encoder, …), refined image features, query init, Head (ROIHead, Decoder, …), output regions. Open-set detector additions: a text encoder producing text features, Feature Fusion A, Feature Fusion B, Feature Fusion C, Contrastive Loss A, and Contrastive Loss B.]

Fig. 2: Extending closed-set detectors to open-set scenarios.

To help a model align cross-modality information, some work tried to fuse features before the final loss stage. We summarize the modularized design of object detectors in Fig. 2. Feature fusion can be performed in three phases: neck (phase A), query initialization (phase B), and head (phase C). For example, GLIP [25] performs early fusion in the neck module (phase A), and OV-DETR [55] uses language-aware queries as head inputs (phase B). We argue that introducing more feature fusion into the pipeline can facilitate better alignment between different modality features, thereby achieving better performance.

Footnote: In this paper, 'zero-shot' refers to scenarios where the training split of the test dataset is not utilized in the training process.

Footnote: We view the terms open-set object detection, open-world object detection, and open-vocabulary object detection as the same task in this paper. To avoid confusion, we always use open-set object detection in our paper.

Although conceptually simple, it is hard for previous work to perform feature fusion in all three phases. The design of classical detectors like Faster RCNN makes it hard to interact with language information in most blocks. Unlike classical detectors, Transformer-based detectors such as DINO have a structure consistent with language blocks. The layer-by-layer design enables them to interact with language information easily. Under this principle, we design three feature fusion approaches in the neck, query initialization, and head phases. More specifically, we design a feature enhancer by stacking self-attention, text-to-image cross-attention, and image-to-text cross-attention as the neck module. We then develop a language-guided query selection method to initialize queries for the detection head. We also design a cross-modality decoder for the head phase with image and text cross-attention layers to boost query representations.
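The text above names language-guided query selection without spelling out the rule. As a rough sketch, assume the selection scores each image token by its highest dot-product similarity to any text token and keeps the top-scoring tokens as query positions. The function name `select_queries` and the toy features are hypothetical, and a real model would operate on batched tensors rather than Python lists.

```python
def select_queries(image_feats, text_feats, num_queries):
    """Return indices of the image tokens most similar to any text token.

    Assumed rule: score(image token) = max over text tokens of their dot product.
    """
    scores = []
    for img in image_feats:
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        scores.append(max(sims))
    # Rank token indices from highest to lowest score and keep the first few.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy features (assumption): four image tokens and two text tokens in 2-D.
image_feats = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.7], [0.3, 0.3]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]

print(select_queries(image_feats, text_feats, num_queries=2))
```

Under this assumption the queries start at image locations that already respond to the text prompt, instead of starting from positions unrelated to the text.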

Large-scale grounded pre-training for zero-shot transfer. Most existing open-set models [14, 21] rely on pre-trained CLIP models for concept generalization. Nevertheless, the efficacy of CLIP, which is pre-trained on image-text pairs, is limited for region-text pair detection tasks, as identified in the RegionCLIP study [61]. In contrast, GLIP [25] presents a different way by reformulating object detection as a phrase grounding task and introducing contrastive training between object regions and language phrases on large-scale data. It shows great flexibility for heterogeneous datasets and remarkable performance on closed-set and open-set detection.

We have adopted and refined the grounded training methodology. GLIP's approach involves concatenating all categories into a sentence in a random order. However, directly concatenating category names does not consider the potential influence of unrelated categories on each other when extracting features. To mitigate this issue and improve model performance during grounded training, we introduce a technique that utilizes sub-sentence level text features. It removes the attention between unrelated categories during word feature extraction. Further elaboration on this technique can be found in Section 3.4.
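A sub-sentence attention mask of this kind can be sketched in a few lines. The sketch below makes several assumptions: categories are joined with " . " as in the prompt "cat . person . mouse .", tokens come from whitespace splitting rather than a real subword tokenizer, and each separator is grouped with the category before it. The helper names `build_prompt` and `sub_sentence_mask` are hypothetical.

```python
def build_prompt(categories):
    """Concatenate category names into one text prompt, e.g. 'cat . person .'."""
    return " . ".join(categories) + " ."


def sub_sentence_mask(tokens, separator="."):
    """Build a mask where mask[i][j] is True if token i may attend to token j.

    Tokens may attend only within their own sub-sentence (one category plus
    its trailing separator), so unrelated categories do not interact.
    """
    group_ids = []
    current = 0
    for token in tokens:
        group_ids.append(current)
        if token == separator:
            current += 1  # the next token starts a new sub-sentence
    size = len(tokens)
    return [[group_ids[i] == group_ids[j] for j in range(size)] for i in range(size)]


prompt = build_prompt(["cat", "person", "mouse"])
tokens = prompt.split()  # whitespace tokenization is a simplification
mask = sub_sentence_mask(tokens)

print(prompt)
for token, row in zip(tokens, mask):
    print(f"{token:>7}", "".join("1" if allowed else "0" for allowed in row))
```

The resulting mask is block-diagonal: "cat" can attend to itself and its separator but not to "person" or "mouse". A sentence-level representation would instead allow every token to attend to every other token.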

We pre-train Grounding DINO on a large-scale dataset and evaluate the performance on mainstream object detection benchmarks like COCO [29]. While some studies have examined open-set detection models under a "partial label" framework (training on a subset of data, e.g., base categories, and testing on additional categories), we advocate for a fully zero-shot approach to enhance practical applicability. Moreover, we extend the model to another important scenario, Referring Expression Comprehension (REC) [30, 34], where objects are described with attributes.

Footnote: We use the terms Referring Expression Comprehension (REC) and Referring (Object) Detection interchangeably in this paper.

We conduct experiments on all three settings, including closed-set detection, open-set detection, and referring object detection, as shown in Fig. 1, to comprehensively evaluate open-set detection performance. Grounding DINO outperforms competitors by a large margin. For example, Grounding DINO reaches a 52.5 AP on COCO minival without any COCO training data. It also establishes a new state of the art on the ODinW [23] zero-shot benchmark with a 26.1 mean AP.

Model

Model Design Text Prompt Closed-Set Settings Zero-Shot Transfer Referring Detection

Base Detector Fusion (Fig. 2) CLIP Represent. Level (Sec. 3.4) COCO COCO LVIS ODinW RefCOCO/+/g

ViLD [14] Mask R-CNN - ✓ sentence ✓ partial label partial label

RegionCLIP [61] Faster RCNN - ✓ sentence ✓ partial label partial label

FindIt [21] Faster RCNN A sentence ✓ partial label ﬁne-tune

MDETR [18] DETR A,C word ﬁne-tune zero-shot ﬁne-tune

DQ-DETR [45] DETR A,C word ✓ zero-shot ﬁne-tune

GLIP [25] DyHead A word ✓ zero-shot zero-shot zero-shot

GLIPv2 [58] DyHead A word ✓ zero-shot zero-shot zero-shot

OV-DETR [55] Deformable DETR B ✓ sentence ✓ partial label partial label

OWL-ViT [35] - - ✓ sentence ✓ partial label partial label zero-shot

DetCLIP [52] ATSS - ✓ sentence zero-shot zero-shot

OmDet [60] Sparse R-CNN C ✓ sentence ✓ zero-shot

Grounding DINO (Ours) DINO A,B,C sub-sentence ✓ zero-shot zero-shot zero-shot zero-shot

Table 1: A comparison of previous open-set object detectors. Our summarization is based on the experiments in their papers, not on the ability to extend their models to other tasks. It is worth noting that some related works may not (only) be designed for open-set object detection initially, like MDETR [18] and GLIPv2 [58], but we list them here for a comprehensive comparison with existing work. We use the term "partial label" for the settings where models are trained on partial data (e.g. base categories) and evaluated on other cases. [56]

2 Related Work

Detection Transformers. Grounding DINO is built upon the DETR-like model DINO [57], which is an end-to-end Transformer-based detector. DETR was first proposed in [2] and has since been improved in many directions [4, 5, 13, 17, 33, 48, 64] in the past few years. DAB-DETR [31] introduces anchor boxes as DETR queries for more accurate box prediction. DN-DETR [24] proposes a query-denoising approach to stabilize the bipartite matching. DINO [57] further develops several techniques including contrastive de-noising and sets a new record on the COCO object detection benchmark. However, such detectors mainly focus on closed-set detection and are difficult to generalize to novel classes because of the limited pre-defined categories.

Open-Set Object Detection. Open-set object detection is trained using existing bounding box annotations and aims at detecting arbitrary classes with the help of language generalization. OV-DETR [56] uses image and text embeddings encoded by a CLIP model as queries to decode the category-specified boxes in the DETR framework [2]. ViLD [14] distills knowledge from a CLIP teacher model into an R-CNN-like detector so that the learned region embeddings contain the semantics of language. GLIP [12] formulates object detection as a grounding problem and leverages additional grounding data to help learn aligned semantics at phrase and region levels. It shows that such a formulation can even achieve stronger performance on fully-supervised detection benchmarks. DetCLIP [52] involves large-scale image captioning datasets and uses the generated pseudo labels to expand the knowledge database. The generated pseudo labels effectively help extend the generalization ability.

[Figure 3. Diagram labels. Block 1, Model Overall: input text (e.g. "cat . person . mouse ." or "A cat sets on a table ."), text backbone, input image, image backbone, vanilla text features, vanilla image features, feature enhancer, language-guided query selection, cross-modality queries, cross-modality decoder, model outputs, contrastive loss, localization loss. Block 2, A Feature Enhancer Layer: self-attention (text features), deformable self-attention (image features), image-to-text cross-attention, text-to-image cross-attention, FFN, updated image features, updated text features. Block 3, A Decoder Layer: cross-modality query, self-attention, image cross-attention, text cross-attention, FFN, updated cross-modality query.]

Fig. 3: The framework of Grounding DINO. We present the overall framework, a feature enhancer layer, and a decoder layer in block 1, block 2, and block 3, respectively.

However, previous works only fuse multi-modal information in partial phases, which may lead to sub-optimal language generalization ability. For example, GLIP only considers fusion in the feature enhancement (phase A) and OV-DETR only injects language information at the decoder inputs (phase B). Moreover, the REC task is normally overlooked in evaluation, which is an important scenario for open-set detection. We compare our model with other open-set methods in Table 1.

3 Grounding DINO

Grounding DINO outputs multiple pairs of object boxes and noun phrases for a given (Image, Text) pair. For example, as shown in Fig. 3, the model locates a cat and a table from the input image and extracts the words cat and table from the input text as corresponding labels. Both object detection and REC tasks can be aligned with the pipeline. Following GLIP [25], we concatenate all category names as input texts for object detection tasks. REC requires a bounding box for


Comment from 饮筝, 2024-10-28: What Grounding DINO 1.5 source code? Is this really source code? Isn't this just the API code? #TitleDoesNotMatchContent
