Foundation Models#

Foundation models are large-scale pretrained models that serve as the backbone for a wide range of computer vision and multi-modal applications. These models are typically trained with self-supervised or semi-supervised algorithms over large-scale datasets. The main goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.

Early examples of foundation models were pretrained language models (LMs), including Google’s BERT and the early GPT (Generative Pre-Trained Transformer) models, most notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be specialized into task- and domain-specific models using targeted datasets of various kinds, for example, datasets of medical codes.

Having started primarily as models for text- and language-based applications, foundation models have since evolved to support computer vision and multi-modal applications, such as DALL-E and Flamingo.

TAO v6.0.0 introduces the ability to:

  • Pretrain or adapt them to your domain given a corpus of unstructured data, or

  • Fine-tune and use them for downstream computer vision tasks, such as:

    • Image classification

    • Object detection

    • Semantic segmentation

    • Change detection

The NVIDIA-trained foundation models supported for domain adaptation and downstream finetuning in TAO are described below.

These models simplify the use of images in downstream systems by producing all-purpose visual features: features that work across image distributions and tasks without fine-tuning.

Trained on large curated datasets, NVIDIA’s models have learned robust, fine-grained representations that are useful for both localization and classification tasks.

They can be used as foundation models for a variety of downstream tasks with only a few labeled examples. For more details on the method, see DINOv2.
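
As a concrete illustration of this linear-probe workflow, the sketch below extracts frozen features and fits a simple linear classifier on a handful of labeled examples. It uses the publicly released DINOv2 ViT-B/14 weights from torch.hub purely as a stand-in for the NVIDIA-trained backbones (which are delivered through TAO and NGC); the random data, class count, and the scikit-learn probe are illustrative assumptions, not part of the TAO workflow.

```python
import torch
from sklearn.linear_model import LogisticRegression

# Public DINOv2 ViT-B/14 weights as a stand-in for NV-DINOv2 (illustrative only).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: [N, 3, 224, 224], normalized. Returns [N, 768] global (CLS) features."""
    return backbone(images)

# Toy few-shot setup: a handful of labeled examples across 4 hypothetical classes.
images = torch.randn(16, 3, 224, 224)        # stand-in for real, preprocessed images
labels = torch.randint(0, 4, (16,)).numpy()

features = extract_features(images).numpy()

# A plain linear classifier on top of the frozen, all-purpose features.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```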

Image Classification with Foundational Model#

TAO supports finetuning of the following foundational vision encoders from NVIDIA for image classification:

| Foundation Model | Classification Head |
|------------------|---------------------|
| NvDINOv2         | Linear              |
| RADIOv2.5        | Linear              |
| C-RADIOv2        | Linear              |

The foundational models provide rich visual representations that can be effectively leveraged for classification tasks through a simple linear head. These models, pretrained on large-scale datasets, can be fine-tuned with minimal labeled data to achieve strong performance on specific classification tasks.
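
Conceptually, the resulting classification network is just the pretrained encoder followed by one linear layer. The sketch below shows that wiring in plain PyTorch, again using the public DINOv2 ViT-B/14 weights as a stand-in for the TAO-delivered backbones; the class count, learning rates, and random batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10   # hypothetical number of classes
EMBED_DIM = 768    # feature width of the ViT-B/14 stand-in backbone

# Public DINOv2 weights as a stand-in for the TAO-delivered encoders.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
head = nn.Linear(EMBED_DIM, NUM_CLASSES)
model = nn.Sequential(backbone, head)

# Typical fine-tuning setup: small learning rate on the pretrained encoder,
# larger learning rate on the freshly initialized linear head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ]
)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random tensors standing in for a real batch.
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, NUM_CLASSES, (8,))

loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("loss:", float(loss))
```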

To learn more about using a foundational model as a backbone for an image classification task, refer to the section on Image Classification PyTorch.

TAO also supports finetuning of the vision encoders from multi-modal foundation models for image classification. The supported multi-modal foundation models include:

  • NVIDIA CLIP Image Backbones:

| Foundation Model | Classification Head |
|------------------|---------------------|
| NVCLIP           | Linear              |

  • OpenAI CLIP Image Backbones:

| Arch     | Pretrained Dataset | in_channels |
|----------|--------------------|-------------|
| ViT-B-32 | laion400m_e31, laion400m_e32, laion2b_e16, laion2b_s34b_b79k, datacomp_m_s128m_b4k, openai | 512 |
| ViT-B-16 | laion400m_e31      | 512  |
| ViT-L-14 | laion400m_e31      | 768  |
| ViT-H-14 | laion2b_s32b_b79k  | 1024 |
| ViT-g-14 | laion2b_s12b_b42k  | 1024 |

  • EVA-CLIP Image Backbones:

| Arch            | Pretrained Dataset | in_channels |
|-----------------|--------------------|-------------|
| EVA02-L-14      | merged2b_s4b_b131k | 768  |
| EVA02-L-14-336  | laion400m_e31      | 768  |
| EVA02-E-14      | laion400m_e31      | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k  | 1024 |
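
The Arch and Pretrained Dataset columns in the two tables above correspond to the model and pretrained tags used by the open_clip_torch library, and in_channels is the width of the image embedding each encoder produces. The following is a minimal sketch of verifying this outside of TAO; it assumes open_clip_torch is installed (the EVA02 tags additionally require timm and a reasonably recent open_clip release).

```python
import torch
import open_clip

# (arch, pretrained) pairs taken from the tables above.
configs = [
    ("ViT-B-32", "laion2b_s34b_b79k"),     # expected 512-dim image embedding
    ("EVA02-L-14", "merged2b_s4b_b131k"),  # expected 768-dim image embedding
]

for arch, pretrained in configs:
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=pretrained)
    model.eval()
    with torch.no_grad():
        dummy = torch.randn(1, 3, 224, 224)  # `preprocess` would be applied to real PIL images
        feats = model.encode_image(dummy)
    print(arch, pretrained, "->", feats.shape[-1])  # matches the in_channels column
```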

Object Detection with Foundational Model#

TAO supports finetuning of the following foundational models for object detection:

| Foundation Model | Detection Architecture |
|------------------|------------------------|
| NvDINOv2         | DINO                   |
| RADIOv2.5        | RT-DETR                |
| C-RADIOv2        | RT-DETR                |

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter and Frozen-DETR style architectures with DINO and RT-DETR, respectively. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
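
ViT-Adapter and Frozen-DETR are considerably more involved than this, but the problem they address is easy to see: a plain ViT emits a single-scale grid of patch tokens, whereas detection heads expect multi-scale 2D feature maps. The sketch below builds a simple feature pyramid from single-scale tokens, in the spirit of ViTDet's simple feature pyramid; it is an illustrative stand-in, not the actual ViT-Adapter or Frozen-DETR modules, and all shapes and channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Turn single-scale ViT patch tokens [B, N, C] into multi-scale 2D maps."""

    def __init__(self, embed_dim: int = 768, out_dim: int = 256, patch: int = 16):
        super().__init__()
        self.patch = patch
        # One 1x1 projection per output scale (strides 8, 16, and 32).
        self.proj = nn.ModuleDict({
            "p8": nn.Conv2d(embed_dim, out_dim, 1),
            "p16": nn.Conv2d(embed_dim, out_dim, 1),
            "p32": nn.Conv2d(embed_dim, out_dim, 1),
        })

    def forward(self, tokens: torch.Tensor, image_hw: tuple) -> dict:
        h, w = image_hw[0] // self.patch, image_hw[1] // self.patch
        b, n, c = tokens.shape
        assert n == h * w, "expects patch tokens only (no CLS token)"
        fmap = tokens.transpose(1, 2).reshape(b, c, h, w)  # single-scale map at stride 16
        return {
            "p8": self.proj["p8"](F.interpolate(fmap, scale_factor=2, mode="bilinear")),
            "p16": self.proj["p16"](fmap),
            "p32": self.proj["p32"](F.max_pool2d(fmap, kernel_size=2)),
        }

# Fake ViT output: 14x14 patch tokens for a 224x224 image with 16x16 patches.
tokens = torch.randn(2, 14 * 14, 768)
pyramid = SimpleFeaturePyramid()(tokens, (224, 224))
for name, feat in pyramid.items():
    print(name, tuple(feat.shape))
```

ViT-Adapter instead injects convolution-based spatial priors into the ViT through cross-attention, and Frozen-DETR keeps the foundation encoder frozen while the detector consumes its features; the spec file referenced below shows the configurations TAO actually uses.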

To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Spec File for ViT Backbones.

Semantic Segmentation with Foundational Model#

TAO supports finetuning of the following foundational models for semantic segmentation:

| Foundation Model | Segmentation Architecture |
|------------------|---------------------------|
| NvDINOv2         | SegFormer                 |
| RADIOv2.5        | SegFormer                 |
| C-RADIOv2        | SegFormer                 |

These foundational models, pretrained on large-scale datasets, provide rich visual representations that can be effectively leveraged for dense prediction tasks through the SegFormer architecture.
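
In TAO, these backbones feed the SegFormer decoder. As a simplified illustration of why patch-level features transfer well to dense prediction, the sketch below runs a per-patch linear segmentation probe on frozen DINOv2 patch tokens and upsamples the result to pixel resolution; the public torch.hub weights, class count, and the linear probe itself are stand-ins, not the SegFormer head that TAO trains.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 21   # hypothetical number of segmentation classes
PATCH = 14         # DINOv2 ViT-B/14 patch size
EMBED_DIM = 768

# Public DINOv2 weights as a stand-in for the TAO-delivered encoders.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

head = nn.Conv2d(EMBED_DIM, NUM_CLASSES, kernel_size=1)  # per-patch linear classifier

images = torch.randn(2, 3, 224, 224)   # stand-in for real, preprocessed images
with torch.no_grad():
    tokens = backbone.forward_features(images)["x_norm_patchtokens"]  # [B, N, C]

b, n, c = tokens.shape
h = w = images.shape[-1] // PATCH                   # 16 x 16 patch grid
fmap = tokens.transpose(1, 2).reshape(b, c, h, w)   # [B, C, 16, 16]

logits = head(fmap)                                                    # [B, NUM_CLASSES, 16, 16]
masks = F.interpolate(logits, size=images.shape[-2:], mode="bilinear")
print(masks.shape)                                                     # [2, 21, 224, 224]
```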

To learn more about using a foundational model as a backbone for a semantic segmentation task, refer to Example Spec File for ViT Backbones.

Change Detection with Foundational Model#

TAO supports finetuning of the following foundational models for Visual ChangeNet, covering both classification and segmentation:

| Foundation Model | Change Detection Architecture |
|------------------|-------------------------------|
| NvDINOv2         | Visual ChangeNet              |
| C-RADIOv2        | Visual ChangeNet              |
| RADIOv2.5        | Visual ChangeNet              |

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
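
Visual ChangeNet itself is a transformer architecture with a dedicated change decoder, but the basic pattern of change detection on top of a shared foundation backbone is simple: encode both images with the same encoder, compare the features, and predict change from their difference. The sketch below is a minimal, illustrative version of that pattern only; the toy encoder and class count are assumptions, not the Visual ChangeNet architecture.

```python
import torch
import torch.nn as nn

class SiameseChangeClassifier(nn.Module):
    """Toy change classifier: shared encoder, feature difference, linear head."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                    # shared weights for both time steps
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, image_t0: torch.Tensor, image_t1: torch.Tensor) -> torch.Tensor:
        f0 = self.encoder(image_t0)               # [B, embed_dim] global features
        f1 = self.encoder(image_t1)
        return self.head(torch.abs(f0 - f1))      # change / no-change logits

# Toy stand-in encoder; in TAO this would be an NvDINOv2, C-RADIOv2, or RADIOv2.5 backbone.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
model = SiameseChangeClassifier(encoder, embed_dim=256)

reference = torch.randn(4, 3, 224, 224)   # images of the "before" / golden state
test = torch.randn(4, 3, 224, 224)        # images to compare against the reference
print(model(reference, test).shape)       # [4, 2]
```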

To learn more about using a foundational model as a backbone for a change detection task, refer to the Visual ChangeNet - Segmentation Example Spec File for ViT Backbones and the Visual ChangeNet - Classification Example Spec File for ViT Backbones.