Foundation Models#

Foundation models are large-scale pretrained models that serve as the backbone for a wide range of computer vision and multi-modal applications. These models are typically trained with self-supervised or semi-supervised algorithms over large-scale datasets. The main goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.

Early examples of foundation models were pretrained language models (LMs), including Google’s BERT and the early GPT (Generative Pre-Trained Transformer) models, most notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be specialized into task- and domain-specific models using targeted datasets of various kinds, for example, datasets of medical codes.

Having started primarily as models for text- and language-based applications, foundation models have since evolved to support computer vision and multi-modal applications, such as DALL-E and Flamingo.

TAO v6.0.0 introduces the ability to:

  • Pretrain or adapt them to your domain given a corpus of unstructured data, or

  • Fine-tune and use them for downstream computer vision tasks, such as:

    • Image classification

    • Object detection

    • Semantic segmentation

    • Change detection

The NVIDIA-trained foundation models supported for domain adaptation and downstream finetuning in TAO are described below.

These models simplify the use of images in downstream systems by producing all-purpose visual features: features that work across image distributions and tasks without fine-tuning.

Trained on large curated datasets, NVIDIA’s models have learned robust, fine-grained representations that are useful for both localization and classification tasks.

They can be used as foundation models for a variety of downstream tasks with only a few labeled examples. For more details on the method, see DINOv2.
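
As a concrete illustration of this linear-probe workflow, the sketch below extracts frozen features and fits a simple linear classifier on a handful of labeled examples. It uses the publicly released DINOv2 ViT-B/14 weights from torch.hub purely as a stand-in for the NVIDIA-trained backbones (which are delivered through TAO and NGC); the random data, class count, and the scikit-learn probe are illustrative assumptions, not part of the TAO workflow.

```python
import torch
from sklearn.linear_model import LogisticRegression

# Public DINOv2 ViT-B/14 weights as a stand-in for NV-DINOv2 (illustrative only).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: [N, 3, 224, 224], normalized. Returns [N, 768] global (CLS) features."""
    return backbone(images)

# Toy few-shot setup: a handful of labeled examples across 4 hypothetical classes.
images = torch.randn(16, 3, 224, 224)        # stand-in for real, preprocessed images
labels = torch.randint(0, 4, (16,)).numpy()

features = extract_features(images).numpy()

# A plain linear classifier on top of the frozen, all-purpose features.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```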

Image Classification with Foundational Model#

TAO supports finetuning of the following foundational vision encoders from NVIDIA for image classification:

| Foundation Model | Classification Head |
|------------------|---------------------|
| NvDINOv2         | Linear              |
| RADIOv2.5        | Linear              |
| C-RADIOv2        | Linear              |

The foundational models provide rich visual representations that can be effectively leveraged for classification tasks through a simple linear head. These models, pretrained on large-scale datasets, can be fine-tuned with minimal labeled data to achieve strong performance on specific classification tasks.
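
Conceptually, the resulting classification network is just the pretrained encoder followed by one linear layer. The sketch below shows that wiring in plain PyTorch, again using the public DINOv2 ViT-B/14 weights as a stand-in for the TAO-delivered backbones; the class count, learning rates, and random batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10   # hypothetical number of classes
EMBED_DIM = 768    # feature width of the ViT-B/14 stand-in backbone

# Public DINOv2 weights as a stand-in for the TAO-delivered encoders.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
head = nn.Linear(EMBED_DIM, NUM_CLASSES)
model = nn.Sequential(backbone, head)

# Typical fine-tuning setup: small learning rate on the pretrained encoder,
# larger learning rate on the freshly initialized linear head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ]
)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random tensors standing in for a real batch.
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, NUM_CLASSES, (8,))

loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("loss:", float(loss))
```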

To learn more about using a foundational model as a backbone for an image classification task, refer to the section on Image Classification PyTorch.

TAO also supports finetuning of the vision encoders from multi-modal foundation models for image classification. The supported multi-modal foundation models include:

  • NVIDIA CLIP Image Backbones:

| Foundation Model | Classification Head |
|------------------|---------------------|
| NVCLIP           | Linear              |

  • OpenAI CLIP Image Backbones:

| Arch     | Pretrained Dataset | in_channels |
|----------|--------------------|-------------|
| ViT-B-32 | laion400m_e31, laion400m_e32, laion2b_e16, laion2b_s34b_b79k, datacomp_m_s128m_b4k, openai | 512 |
| ViT-B-16 | laion400m_e31      | 512  |
| ViT-L-14 | laion400m_e31      | 768  |
| ViT-H-14 | laion2b_s32b_b79k  | 1024 |
| ViT-g-14 | laion2b_s12b_b42k  | 1024 |

  • EVA-CLIP Image Backbones:

| Arch            | Pretrained Dataset | in_channels |
|-----------------|--------------------|-------------|
| EVA02-L-14      | merged2b_s4b_b131k | 768  |
| EVA02-L-14-336  | laion400m_e31      | 768  |
| EVA02-E-14      | laion400m_e31      | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k  | 1024 |
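
The Arch and Pretrained Dataset columns in the two tables above correspond to the model and pretrained tags used by the open_clip_torch library, and in_channels is the width of the image embedding each encoder produces. The following is a minimal sketch of verifying this outside of TAO; it assumes open_clip_torch is installed (the EVA02 tags additionally require timm and a reasonably recent open_clip release).

```python
import torch
import open_clip

# (arch, pretrained) pairs taken from the tables above.
configs = [
    ("ViT-B-32", "laion2b_s34b_b79k"),     # expected 512-dim image embedding
    ("EVA02-L-14", "merged2b_s4b_b131k"),  # expected 768-dim image embedding
]

for arch, pretrained in configs:
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=pretrained)
    model.eval()
    with torch.no_grad():
        dummy = torch.randn(1, 3, 224, 224)  # `preprocess` would be applied to real PIL images
        feats = model.encode_image(dummy)
    print(arch, pretrained, "->", feats.shape[-1])  # matches the in_channels column
```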

Object Detection with Foundational Model#

TAO supports finetuning of the following foundational models for object detection:

| Foundation Model | Detection Architecture |
|------------------|------------------------|
| NvDINOv2         | DINO                   |
| RADIOv2.5        | RT-DETR                |
| C-RADIOv2        | RT-DETR                |

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter and Frozen-DETR style architectures with DINO and RT-DETR, respectively. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
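
ViT-Adapter and Frozen-DETR are considerably more involved than this, but the problem they address is easy to see: a plain ViT emits a single-scale grid of patch tokens, whereas detection heads expect multi-scale 2D feature maps. The sketch below builds a simple feature pyramid from single-scale tokens, in the spirit of ViTDet's simple feature pyramid; it is an illustrative stand-in, not the actual ViT-Adapter or Frozen-DETR modules, and all shapes and channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Turn single-scale ViT patch tokens [B, N, C] into multi-scale 2D maps."""

    def __init__(self, embed_dim: int = 768, out_dim: int = 256, patch: int = 16):
        super().__init__()
        self.patch = patch
        # One 1x1 projection per output scale (strides 8, 16, and 32).
        self.proj = nn.ModuleDict({
            "p8": nn.Conv2d(embed_dim, out_dim, 1),
            "p16": nn.Conv2d(embed_dim, out_dim, 1),
            "p32": nn.Conv2d(embed_dim, out_dim, 1),
        })

    def forward(self, tokens: torch.Tensor, image_hw: tuple) -> dict:
        h, w = image_hw[0] // self.patch, image_hw[1] // self.patch
        b, n, c = tokens.shape
        assert n == h * w, "expects patch tokens only (no CLS token)"
        fmap = tokens.transpose(1, 2).reshape(b, c, h, w)  # single-scale map at stride 16
        return {
            "p8": self.proj["p8"](F.interpolate(fmap, scale_factor=2, mode="bilinear")),
            "p16": self.proj["p16"](fmap),
            "p32": self.proj["p32"](F.max_pool2d(fmap, kernel_size=2)),
        }

# Fake ViT output: 14x14 patch tokens for a 224x224 image with 16x16 patches.
tokens = torch.randn(2, 14 * 14, 768)
pyramid = SimpleFeaturePyramid()(tokens, (224, 224))
for name, feat in pyramid.items():
    print(name, tuple(feat.shape))
```

ViT-Adapter instead injects convolution-based spatial priors into the ViT through cross-attention, and Frozen-DETR keeps the foundation encoder frozen while the detector consumes its features; the spec file referenced below shows the configurations TAO actually uses.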

To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Spec File for ViT Backbones.

Semantic Segmentation with Foundational Model#

TAO supports finetuning of the following foundational models for semantic segmentation:

| Foundation Model | Segmentation Architecture |
|------------------|---------------------------|
| NvDINOv2         | SegFormer                 |
| RADIOv2.5        | SegFormer                 |
| C-RADIOv2        | SegFormer                 |

These foundational models, pretrained on large-scale datasets, provide rich visual representations that can be effectively leveraged for dense prediction tasks through the SegFormer architecture.
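
In TAO, these backbones feed the SegFormer decoder. As a simplified illustration of why patch-level features transfer well to dense prediction, the sketch below runs a per-patch linear segmentation probe on frozen DINOv2 patch tokens and upsamples the result to pixel resolution; the public torch.hub weights, class count, and the linear probe itself are stand-ins, not the SegFormer head that TAO trains.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 21   # hypothetical number of segmentation classes
PATCH = 14         # DINOv2 ViT-B/14 patch size
EMBED_DIM = 768

# Public DINOv2 weights as a stand-in for the TAO-delivered encoders.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

head = nn.Conv2d(EMBED_DIM, NUM_CLASSES, kernel_size=1)  # per-patch linear classifier

images = torch.randn(2, 3, 224, 224)   # stand-in for real, preprocessed images
with torch.no_grad():
    tokens = backbone.forward_features(images)["x_norm_patchtokens"]  # [B, N, C]

b, n, c = tokens.shape
h = w = images.shape[-1] // PATCH                   # 16 x 16 patch grid
fmap = tokens.transpose(1, 2).reshape(b, c, h, w)   # [B, C, 16, 16]

logits = head(fmap)                                                    # [B, NUM_CLASSES, 16, 16]
masks = F.interpolate(logits, size=images.shape[-2:], mode="bilinear")
print(masks.shape)                                                     # [2, 21, 224, 224]
```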

To learn more about using a foundational model as a backbone for a semantic segmentation task, refer to Example Spec File for ViT Backbones.

Change Detection with Foundational Model#

TAO supports finetuning of the following foundational models for Visual ChangeNet, covering both classification and segmentation:

| Foundation Model | Change Detection Architecture |
|------------------|-------------------------------|
| NvDINOv2         | Visual ChangeNet              |
| C-RADIOv2        | Visual ChangeNet              |
| RADIOv2.5        | Visual ChangeNet              |

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
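
Visual ChangeNet itself is a transformer architecture with a dedicated change decoder, but the basic pattern of change detection on top of a shared foundation backbone is simple: encode both images with the same encoder, compare the features, and predict change from their difference. The sketch below is a minimal, illustrative version of that pattern only; the toy encoder and class count are assumptions, not the Visual ChangeNet architecture.

```python
import torch
import torch.nn as nn

class SiameseChangeClassifier(nn.Module):
    """Toy change classifier: shared encoder, feature difference, linear head."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                    # shared weights for both time steps
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, image_t0: torch.Tensor, image_t1: torch.Tensor) -> torch.Tensor:
        f0 = self.encoder(image_t0)               # [B, embed_dim] global features
        f1 = self.encoder(image_t1)
        return self.head(torch.abs(f0 - f1))      # change / no-change logits

# Toy stand-in encoder; in TAO this would be an NvDINOv2, C-RADIOv2, or RADIOv2.5 backbone.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
model = SiameseChangeClassifier(encoder, embed_dim=256)

reference = torch.randn(4, 3, 224, 224)   # images of the "before" / golden state
test = torch.randn(4, 3, 224, 224)        # images to compare against the reference
print(model(reference, test).shape)       # [4, 2]
```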

To learn more about using a foundational model as a backbone for a change detection task, refer to the Visual ChangeNet - Segmentation Example Spec File for ViT Backbones and the Visual ChangeNet - Classification Example Spec File for ViT Backbones.