Foundation Models#
Foundation models are large-scale pretrained models that serve as the backbone for a wide range of computer vision and multi-modal applications. These models are typically trained with self-supervised or semi-supervised algorithms over large-scale datasets. The main goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.
Early examples of foundation models were pretrained language models (LMs), including Google’s BERT and early GPT (Generative Pre-Trained Transformer) models, most notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be specialized into task- and domain-specific models using targeted datasets of various kinds, such as medical codes.
Having started as models primarily for text- and language-based applications, foundation models have evolved to support computer vision and multi-modal applications, such as DALL-E and Flamingo.
TAO v6.0.0 introduces the ability to:
Pretrain foundation models or adapt them to your domain, given a corpus of unstructured data, or
Fine-tune and use them for downstream computer vision tasks, such as:
Image classification
Object detection
Semantic segmentation
Change detection
The NVIDIA-trained foundation models supported for domain adaptation and downstream fine-tuning in TAO include:
These models simplify the use of images in downstream systems by producing all-purpose visual features that work across image distributions and tasks without fine-tuning.
Trained on large curated datasets, NVIDIA’s models have learned robust, fine-grained representations that are useful for localization and classification tasks.
They can be used as foundation models for a variety of downstream tasks with only a few labeled examples. For more details on the method, see DINOv2.
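As a rough illustration of this workflow (outside of TAO), the sketch below extracts frozen DINOv2 features through the publicly documented torch.hub entry point and fits a linear probe on a handful of labeled examples; the dataset tensors here are random stand-ins for your own data.

```python
# Minimal sketch (not TAO-specific): linear probing on frozen DINOv2 features.
import torch
from sklearn.linear_model import LogisticRegression

# Load a pretrained DINOv2 backbone from torch.hub (weights download on first use).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), normalized; returns (N, embed_dim) CLS features."""
    return backbone(images)

# Stand-in data: replace with your own small labeled dataset.
train_images, train_labels = torch.randn(8, 3, 224, 224), [0, 1] * 4
val_images = torch.randn(2, 3, 224, 224)

# Fit a simple linear classifier on top of the frozen features.
train_feats = extract_features(train_images).cpu().numpy()
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
val_preds = probe.predict(extract_features(val_images).cpu().numpy())
```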
Image Classification with Foundational Model#
TAO supports finetuning of the following foundational vision encoders from NVIDIA for image classification:
The foundational models provide rich visual representations that can be effectively leveraged for classification tasks through a simple linear head. These models, pretrained on large-scale datasets, can be fine-tuned with minimal labeled data to achieve strong performance on specific classification tasks.
To learn more about using a foundational model as a backbone for an image classification task, refer to the section on Image Classification PyTorch.
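The sketch below illustrates the frozen-encoder-plus-linear-head pattern described above in plain PyTorch. The toy encoder, embedding size, and class count are illustrative assumptions, not TAO's actual implementation; any backbone that maps an image batch to a fixed-size feature vector (for example a CLIP or DINOv2 visual encoder) can take its place.

```python
# Sketch of "frozen foundation-model encoder + linear classification head".
import torch
import torch.nn as nn

class LinearHeadClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int,
                 freeze_encoder: bool = True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            # Linear probing: only the head receives gradients.
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)   # (N, embed_dim)
        return self.head(feats)        # (N, num_classes) logits

# Toy encoder standing in for a real foundation-model backbone.
toy_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
model = LinearHeadClassifier(toy_encoder, embed_dim=512, num_classes=10)
logits = model(torch.randn(4, 3, 224, 224))
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
```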
TAO also supports finetuning of the vision encoders from multi-modal foundation models for image classification.
The supported multi-modal foundation models include:
NVIDIA CLIP Image Backbones:

| Foundation Model | Classification Head |
|---|---|
|  | Linear |
OpenAI CLIP Image Backbones:

| Arch | Pretrained Dataset | in_channels |
|---|---|---|
| ViT-B-32 | laion400m_e31, laion400m_e32, laion2b_e16, laion2b_s34b_b79k, datacomp_m_s128m_b4k, openai | 512 |
| ViT-B-16 | laion400m_e31 | 512 |
| ViT-L-14 | laion400m_e31 | 768 |
| ViT-H-14 | laion2b_s32b_b79k | 1024 |
| ViT-g-14 | laion2b_s12b_b42k | 1024 |
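As a rough sanity check of the table above (outside of TAO), the following sketch assumes the open_clip_torch package and shows how one arch/pretrained-dataset pair maps to the listed in_channels value, which is the width of the image embedding the encoder produces.

```python
# Sketch: instantiate a CLIP image backbone and inspect its embedding width.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e31")
model.eval()

with torch.no_grad():
    image_features = model.encode_image(torch.randn(1, 3, 224, 224))

print(image_features.shape)  # torch.Size([1, 512]) -> in_channels = 512
```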
EVA-CLIP Image Backbones:

| Arch | Pretrained Dataset | in_channels |
|---|---|---|
| EVA02-L-14 | merged2b_s4b_b131k | 768 |
| EVA02-L-14-336 | laion400m_e31 | 768 |
| EVA02-E-14 | laion400m_e31 | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k | 1024 |
Object Detection with Foundational Model#
TAO supports finetuning of the following foundational models for object detection:
To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter and Frozen-DETR style architectures with DINO and RT-DETR, respectively. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Spec File for ViT Backbones.
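To make the motivation concrete, the sketch below shows the simpler, ViTDet-style trick of resampling a plain ViT's single-scale patch tokens into a multi-scale pyramid for a dense-prediction head. It is only a conceptual stand-in for the ViT-Adapter and Frozen-DETR designs used by TAO, whose injector and attention mechanisms are considerably more involved.

```python
# Conceptual sketch: single-scale ViT tokens -> multi-scale feature pyramid.
import torch
import torch.nn.functional as F

def tokens_to_pyramid(tokens: torch.Tensor, grid_hw: tuple[int, int]):
    """tokens: (N, H*W, C) patch embeddings from a ViT at stride 16."""
    n, _, c = tokens.shape
    h, w = grid_hw
    fmap = tokens.transpose(1, 2).reshape(n, c, h, w)   # single-scale map
    return {
        "stride8":  F.interpolate(fmap, scale_factor=2, mode="bilinear"),
        "stride16": fmap,
        "stride32": F.max_pool2d(fmap, kernel_size=2),
    }

# e.g. a 512x512 image with 16x16 patches gives a 32x32 token grid
pyramid = tokens_to_pyramid(torch.randn(2, 32 * 32, 768), (32, 32))
print({k: tuple(v.shape) for k, v in pyramid.items()})
```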
Semantic Segmentation with Foundational Model#
TAO supports finetuning of the following foundational models for semantic segmentation:
| Foundation Model | Segmentation Architecture |
|---|---|
|  | SegFormer |
|  | SegFormer |
|  | SegFormer |
These foundational models, pretrained on large-scale datasets, provide rich visual representations that can be effectively leveraged for dense prediction tasks through the SegFormer architecture.
To learn more about using a foundational model as a backbone for a semantic segmentation task, refer to Example Spec File for ViT Backbones.
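The sketch below illustrates the general idea behind SegFormer's lightweight all-MLP decode head over multi-scale backbone features: per-scale features are linearly projected, upsampled to a common resolution, fused, and classified. The channel counts and number of scales are illustrative assumptions rather than TAO's exact configuration.

```python
# Sketch of a SegFormer-style all-MLP decode head over multi-scale features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecodeHead(nn.Module):
    def __init__(self, in_channels: list[int], embed_dim: int, num_classes: int):
        super().__init__()
        # 1x1 convolutions act as per-pixel linear projections of each scale.
        self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)
        self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        target = feats[0].shape[-2:]   # finest resolution
        ups = [F.interpolate(p(f), size=target, mode="bilinear")
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))

# Illustrative channel widths and spatial sizes for a 4-scale backbone.
channels, sizes = [256, 512, 1024, 1024], [128, 64, 32, 16]
head = MLPDecodeHead(channels, embed_dim=256, num_classes=19)
feats = [torch.randn(1, c, s, s) for c, s in zip(channels, sizes)]
masks = head(feats)   # (1, 19, 128, 128) logits at 1/4 of the input resolution
```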
Change Detection with Foundational Model#
TAO supports finetuning of the following foundational models for Visual ChangeNet - Classification and Segmentation:
| Foundation Model | Detection Architecture |
|---|---|
|  | Visual ChangeNet |
|  | Visual ChangeNet |
|  | Visual ChangeNet |
To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
To learn more about using a foundational model as a backbone for a change detection task, refer to the Visual ChangeNet - Segmentation Example Spec File for ViT Backbones and the Visual ChangeNet - Classification Example Spec File for ViT Backbones.
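As a conceptual illustration of bi-temporal change detection, the sketch below applies a shared (Siamese) backbone to the “before” and “after” images and classifies the feature difference. It is a simplified stand-in for the general pattern, not the Visual ChangeNet architecture shipped with TAO.

```python
# Conceptual sketch: Siamese feature differencing for change detection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseChangeHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_channels: int, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone   # shared weights for both time steps
        self.head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

    def forward(self, before: torch.Tensor, after: torch.Tensor) -> torch.Tensor:
        f0, f1 = self.backbone(before), self.backbone(after)
        change = torch.abs(f1 - f0)   # feature difference highlights changed regions
        logits = self.head(change)
        # Upsample change logits back to the input resolution.
        return F.interpolate(logits, size=before.shape[-2:], mode="bilinear")

# Toy backbone standing in for a real foundation-model feature extractor.
toy_backbone = nn.Conv2d(3, 64, kernel_size=16, stride=16)
model = SiameseChangeHead(toy_backbone, feat_channels=64)
mask_logits = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```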