Transformers For Vision
Lecture 5
Outline
● Background on Transformer models
● Transformers for image classification
● [Admin interlude]
● Perceiver models [guest talk from Drew Jaegle, DeepMind]
Alexey Dosovitskiy
Transformers
for Computer Vision
EEML summer school
July 7th 2021, Budapest (virtually)
AlexNet
● AlexNet (2012) - first big success of deep learning in vision*
* ConvNets had previously shown good results on specialized datasets like handwritten digits (LeCun et al.) or traffic signs
(Cireșan et al.), but not on large and diverse “natural” datasets
Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
ResNet
● ResNet (2015) - makes deep models train well by adding residual connections
Kaiming He et al., Deep Residual Learning for Image Recognition, CVPR 2016
Transformer
○ Not a vision-specific model
○ Typically applied to 1-D sequence data
● Transformer “encoder”
○ A stack of alternating self-attention and MLP blocks
○ Residuals and LayerNorm
Vaswani et al., Attention Is All You Need, NeurIPS 2017
Figure from Vaswani et al.
● Transformer “decoder” (not shown)
○ A slightly more involved architecture useful when the
output space is different from the input space (e.g.
translation)
Self-attention
● Each of the tokens (=vectors) attends to all tokens
Vaswani et al., Attention Is All You Need, NeurIPS 2017
Simplified! Multi-headed attention not shown
○ Extra tricks: learned key, query, and value
projections, inverse-sqrt scaling in the softmax, and
multi-headed attention (omitted here for simplicity)
● It’s a set operation (permutation-invariant)
○ ...and hence we need “position embeddings” to
“remember” the spatial structure
● It’s a global operation
○ Aggregates information from all tokens
Self-Attention with Queries, Keys, Values
Make three versions of each input embedding x(i): a query q(i), a key k(i), and a value v(i)
[Julia Hockenmaier, Lecture 9, UIUC]
Transformer self-attention layer
Input: X, a matrix of n embedding vectors, each of dimension m (X is n x m)
Parameters (learned): query, key, and value projection matrices W_Q, W_K, W_V
Compute: Q = X W_Q, K = X W_K, V = X W_V; then H = softmax(Q K^T / sqrt(d_k)) V
Transformer self-attention
Self-attention explicitly models interactions between all pairs of input embeddings
n x n attention map A = softmax(Q K^T / sqrt(d_k)) (each row sums to 1)
Output matrix H = A V
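
To make the computation above concrete, here is a minimal single-head self-attention sketch in numpy (illustrative only: the projections are random rather than learned, and multi-head splitting and other details are omitted):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (n, m) matrix of n input embeddings of dimension m
    # W_q, W_k, W_v: (m, d) projection matrices (learned in the real model, random here)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v         # queries, keys, values
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))           # (n, n) attention map, each row sums to 1
    return A @ V                                # output H = A V, shape (n, d)

# toy usage: n = 4 tokens, embedding dim m = 8, head dim d = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
H = self_attention(X, W_q, W_k, W_v)            # (4, 8)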
[Julia Hockenmaier, Lecture 9, UIUC]
Multi-Head attention
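
The multi-head version (shown as a figure on this slide) just runs several such attentions in parallel on lower-dimensional slices of the projections and concatenates the results. A sketch continuing the numpy example above (it reuses the softmax helper defined there; the head count and output projection W_o are the standard choices, not anything specific to this lecture):

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (n, m); W_q, W_k, W_v, W_o: (m, m); num_heads must divide m
    n, m = X.shape
    d = m // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # split the channel dimension into heads: (n, m) -> (num_heads, n, d)
    split = lambda M: M.reshape(n, num_heads, d).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d))   # (num_heads, n, n) per-head maps
    Hh = A @ Vh                                            # (num_heads, n, d)
    H = Hh.transpose(1, 0, 2).reshape(n, m)                # concatenate the heads
    return H @ W_o                                         # final output projection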
Positional Encoding (1-D)
How to capture sequence order?
Add positional embeddings to input embeddings
- Same dimension
- Can be learned or fixed
Fixed encoding: sin / cos of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
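
A numpy sketch of that fixed encoding (the 10000 base and the even/odd sin / cos interleaving follow Vaswani et al.; d_model is assumed even):

import numpy as np

def sinusoidal_position_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(n_positions)[:, None]             # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# added (not concatenated) to the input embeddings, which have the same dimension:
# X = token_embeddings + sinusoidal_position_encoding(n, d_model)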
ConvNet vs Transformer
(Diagram: a ConvNet stacks convolution layers on the input; a Transformer encoder stacks alternating self-attention and MLP blocks; both operate on a spatial x channels array and are trained end-to-end)
● Convolutions (with kernels > 1x1) mix both the channels and the spatial locations
● MLPs (= 1x1 convs) only mix the channels, per location
● Self-attention mixes the spatial locations (and the channels a bit)
*ResNets have groups of 1x1 convolutions that are nearly identical to the transformer’s MLPs
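
A quick numpy check of the claim that a 1x1 convolution is just a per-location linear (MLP) layer, i.e. it mixes channels but never spatial locations (shapes and values below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
H, W, C_in, C_out = 8, 8, 16, 32
x = rng.normal(size=(H, W, C_in))        # feature map, channels last
w = rng.normal(size=(C_in, C_out))       # a 1x1 conv kernel is just a C_in x C_out matrix

# "1x1 convolution": apply the same channel-mixing matrix at every spatial location
y_conv = np.einsum('hwc,cd->hwd', x, w)

# per-location linear (MLP) layer: flatten the locations, multiply, reshape back
y_mlp = (x.reshape(-1, C_in) @ w).reshape(H, W, C_out)

assert np.allclose(y_conv, y_mlp)        # identical outputs: no spatial mixing occurred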
BERT model in NLP
● Transformers pre-trained with self-supervised objectives perform great on many NLP tasks
○ Masked language modeling (MLM)
○ Next sentence prediction (NSP)
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv 2018
Figure from Devlin et al.
Masked language modeling with Transformers (in NLP)
Training: predict masked tokens (mask 15% at a time)
(Diagram: input sentence with some tokens replaced by a mask token [M] → Transformer encoder (x N blocks) → MLP head → output predictions at the masked positions)
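
A rough sketch of the masking step in numpy (the token ids, the [MASK] id, and the -100 “ignore” convention are illustrative; BERT’s full recipe also sometimes keeps or randomly replaces the selected tokens):

import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([101, 2023, 2003, 1037, 7953, 6251, 102])  # hypothetical token ids
MASK_ID, mask_prob = 103, 0.15

is_masked = rng.random(tokens.shape) < mask_prob   # pick ~15% of positions at random
inputs = np.where(is_masked, MASK_ID, tokens)      # replace them with the mask token
targets = np.where(is_masked, tokens, -100)        # compute the loss only at masked positions

# `inputs` goes through the Transformer encoder (x N blocks); an MLP head predicts `targets`.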
T5, GPT-3
● T5 (Text-to-Text Transfer Transformer)
○ Formulate many NLP tasks as text-to-text
○ Pre-train a large transformer BERT-style and show that it transfers really well
Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, JMLR 2020
Brown et al., Language Models are Few-Shot Learners, NeurIPS 2020
Large-scale self-supervised pre-training “solved”* NLP
*at least made some really impressive progress
● GPT-3 (Generative Pre-Training)
○ Same basic approach, but generative pre-training and even larger model
○ Zero-/few-shot transfer to many tasks: no need for fine-tuning!
Transformers for image classification
Transformers for vision?
● “LSTM → Transformer” ~ “ConvNet → ??? ”
● Issue with self-attention for vision: computation is quadratic in the input
sequence length, so it quickly gets very expensive (beyond a few thousand tokens)
○ For ImageNet: 224x224 pixels → one token per pixel gives a sequence length of ~50,000
○ Even worse for higher resolution and video
How can we deal with this quadratic complexity?
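
Back-of-the-envelope numbers behind the quadratic blow-up (per layer and per head, ignoring constants); the patch=16 case previews the ViT solution discussed below:

def attention_map_size(height, width, patch=1):
    n = (height // patch) * (width // patch)    # sequence length (tokens)
    return n, n * n                             # tokens, entries in the n x n attention map

print(attention_map_size(224, 224))             # one token per pixel: n = 50176, n^2 ~ 2.5e9
print(attention_map_size(224, 224, patch=16))   # 16x16 patches: n = 196, n^2 = 38416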
Local Self-Attention
Idea: Make self-attention local, use it instead of convs in a ResNet
(Figures: convolution vs. local self-attention, from Ramachandran et al.)
Zhao et al., Exploring Self-attention for Image Recognition, CVPR 2020
Ramachandran et al., Stand-Alone Self-Attention in Vision Models, NeurIPS 2019
Hu et al., Local Relation Networks for Image Recognition, ICCV 2019
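
A minimal numpy sketch of the idea, where each location attends only to a k x k window around it (single head, shared projections, zero padding at the borders; the cited papers add relative position terms and other details not shown here):

import numpy as np

def local_self_attention(x, w_q, w_k, w_v, k=7):
    # x: (H, W, C) feature map; w_q, w_k, w_v: (C, C) projections
    H, W, C = x.shape
    q, key, val = x @ w_q, x @ w_k, x @ w_v
    pad = k // 2
    key_p = np.pad(key, ((pad, pad), (pad, pad), (0, 0)))   # zero-pad the borders
    val_p = np.pad(val, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(val)
    for i in range(H):
        for j in range(W):
            K_win = key_p[i:i + k, j:j + k].reshape(-1, C)  # the k*k keys in the window
            V_win = val_p[i:i + k, j:j + k].reshape(-1, C)
            logits = K_win @ q[i, j] / np.sqrt(C)
            a = np.exp(logits - logits.max())
            a /= a.sum()                                    # softmax over the window only
            out[i, j] = a @ V_win                           # weighted sum of window values
    return out
# (padded positions are attended to for simplicity; real implementations mask them out)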
Axial Self-Attention
Idea: Make self-attention 1D (a.k.a. axial), use it instead of convs
(Figure: axial attention block, from Wang et al.)
Wang et al., Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation, ECCV 2020
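
A numpy sketch of the factorisation: attend along each row (width axis), then along each column (height axis), dropping the cost from O((HW)^2) to O(HW(H + W)). Single head and shared projections for both axes here; the real axial block uses separate parameters, multiple heads, and position-sensitive terms:

import numpy as np

def axial_self_attention(x, w_q, w_k, w_v):
    # x: (H, W, C) feature map; w_q, w_k, w_v: (C, C) projections
    def attend_1d(seq):                                    # seq: (n, C) -> (n, C)
        Q, K, V = seq @ w_q, seq @ w_k, seq @ w_v
        logits = Q @ K.T / np.sqrt(seq.shape[-1])
        A = np.exp(logits - logits.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)                 # row-wise softmax
        return A @ V
    x = np.stack([attend_1d(row) for row in x])            # attention along the width axis
    x = np.stack([attend_1d(col) for col in x.transpose(1, 0, 2)])
    return x.transpose(1, 0, 2)                            # attention along the height axis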
Vision Transformer (ViT)
Idea: Take a transformer and apply it directly to image patches
Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Cordonnier et al., On the Relationship between Self-Attention and Convolutional Layers, ICLR 2020
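
A minimal numpy sketch of the ViT front end this idea describes: cut the image into 16x16 patches, flatten and linearly project each patch, prepend a learnable [class] token, and add learned position embeddings; the resulting (n_patches + 1) x d sequence then goes through a standard Transformer encoder. Parameter names here are illustrative, and in the real model they are learned:

import numpy as np

def vit_embed(image, w_patch, cls_token, pos_embed, patch=16):
    # image: (H, W, 3); w_patch: (patch*patch*3, d); cls_token: (1, d); pos_embed: (n_patches+1, d)
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = (image[:gh * patch, :gw * patch]             # crop to a multiple of the patch size
               .reshape(gh, patch, gw, patch, C)
               .transpose(0, 2, 1, 3, 4)                   # (gh, gw, patch, patch, C)
               .reshape(gh * gw, patch * patch * C))       # one flattened vector per patch
    tokens = patches @ w_patch                             # linear projection to model width d
    tokens = np.concatenate([cls_token, tokens])           # prepend the [class] token
    return tokens + pos_embed                              # add learned 1-D position embeddings

# e.g. a 224x224 RGB image with 16x16 patches gives 14*14 = 196 patch tokens (+ 1 class token)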
ResNet-ViT Hybrid
Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Bichen Wu et al. Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv 2020
Analysis: Learned Position Embeddings
Conclusion: Learns intuitive local structures, but also deviates from locality in interesting ways
Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Analysis: “Receptive Field Size”
Conclusion: Initial layers are partially local, deeper layers are global
Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Scaling with Data
ViT overfits on ImageNet*/**, but shines on larger datasets
Key: ViT = Vision Transformer; BiT = Big Transfer (~ResNet)
* With heavy regularization, ViT has been shown to also work on ImageNet (Touvron et al.)
** Training ViT on ImageNet with the sharpness-aware minimizer (SAM) also works very well (Chen et al.)
Touvron et al., Training data-efficient image transformers & distillation through attention, arXiv 2020
Xiangning Chen et al., When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, arXiv 2021
Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Scaling with Compute
Given sufficient data, ViT gives good performance/FLOP
Hybrids yield benefits only for smaller models
(Plot x-axis: pre-training compute, in ExaFLOPs)
Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Scaling Laws
How many images do you need for a big model & vice-versa?
Power-law behaviour: E = a·C^(-b)
Saturates before y = 0
Small models can’t make use of >30M examples
Even 300M examples are insufficient for large models
Xiaohua Zhai et al., Scaling Vision Transformers, arXiv 2021
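
Written out, the curve being described is a power-law decay in compute C with an additive floor, which is why it “saturates before y = 0”. A tiny sketch with placeholder coefficients (not the values fitted by Zhai et al.):

def scaling_law_error(C, a=0.5, b=0.35, e_inf=0.05):
    # E(C) = a * C**(-b) + e_inf: error falls as a power law in compute,
    # but plateaus at e_inf > 0. All coefficients here are hypothetical.
    return a * C ** (-b) + e_inf

# diminishing returns: scaling_law_error(1e2) ~ 0.15 vs scaling_law_error(1e6) ~ 0.054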
Scaling Laws
How many images do you need for a big model & vice-versa?
ViT-G/14: 2B params, 3B images
84.86% top-1 accuracy on 10-shot ImageNet (<1% of the train set)
(Plot panels: ImageNet SOTA; ImageNet variants (OOD / re-label); 19 diverse tasks)
Xiaohua Zhai et al., Scaling Vision Transformers, arXiv 2021
paperswithcode.com
Summary
Transformer model:
- Alternating layers of self-attention & MLP
- Very few assumptions built into model
- Trained end-to-end
- Easy to scale to be very wide & deep
- Originally applied to NLP (sequences of words)
- Lots of variants in architecture & application
Transformers in vision:
- How to represent image pixels?
- Individual pixels are too many tokens, given the quadratic scaling of self-attention
- Position in the 2D array is captured with position embeddings
- Below SOTA for small models/data (ConvNets/ResNets superior)
- SOTA at very large scale (100M-1B images)
Admin Interlude
HPC situation:
- Everyone should now have an HPC account
- Come and see me after if not!
HPC staff have set up a GCP account that we can use through the Greene login
- Class TAs will hold a session to explain this
Projects
- Time to start on projects
- A Google Doc with some ideas is posted on Piazza
- Will be adding more ideas
- Feel free to come up with your own
- Teams of 2 or 3 people (no teams of 1)
- Every team must chat with me about their proposed idea
- I will tell you if it is feasible/realistic or not.
