Transforming Deep into Transformers – A Computer Vision Approach
Asst. Prof. Dr. Ferdin Joe John Joseph
Data Science and Analytics Laboratory
Faculty of Information Technology
Thai-Nichi Institute of Technology, Bangkok
Keynote Talk to International Conference on Research and Development in Science, Engineering and Technology (ICRDSET 2021)
Concepts
▪ Deep Learning (CNN, RNN)
▪ Transformation to Vision Transformers (ViT)
Deep Learning
▪ Convolutional Neural Networks (CNN)
▪ Recurrent Neural Networks (RNN)
▪ Long Short Term Memory (LSTM)
Transformation
▪ The era of CNN, RNN and LSTM is coming to an end
▪ End? What Next?
Transformer Architecture
Architecture Decoded
▪ The encoder is on the left and the decoder on the right
▪ The encoder takes the input sequence
▪ Understanding the encoder is enough for ViT
Transformer
Encoder
▪ (1) The input data first gets embedded into vectors. The embedding layer gives us a learned vector representation for each word.
▪ (2) In the next stage a positional encoding is injected into the input embeddings. This is because a transformer has no idea about the order of the sequence being passed as input - for example a sentence. (A minimal sketch of steps (1) and (2) follows this list.)
▪ (3) Now the multi-headed attention is where things get a little different.
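To make steps (1) and (2) concrete, here is a minimal NumPy sketch; the vocabulary size, model dimension, sequence length and token ids are made-up values for illustration, and the sinusoidal positional encoding follows the formulation of the original Transformer paper.

```python
import numpy as np

# Toy sizes chosen only for illustration.
vocab_size, d_model, seq_len = 1000, 64, 10

# (1) Embedding: a learned lookup table mapping each token id to a d_model-dim vector
# (random here, standing in for the learned weights).
embedding_table = np.random.randn(vocab_size, d_model) * 0.02
token_ids = np.random.randint(0, vocab_size, size=seq_len)
token_embeddings = embedding_table[token_ids]            # (seq_len, d_model)

# (2) Sinusoidal positional encoding, injected by simple addition so the model
# can tell the positions in the sequence apart.
positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
dims = np.arange(d_model)[None, :]                       # (1, d_model)
angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
angles = positions * angle_rates
pos_encoding = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

encoder_input = token_embeddings + pos_encoding          # goes on to multi-headed attention
```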
▪ (4) Multi-Headed Attention consists of three learnable vectors: Query, Key and Value vectors. The motivation for this reportedly comes from information retrieval, where you search (query) and the search engine compares your query with a key and responds with a value.
▪ (5) The Q and K representations undergo a dot-product matrix multiplication to produce a score matrix, which represents how much each word has to attend to every other word. A higher score means more attention, and vice versa.
▪ (6) Then the score matrix is scaled down according to the dimensions of the Q and K vectors. This is to ensure more stable gradients, as the multiplication can have exploding effects. (A sketch of steps (5) and (6) follows this list.)
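A small NumPy sketch of steps (5) and (6); the shapes (10 tokens, 64-dimensional Q and K) and the random vectors are illustrative stand-ins for the learned projections.

```python
import numpy as np

seq_len, d_k = 10, 64                       # illustrative sizes
Q = np.random.randn(seq_len, d_k)           # query vectors (stand-ins for learned projections)
K = np.random.randn(seq_len, d_k)           # key vectors

# (5) Dot product of queries with keys: how much each word attends to every other word.
scores = Q @ K.T                            # (seq_len, seq_len) score matrix

# (6) Scale by sqrt(d_k) so the magnitudes, and hence the gradients, stay stable.
scaled_scores = scores / np.sqrt(d_k)
```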
▪ (7) Next the score matrix is softmaxed to turn the attention scores into probabilities. Higher scores are heightened and lower scores are depressed. This helps the model be confident about which words to attend to.
▪ (8) Then the resultant matrix of probabilities is multiplied with the Value vectors. This makes the words the model has learned to score highly more important, while the low-scoring words effectively drown out and become irrelevant.
▪ (9) Then the concatenated attention output (the probability-weighted V vectors) is fed into the Linear layer for further processing. (The complete scaled dot-product attention is sketched below.)
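Putting steps (5)-(8) together, here is a minimal sketch of scaled dot-product attention; the toy shapes and random inputs are assumptions for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score, scale, softmax, then weight the value vectors (steps (5)-(8))."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # scaled score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # (7) softmax to probabilities
    return weights @ V                                        # (8) low-scoring words drown out

# Toy usage: 10 tokens, 64-dimensional Q/K/V.
Q, K, V = (np.random.randn(10, 64) for _ in range(3))
attended = scaled_dot_product_attention(Q, K, V)              # (10, 64), then fed to a Linear layer
```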
▪ (10) Self-attention is performed for each word in the sequence. Since one computation doesn't depend on the others, copies of the self-attention module can process everything simultaneously, making this multi-headed. (A multi-head sketch follows.)
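A sketch of step (10): several independent heads of the same self-attention computation run over the sequence, and their outputs are concatenated and projected. The projection matrices here are random stand-ins for the learned weights, and the sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads=8, seed=0):
    """Each head attends independently; outputs are concatenated and linearly projected."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    head_outputs = []
    for _ in range(num_heads):
        # Heads are independent of one another, so they could equally run in parallel.
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))          # per-head attention probabilities
        head_outputs.append(weights @ V)
    W_o = rng.standard_normal((d_model, d_model)) * 0.02
    return np.concatenate(head_outputs, axis=-1) @ W_o        # (seq_len, d_model)

x = np.random.randn(10, 64)                                   # 10 tokens, d_model = 64
y = multi_head_self_attention(x)
```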
Encoder
▪ (11) Then the output value vectors are concatenated and added to the residual connection coming from the input layer, and the resultant representation is passed into a LayerNorm for normalization. (Residual connections help gradients flow through the network, and LayerNorm helps reduce the training time by a small fraction and stabilizes the network.)
▪ (12) Further, the output is passed into a point-wise feed-forward network to obtain an even richer representation.
▪ (13) The outputs are again layer-normed and residuals are added from the previous layer. (A sketch of one encoder block follows this list.)
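A minimal NumPy sketch of steps (11)-(13): residual addition plus LayerNorm around the attention output, then a point-wise feed-forward network with its own residual and LayerNorm. The weight matrices, the ReLU non-linearity and the FFN hidden size are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, attention_out, W1, b1, W2, b2):
    x = layer_norm(x + attention_out)                 # (11) residual + LayerNorm
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2      # (12) point-wise FFN with ReLU
    return layer_norm(x + ffn)                        # (13) residual + LayerNorm again

# Toy shapes: 10 tokens, d_model = 64, FFN hidden size 256.
x = np.random.randn(10, 64)
attn = np.random.randn(10, 64)                        # stand-in for the multi-head attention output
W1, b1 = np.random.randn(64, 256) * 0.02, np.zeros(256)
W2, b2 = np.random.randn(256, 64) * 0.02, np.zeros(64)
out = encoder_block(x, attn, W1, b1, W2, b2)          # richer representation, same shape as x
```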
▪ (14) The output from the encoder, along with the inputs (if any) from the previous time steps/words, is fed into the decoder, where the outputs undergo masked multi-headed attention before being fed into the next attention layer along with the output from the encoder.
▪ (15) Masked multi-headed attention is necessary because the network shouldn't have any visibility into the words that come later in the sequence while decoding, to ensure there is no leak. This is done by masking the entries of later words in the score matrix: the scores of the current and previous words in the sequence are left unchanged, while -inf is added to the scores of future words. This ensures the future words in the series get drowned out to 0 when performing the softmax to obtain the probabilities, while the rest are retained. (A sketch of such a mask follows this list.)
▪ (16) There are residual connections here as well, to improve the flow of gradients. Finally the output is sent to a Linear layer and softmaxed to obtain the outputs as probabilities.
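A small sketch of the look-ahead mask in step (15): -inf is added to the scores of future positions so that, after the softmax, those positions contribute exactly 0. The sequence length and score matrix are made up for illustration.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)                 # stand-in for the decoder score matrix

# Causal mask: 0 where attention is allowed (current and previous words),
# -inf above the diagonal (future words).
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
masked_scores = scores + mask

# Softmax: future positions get probability exactly 0, the rest are retained.
weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```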
NLP Transformers
▪ Transformer XL
▪ Google’s BERT (Bidirectional Encoder Representations
from Transformers)
Transformation
▪ Vision Transformers (ViT)
▪ https://arxiv.org/pdf/2010.11929.pdf
▪ Recently published article (November 2020)
▪ Most of the recent papers in top-tier conferences are using this architecture
ViT
Vision Transformer
Working
▪ (1) ViT uses only the encoder part of the transformer; the difference is in how the images are fed into the network.
▪ (2) The image is broken down into fixed-size patches. One of these patches can be of dimension 16x16 or 32x32 as proposed in the paper. More patches means the patches themselves are smaller, which makes these networks simpler to train. Hence the title - "An Image is Worth 16x16 Words".
▪ (3) The patches are then unrolled (flattened) and sent into the network for further processing. (A patch-extraction sketch follows this list.)
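Since the accompanying tutorial uses Keras/TensorFlow, here is a sketch of steps (2)-(3) with tf.image.extract_patches; the 224x224 input, the batch of one and the random image are assumptions for illustration, with the paper's 16x16 patch size.

```python
import tensorflow as tf

images = tf.random.uniform((1, 224, 224, 3))        # assumed: one 224x224 RGB image
patch_size = 16                                     # 16x16 patches, as in the paper's title

# (2) Cut the image into fixed-size, non-overlapping patches.
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)                                                   # (1, 14, 14, 16*16*3)

# (3) Unroll (flatten) each patch into a vector: a sequence of 196 "words" per image.
num_patches = (224 // patch_size) ** 2
flat_patches = tf.reshape(patches, (1, num_patches, patch_size * patch_size * 3))
```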
▪ (4) Unlike NNs, the model here has no idea whatsoever about the position of the samples in the sequence; each sample is a patch from the input image. So the image is fed along with a positional embedding vector into the encoder. One thing to note is that the positional embeddings are also learnable, so you don't actually feed hard-coded vectors w.r.t. their positions.
▪ (5) There is also a special token at the start, just like BERT in NLP.
▪ (6) So each image patch is first unrolled (flattened) into a big vector and gets multiplied with an embedding matrix, which is also learnable, creating embedded patches. These embedded patches are combined with the positional embedding vector, and that gets fed into the Transformer. (A sketch of steps (4)-(6) follows.)
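A Keras/TensorFlow sketch of steps (4)-(6), continuing from the flattened patches above; the model dimension of 768, the random inputs and the variable initializations are illustrative assumptions, not the paper's exact setup.

```python
import tensorflow as tf

num_patches, patch_dim, d_model = 196, 16 * 16 * 3, 768
flat_patches = tf.random.uniform((1, num_patches, patch_dim))    # stand-in for the previous step

# (6) Learnable linear projection of each flattened patch into an embedded patch.
patch_projection = tf.keras.layers.Dense(d_model)
embedded_patches = patch_projection(flat_patches)                # (1, 196, 768)

# (5) Prepend a learnable [class] token, just like BERT.
cls_token = tf.Variable(tf.zeros((1, 1, d_model)))
tokens = tf.concat([cls_token, embedded_patches], axis=1)        # (1, 197, 768)

# (4) Add learnable (not hard-coded) positional embeddings.
positional_embedding = tf.Variable(tf.random.normal((1, num_patches + 1, d_model), stddev=0.02))
encoder_input = tokens + positional_embedding                    # fed to the Transformer encoder
```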
▪ (7) The only difference is that, instead of a decoder, the output from the encoder is passed directly into a feed-forward neural network to obtain the classification output. (A sketch of such a classification head follows.)
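A sketch of step (7): the encoder output at the [class] token goes through a small feed-forward head to produce class probabilities. The encoder output, the hidden width and the 10-class output are illustrative assumptions.

```python
import tensorflow as tf

encoder_output = tf.random.uniform((1, 197, 768))     # stand-in for the Transformer encoder output
cls_representation = encoder_output[:, 0]             # representation of the [class] token, (1, 768)

# (7) Feed-forward classification head instead of a decoder.
mlp_head = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="gelu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # assumed 10 classes
])
class_probabilities = mlp_head(cls_representation)    # (1, 10)
```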
What’s new?
▪ Convolutions are not used
▪ Variants of ViT have no particular significance
ViT Performance
To Start With
▪ https://github.com/ferdinjoe/ViT
▪ Tutorial involving Keras, TensorFlow and ViT
▪ Contact Me: ferdin@tni.ac.th
