Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, Richard E. Turner
Variational continual learning
Presenter: Giang Nguyen
KAIST, May 2019
Contents
1. Introduction
2. Variational Continual Learning
3. Experiments
4. Conclusion
Introduction
Continual learning
• Expectations of CL
‒ Online learning: learning occurs at every moment
‒ Presence of transfer: able to transfer from previous tasks to new ones
‒ Resistance to catastrophic forgetting
‒ No direct access to previous experience
Challenge for Continual Learning
• We need a balance between adapting to recent data and retaining knowledge from old data because:
‒ Too much plasticity leads to the catastrophic forgetting problem
‒ Too much stability leads to an inability to adapt
Solutions
• IMM (Incremental Moment Matching): one approach trains individual models on each task and then carries out a second stage of training to combine them
Lee, Sang-Woo, et al. "Overcoming Catastrophic Forgetting by Incremental Moment Matching." NIPS 2017.
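For intuition, a minimal sketch of the simplest IMM variant (mean-IMM), which merges the per-task models by a weighted average of their parameters. This is an illustrative reading of the approach, not the authors' code; all names are made up.

```python
def mean_imm(task_params, weights=None):
    """Mean-IMM sketch: task_params is a list of dicts {param_name: array},
    one trained model per task; returns a weighted average of the parameters."""
    if weights is None:
        weights = [1.0 / len(task_params)] * len(task_params)
    merged = {}
    for name in task_params[0]:
        merged[name] = sum(w * p[name] for w, p in zip(weights, task_params))
    return merged

# Usage (hypothetical): merged = mean_imm([params_task1, params_task2])
```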
Solutions
• EWC (Elastic Weight Consolidation): a more elegant and more flexible approach that maintains a single model and uses a single type of regularized training, preventing drastic changes in the parameters that have a large influence on prediction while allowing other parameters to change more freely
Kirkpatrick, James, et al. "Overcoming Catastrophic Forgetting in Neural Networks." PNAS 2017.
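For intuition, a minimal sketch of an EWC-style quadratic penalty: a per-parameter importance weight (e.g. a diagonal Fisher estimate) pulls important weights toward their previous values. `old_params`, `fisher_diag` and `lam` are illustrative names, not from the paper.

```python
import torch

def ewc_penalty(model, old_params, fisher_diag, lam=100.0):
    """Quadratic penalty keeping parameters deemed important for the previous
    task (large diagonal Fisher) close to their old values.
    old_params / fisher_diag: dicts {param_name: tensor} saved after the
    previous task (hypothetical bookkeeping, not the paper's code)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher_diag:
            penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage (hypothetical): loss = task_loss + ewc_penalty(model, old_params, fisher_diag)
```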
Solutions
• Variational Continual Learning
‒ (diagram) Online variational inference + Monte Carlo VI for neural networks → VCL
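Written out, the update at the heart of this fusion is the standard online-VI projection: the previous approximate posterior is multiplied by the new task's likelihood and projected back into the approximating family. The notation $\mathcal{D}_t$ for task-$t$ data is introduced here for brevity; it does not appear on the slides.

$$q_t(\theta) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathrm{KL}\!\left( q(\theta) \,\Big\|\, \frac{1}{Z_t}\, q_{t-1}(\theta)\, p\big(\mathcal{D}_t \mid \theta\big) \right), \qquad q_0(\theta) = p(\theta \mid \alpha)$$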
Solutions
• VCL fulfils all of the expectations of continual learning listed earlier:
‒ Online learning: learning occurs at every moment
‒ Presence of transfer: able to transfer from previous tasks to new ones
‒ Resistance to catastrophic forgetting
‒ No direct access to previous experience
Variational Continual Learning
Variational Continual Learning
• Ideas:
‒ Fusing online variational inference with a sampling-based method (1)
‒ Use of a coreset to deal with the catastrophic forgetting problem (2)
• Both ideas have been investigated in the Bayesian literature
• (2) has recently been investigated in continual learning
→ The authors are the first to investigate the effectiveness of idea (1) for continual learning.
Use of coreset
• In order to mitigate the semantic drift problem, VCL includes a small representative set of data from previously observed tasks.
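A minimal sketch of how such a coreset could be maintained with random selection (the paper also considers a K-center heuristic); function and variable names here are illustrative, not the authors' code.

```python
import numpy as np

def update_coreset(coreset_x, coreset_y, task_x, task_y, k=200, seed=0):
    """Move k randomly chosen points of the current task into the coreset and
    return (new_coreset_x, new_coreset_y, rest_x, rest_y); the 'rest' split is
    the non-coreset data used for the main variational update."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(task_x), size=k, replace=False)
    mask = np.ones(len(task_x), dtype=bool)
    mask[idx] = False
    new_cx = task_x[idx] if coreset_x is None else np.concatenate([coreset_x, task_x[idx]])
    new_cy = task_y[idx] if coreset_y is None else np.concatenate([coreset_y, task_y[idx]])
    return new_cx, new_cy, task_x[mask], task_y[mask]
```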
Approximate inference for discriminative CL
• Setting: online multi-task learning / transfer learning -- a sequence of tasks, with later tasks trained without revisiting earlier data
Approximate inference for discriminative CL
TASK 1:
$$p\big(Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha)$$
Approximate inference for discriminative CL
TASK 1:
$$p\big(Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha) \;\approx\; q_1^*(\theta)$$
Analytically integrating out $\theta$ (not possible in general) gives the normalizer:
$$p\big(Y^{(1)} \mid \alpha\big) \;\approx\; Z_1 = \int q_1^*(\theta)\, d\theta$$
and rescaling gives the approximate posterior:
$$p\big(\theta \mid Y^{(1)}, \alpha\big) \;\approx\; q_1(\theta) = \frac{q_1^*(\theta)}{Z_1}$$
Approximate inference for discriminative CL
TASK 1 (as before):
$$p\big(Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha) \;\approx\; q_1^*(\theta), \qquad p\big(Y^{(1)} \mid \alpha\big) \;\approx\; Z_1 = \int q_1^*(\theta)\, d\theta, \qquad p\big(\theta \mid Y^{(1)}, \alpha\big) \;\approx\; q_1(\theta) = \frac{q_1^*(\theta)}{Z_1}$$
TASK 2:
$$p\big(Y^{(2)}, Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{m=1}^{M} p\big(y_m^{(2)} \mid \theta, \alpha\big)\Big] \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha) \;\approx\; \Big[\prod_{m=1}^{M} p\big(y_m^{(2)} \mid \theta, \alpha\big)\Big]\, q_1^*(\theta) \;\approx\; q_2^*(\theta)$$
Substituting $q_1^*(\theta)$ for the task-1 likelihood and prior prevents the need to access $Y^{(1)}$ when learning task 2.
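The same substitution works at every step: the previous approximation plays the role of the prior for the next task. Schematically, consistent with the Task 1 / Task 2 derivation above (the arrow denotes projection back into the approximating family):

$$p\big(\theta \mid Y^{(1:t)}, \alpha\big) \;\propto\; p\big(Y^{(t)} \mid \theta, \alpha\big)\, p\big(\theta \mid Y^{(1:t-1)}, \alpha\big) \;\approx\; p\big(Y^{(t)} \mid \theta, \alpha\big)\, q_{t-1}(\theta) \;\longrightarrow\; q_t(\theta)$$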
Approximate inference for discriminative CL
• (table) Batch approximate-inference schemes → their online variants → their neural-network instantiations
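A minimal sketch of what one such online variational update could look like for a neural network with a mean-field Gaussian posterior, in the spirit of VCL (Bayes-by-backprop-style reparameterisation). All names such as `log_lik_fn`, `prev_mu` and `prev_logvar` are illustrative, not the authors' code.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over all parameters."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()

def negative_online_elbo(mu, logvar, prev_mu, prev_logvar, x, y,
                         log_lik_fn, n_data, n_samples=5):
    """Stochastic estimate of the negative online variational free energy:
    expected log-likelihood on the current task minus the KL back to the
    previous approximate posterior, which acts as the prior for this task."""
    exp_log_lik = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)
        theta = mu + (0.5 * logvar).exp() * eps      # reparameterised weight sample
        exp_log_lik = exp_log_lik + log_lik_fn(theta, x, y)
    exp_log_lik = (n_data / len(x)) * exp_log_lik / n_samples  # rescale minibatch term
    elbo = exp_log_lik - gaussian_kl(mu, logvar, prev_mu, prev_logvar)
    return -elbo  # minimise with an optimiser such as Adam over (mu, logvar)
```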
Experiments
Approximate inference for discriminative CL
• Permuted MNIST
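For reference, a minimal sketch of how Permuted MNIST tasks are commonly constructed (an assumption about the standard protocol, not the authors' code): each task applies one fixed random pixel permutation to every image.

```python
import numpy as np

def make_permuted_tasks(x, y, n_tasks=10, seed=0):
    """x: (N, 784) flattened MNIST images, y: labels.
    Returns a list of (x_permuted, y) pairs, one per task; task 0 is unpermuted."""
    rng = np.random.default_rng(seed)
    tasks = []
    for t in range(n_tasks):
        perm = np.arange(x.shape[1]) if t == 0 else rng.permutation(x.shape[1])
        tasks.append((x[:, perm], y))
    return tasks
```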
Approximate inference for discriminative CL
• Split MNIST
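Similarly, a minimal sketch of the Split MNIST protocol as typically used (five binary tasks, one output head per task); illustrative code, not the authors'.

```python
import numpy as np

def make_split_tasks(x, y):
    """Five binary tasks: 0 vs 1, 2 vs 3, ..., 8 vs 9, labels remapped to {0, 1}.
    In the multi-head setup, each task gets its own output head."""
    pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
    tasks = []
    for a, b in pairs:
        mask = (y == a) | (y == b)
        tasks.append((x[mask], (y[mask] == b).astype(np.int64)))
    return tasks
```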
Conclusion
Conclusion
• Continual learning is naturally handled by Bayesian inference: it allows multi-task transfer and avoids catastrophic forgetting
• VCL is a state-of-the-art continual learning method, alongside Synaptic Intelligence
• By performing variational inference in an online manner, one can address continual learning while accounting for model uncertainty
Quizzes & Discussion
Quizzes
1) While working on task 1, we use data from task 1 to solve the problem of task 2; could we use additional data from task 2 to get a better result, and why?
Discussion
1) What is the trade-off when the size of the coreset increases?
2) Will this work on large-scale computer vision applications?
Thank you!
Editor's Notes
  • #2: Hello everyone, my name is Giang. The paper today is Variational Continual Learning by Cuong Nguyen et al.
  • #3: Today's presentation consists of 4 main parts: Introduction, VCL, Experiments, and Conclusion.
  • #4: First, we will briefly go over the introduction.
  • #5: Continual learning (CL) is the ability of a model to learn continually from a stream of data, building on what was learnt previously, hence exhibiting positive transfer, as well as being able to remember previously seen tasks. (Positive transfer is the improvement or embellishment of current knowledge through the gain of additional information or education. Typically this occurs when performance on a task improves as a result of performing a different but related task. It is essentially using the building blocks of previous knowledge to learn more -- by learning something similar but different you can strengthen your previous skills as well.)
  • #6: When it comes to continual learning, we have some expectations: * Online learning -- learning occurs at every moment, with no fixed tasks or data sets and no clear boundaries between tasks; * Presence of transfer (forward/backward) -- the model should be able to transfer from previously seen data or tasks to new ones, and new tasks may also help improve performance on older ones; * Resistance to catastrophic forgetting -- new learning does not destroy performance on previously seen data; * No direct access to previous experience -- while the model can remember a limited amount of experience, a continual learning algorithm should not have direct access to past tasks or be able to rewind the environment.
  • #7: But challenges remain in CL. We need a balance between adapting to recent data and retaining knowledge from old data. The authors state that too much plasticity leads to the catastrophic forgetting problem, and too much stability leads to an inability to adapt.
  • #8: One solution is IMM (incremental moment matching). The authors state that the moments of the posterior distributions are matched in an incremental way, and various transfer-learning techniques are introduced.
  • #9: Another solution is EWC (Elastic Weight Consolidation). EWC protects performance on task A by constraining the parameters to stay in a region of low error for task A, around φ*_A, trying to solve catastrophic forgetting by constraining important parameters to stay close to their old values.
  • #10: The authors of this paper came up with a solution fusing online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. Monte Carlo VI is an algorithm for learning neural networks with uncertainty on the weights; online variational inference deals with data arriving in an online fashion.
  • #11: Using this solution, all the expectations could be fulfilled.
  • #12: OK, so now let's see how variational continual learning works.
  • #13: The authors present two ideas for continual learning in this paper: (1) the combination of online variational inference and a sampling method, and (2) the use of a coreset to deal with the catastrophic forgetting problem. Both ideas have been investigated in the Bayesian literature, while the second idea has recently been investigated in continual learning. Therefore, the authors seem to be the first to investigate the effectiveness of the first idea for continual learning.
  • #14: For each task, the new coreset Ct is produced by selecting new data points from the current task and a selection from the old coreset Ct−1. Then we update the variational distribution for the non-coreset data points, meaning the dataset with the coreset points removed. Next, we compute the final variational distribution; here the distribution is computed in the presence of the coreset Ct. Lastly, we perform prediction on test inputs to see the effect of the coreset.
  • #15: To explain the other idea, the scenario here is that we have one task followed by another. We do some training for the first task, and then for the second task we want to train on information from the second task without revisiting all of the first task's data.
  • #16: We will figure out how to do that using Bayesian inference. Here I suppress all the inputs, so there is no input x on this slide; everything is implicitly conditioned on the input. I just want to focus on Y, a collection of outputs for our current task. Theta is the parameter vector of the network and the superscript indicates we are in task 1. Because this is Bayesian inference, we put a prior on our weight parameters (Monte Carlo VI), and alpha denotes some hyperparameters. So here is task 1 and its joint distribution. Approximating the joint distribution with the data fixed at the observed values is essentially the central goal of approximate inference.
  • #17: Why do I say that? Well, if I took the joint distribution and could analytically integrate it (shown in red because you cannot generally do this), I would get p(Y | alpha), which is what we need. And if I divide the joint distribution by that scalar quantity to normalize it, I get the posterior distribution. So if I approximate the joint distribution with q1* using some tractable family such as a Gaussian, I can integrate it over theta (tractable because it is Gaussian) to get an estimate of the normalizing constant, and then rescale by that to get an approximation of the posterior.
  • #18: So here is task 1; we have done our approximate inference for task 1, and notice that the prior is obviously tractable here. There are various options for the approximation, which I will talk about later. Now imagine we get to task 2. Here we write the joint distribution again, which involves the prior, the data from task 1 and the data from task 2: N data points in the first dataset and M data points in the second. Notice that this chunk of the new joint distribution is exactly what we had for the first task. So I can just plug the approximation q1* into that term of the joint distribution; it approximates what the first task and the prior told us about the parameters. This is really nice because it prevents us from needing to access the first task's data when we are at the second task: we only need the previous approximation and the current task's likelihood, so we can do incremental CL updates without revisiting the old data. The result is of course still intractable, so we do another approximate inference step to get q2*, on which we can recurse again. So this is a sort of online approximate inference, using the previous approximate posterior as the prior for the next task and recursing.
  • #19: OK, I will go through this very quickly, but suffice to say there are a bunch of different approximate inference schemes in the literature; the Laplace approximation, the variational free-energy method, moment matching and importance sampling are the four I picked out here. If we transfer them to the online setting, we can plug them into the equations on the previous slide, and we get the online Laplace approximation, online VB, assumed density filtering and sequential Monte Carlo when we apply exactly that operation to those four schemes. What's more, we can then apply these schemes to neural networks. Last year DeepMind came up with what they call EWC (Elastic Weight Consolidation), which is essentially applying the Laplace approximation to Bayesian neural networks. There were many works before, but the variational extension was missing. We know variational methods work reasonably well for Bayesian NNs; they are often better than Laplace. So it seems sensible to try applying online VI, where we use KL(q||p) to carry out the projection step shown on the previous slide, and see how it compares to the other versions. This is what we did. I should say that these two algorithms had not been applied in the online setting before; they were developed for the batch setting, although they use something that can be directly applied to the online setting.
  • #20: Now we jump into the experiments
  • #21: OK, we will quickly show results on two tasks. The first deals with a covariate-shift example. These are standard benchmark tasks for CL used by previous methods. This one is called the Permuted MNIST task: task one consists of just classifying MNIST; task two applies a fixed permutation to the pixels of every image, and you have to categorize them again as zero through nine, so the statistics have changed according to the permutation; task three applies another random permutation, and you do it again. In the end, you have to learn a single network with a single head that classifies ones and all scrambled versions of ones as 1, and so on; its statistics change over time. Here are the previous state-of-the-art methods, EWC (which I mentioned) and SI; both are considerably better than early stopping on a NN. Here is the variational version, VCL, and here is an enhancement where you are allowed to keep a few chosen data points as an extra memory that you propagate with you -- a small episodic memory combined with the approximation you propagate forward. Just to show that the memory alone does not do very much, here is what happens when keeping only a small number of data points as a memory: it performs at about 65%.
  • #22: The second benchmark is called Split MNIST. Here is how the benchmark works: we have a bunch of different tasks and a different head of our network for each task, with a common body shared by the network. In the first task we classify 0 vs 1 in MNIST, in task two we classify 2 vs 3, then 4 vs 5, and so on. The hope is that we can leverage the features learned in the first task to do better on later tasks. Again, here are EWC and SI; SI is really good on this task, which is interesting, while EWC does not perform well. VCL does pretty well, dropping off towards the end but still pretty good. If you add the memory you can close the gap a little, and this again shows that the memory does not work well by itself; you need to combine it with the propagation. I think two interesting things come out of this: (a) Laplace is not so great for these tasks and going variational helps; (b) maybe we can get to the bottom of the SI paper, find out what it is doing in terms of approximate inference, and perhaps there is a new approximate inference method in there that performs well in general.
  • #23: Now let me conclude this paper in one slide.
  • #24: First, Second, Finally,
  • #25: Here I have prepared some quizzes and discussion points. Alright!
  • #26: You would probably get better results; the reason could be that the model is capacity-limited, but we can hope that it learns better features with a fixed capacity.
  • #27: I think coreset VCL is equivalent to a message-passing implementation of variational inference. Increasing the size of the coreset can help the model learn better from previous tasks, but it can also make learning and the distribution update more difficult. I think a method that chooses the coreset more selectively would work better than the naïve selection methods in the paper (K-center and random). At large scale, the number of features and parameters is very large, so the approximation may not be effective => the method becomes more intensive on large-scale tasks.