Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, Richard E. Turner
Variational continual learning
Presenter: Giang Nguyen
KAIST, May 2019
Contents
1. Introduction
2. Variational Continual Learning
3. Experiments
4. Conclusion
Introduction
Continual learning
• Expectations of CL
‒ Online learning: learning occurs at every moment
‒ Presence of transfer: able to transfer from previous tasks to new ones
‒ Resistance to catastrophic forgetting
‒ No direct access to previous experience
Challenge for Continual Learning
• We need a balance between adapting to recent data and retaining knowledge from old data because:
‒ Too much plasticity leads to the catastrophic forgetting problem
‒ Too much stability leads to an inability to adapt
Solutions
• IMM (Incremental Moment Matching): one approach trains individual models on each task and then carries out a second stage of training to combine them
Lee, Sang-Woo, et al. "Overcoming Catastrophic Forgetting by Incremental Moment Matching." NIPS 2017.
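For intuition, a minimal sketch of the simplest IMM variant (mean-IMM), which merges the per-task models by a weighted average of their parameters. This is an illustrative reading of the approach, not the authors' code; all names are made up.

```python
def mean_imm(task_params, weights=None):
    """Mean-IMM sketch: task_params is a list of dicts {param_name: array},
    one trained model per task; returns a weighted average of the parameters."""
    if weights is None:
        weights = [1.0 / len(task_params)] * len(task_params)
    merged = {}
    for name in task_params[0]:
        merged[name] = sum(w * p[name] for w, p in zip(weights, task_params))
    return merged

# Usage (hypothetical): merged = mean_imm([params_task1, params_task2])
```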
Solutions
• EWC (Elastic Weight Consolidation): a more elegant and more flexible approach that maintains a single model and uses a single type of regularized training, preventing drastic changes in the parameters that have a large influence on prediction while allowing other parameters to change more freely
Kirkpatrick, James, et al. "Overcoming Catastrophic Forgetting in Neural Networks." PNAS 2017.
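For intuition, a minimal sketch of an EWC-style quadratic penalty: a per-parameter importance weight (e.g. a diagonal Fisher estimate) pulls important weights toward their previous values. `old_params`, `fisher_diag` and `lam` are illustrative names, not from the paper.

```python
import torch

def ewc_penalty(model, old_params, fisher_diag, lam=100.0):
    """Quadratic penalty keeping parameters deemed important for the previous
    task (large diagonal Fisher) close to their old values.
    old_params / fisher_diag: dicts {param_name: tensor} saved after the
    previous task (hypothetical bookkeeping, not the paper's code)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher_diag:
            penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage (hypothetical): loss = task_loss + ewc_penalty(model, old_params, fisher_diag)
```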
Solutions
• Variational Continual Learning
‒ (diagram) Online variational inference + Monte Carlo VI for neural networks → VCL
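Written out, the update at the heart of this fusion is the standard online-VI projection: the previous approximate posterior is multiplied by the new task's likelihood and projected back into the approximating family. The notation $\mathcal{D}_t$ for task-$t$ data is introduced here for brevity; it does not appear on the slides.

$$q_t(\theta) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathrm{KL}\!\left( q(\theta) \,\Big\|\, \frac{1}{Z_t}\, q_{t-1}(\theta)\, p\big(\mathcal{D}_t \mid \theta\big) \right), \qquad q_0(\theta) = p(\theta \mid \alpha)$$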
Solutions
• VCL fulfils all of the expectations of continual learning listed earlier:
‒ Online learning: learning occurs at every moment
‒ Presence of transfer: able to transfer from previous tasks to new ones
‒ Resistance to catastrophic forgetting
‒ No direct access to previous experience
Variational Continual Learning
Variational Continual Learning
• Ideas:
‒ Fusing online variational inference with a sampling-based method (1)
‒ Use of a coreset to deal with the catastrophic forgetting problem (2)
• Both ideas have been investigated in the Bayesian literature
• (2) has recently been investigated in continual learning
→ The authors are the first to investigate the effectiveness of idea (1) for continual learning.
Use of coreset
• In order to mitigate the semantic drift problem, VCL includes a small representative set of data from previously observed tasks.
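A minimal sketch of how such a coreset could be maintained with random selection (the paper also considers a K-center heuristic); function and variable names here are illustrative, not the authors' code.

```python
import numpy as np

def update_coreset(coreset_x, coreset_y, task_x, task_y, k=200, seed=0):
    """Move k randomly chosen points of the current task into the coreset and
    return (new_coreset_x, new_coreset_y, rest_x, rest_y); the 'rest' split is
    the non-coreset data used for the main variational update."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(task_x), size=k, replace=False)
    mask = np.ones(len(task_x), dtype=bool)
    mask[idx] = False
    new_cx = task_x[idx] if coreset_x is None else np.concatenate([coreset_x, task_x[idx]])
    new_cy = task_y[idx] if coreset_y is None else np.concatenate([coreset_y, task_y[idx]])
    return new_cx, new_cy, task_x[mask], task_y[mask]
```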
Approximate inference for discriminative CL
• Setting: online multi-task learning / transfer learning -- a sequence of tasks, with later tasks trained without revisiting earlier data
Approximate inference for discriminative CL
TASK 1:
$$p\big(Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha)$$
Approximate inference for discriminative CL
TASK 1:
$$p\big(Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha) \;\approx\; q_1^*(\theta)$$
Analytically integrating out $\theta$ (not possible in general) gives the normalizer:
$$p\big(Y^{(1)} \mid \alpha\big) \;\approx\; Z_1 = \int q_1^*(\theta)\, d\theta$$
and rescaling gives the approximate posterior:
$$p\big(\theta \mid Y^{(1)}, \alpha\big) \;\approx\; q_1(\theta) = \frac{q_1^*(\theta)}{Z_1}$$
Approximate inference for discriminative CL
TASK 1 (as before):
$$p\big(Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha) \;\approx\; q_1^*(\theta), \qquad p\big(Y^{(1)} \mid \alpha\big) \;\approx\; Z_1 = \int q_1^*(\theta)\, d\theta, \qquad p\big(\theta \mid Y^{(1)}, \alpha\big) \;\approx\; q_1(\theta) = \frac{q_1^*(\theta)}{Z_1}$$
TASK 2:
$$p\big(Y^{(2)}, Y^{(1)}, \theta \mid \alpha\big) \;=\; \Big[\prod_{m=1}^{M} p\big(y_m^{(2)} \mid \theta, \alpha\big)\Big] \Big[\prod_{n=1}^{N} p\big(y_n^{(1)} \mid \theta, \alpha\big)\Big]\, p(\theta \mid \alpha) \;\approx\; \Big[\prod_{m=1}^{M} p\big(y_m^{(2)} \mid \theta, \alpha\big)\Big]\, q_1^*(\theta) \;\approx\; q_2^*(\theta)$$
Substituting $q_1^*(\theta)$ for the task-1 likelihood and prior prevents the need to access $Y^{(1)}$ when learning task 2.
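The same substitution works at every step: the previous approximation plays the role of the prior for the next task. Schematically, consistent with the Task 1 / Task 2 derivation above (the arrow denotes projection back into the approximating family):

$$p\big(\theta \mid Y^{(1:t)}, \alpha\big) \;\propto\; p\big(Y^{(t)} \mid \theta, \alpha\big)\, p\big(\theta \mid Y^{(1:t-1)}, \alpha\big) \;\approx\; p\big(Y^{(t)} \mid \theta, \alpha\big)\, q_{t-1}(\theta) \;\longrightarrow\; q_t(\theta)$$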
Approximate inference for discriminative CL
• (table) Batch approximate-inference schemes → their online variants → their neural-network instantiations
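A minimal sketch of what one such online variational update could look like for a neural network with a mean-field Gaussian posterior, in the spirit of VCL (Bayes-by-backprop-style reparameterisation). All names such as `log_lik_fn`, `prev_mu` and `prev_logvar` are illustrative, not the authors' code.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over all parameters."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()

def negative_online_elbo(mu, logvar, prev_mu, prev_logvar, x, y,
                         log_lik_fn, n_data, n_samples=5):
    """Stochastic estimate of the negative online variational free energy:
    expected log-likelihood on the current task minus the KL back to the
    previous approximate posterior, which acts as the prior for this task."""
    exp_log_lik = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)
        theta = mu + (0.5 * logvar).exp() * eps      # reparameterised weight sample
        exp_log_lik = exp_log_lik + log_lik_fn(theta, x, y)
    exp_log_lik = (n_data / len(x)) * exp_log_lik / n_samples  # rescale minibatch term
    elbo = exp_log_lik - gaussian_kl(mu, logvar, prev_mu, prev_logvar)
    return -elbo  # minimise with an optimiser such as Adam over (mu, logvar)
```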
Experiments
Approximate inference for discriminative CL
• Permuted MNIST
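For reference, a minimal sketch of how Permuted MNIST tasks are commonly constructed (an assumption about the standard protocol, not the authors' code): each task applies one fixed random pixel permutation to every image.

```python
import numpy as np

def make_permuted_tasks(x, y, n_tasks=10, seed=0):
    """x: (N, 784) flattened MNIST images, y: labels.
    Returns a list of (x_permuted, y) pairs, one per task; task 0 is unpermuted."""
    rng = np.random.default_rng(seed)
    tasks = []
    for t in range(n_tasks):
        perm = np.arange(x.shape[1]) if t == 0 else rng.permutation(x.shape[1])
        tasks.append((x[:, perm], y))
    return tasks
```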
Approximate inference for discriminative CL
• Split MNIST
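Similarly, a minimal sketch of the Split MNIST protocol as typically used (five binary tasks, one output head per task); illustrative code, not the authors'.

```python
import numpy as np

def make_split_tasks(x, y):
    """Five binary tasks: 0 vs 1, 2 vs 3, ..., 8 vs 9, labels remapped to {0, 1}.
    In the multi-head setup, each task gets its own output head."""
    pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
    tasks = []
    for a, b in pairs:
        mask = (y == a) | (y == b)
        tasks.append((x[mask], (y[mask] == b).astype(np.int64)))
    return tasks
```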
Conclusion
Conclusion
• Continual learning is naturally handled by Bayesian inference: it allows multi-task transfer and avoids catastrophic forgetting
• VCL is a state-of-the-art continual learning method, alongside Synaptic Intelligence
• By performing variational inference in an online manner, one can address continual learning while accounting for model uncertainty
Quizzes & Discussion
Quizzes
1) While working on task 1, we use data from task 1 to solve the problem of task 2; could we use additional data from task 2 to get a better result, and why?
Discussion
1) What is the trade-off when the size of the coreset increases?
2) Will this work on large-scale computer vision applications?
Thank you!
Editor's Notes
  • #2: Hello everyone, my name is Giang. The paper today is Variational Continual Learning by Cuong Nguyen et al.
  • #3: Today's presentation consists of 4 main parts: Introduction, VCL, Experiments, and Conclusion.
  • #4: First, we will briefly go over the introduction.
  • #5: Continual learning (CL) is the ability of a model to learn continually from a stream of data, building on what was learnt previously, hence exhibiting positive transfer, as well as being able to remember previously seen tasks. (Positive transfer is the improvement or embellishment of current knowledge through the gain of additional information or education. Typically this occurs when performance on a task improves as a result of performing a different but related task. It is essentially using the building blocks of previous knowledge to learn more -- by learning something similar but different you can strengthen your previous skills as well.)
  • #6: When it comes to continual learning, we have some expectations: * Online learning -- learning occurs at every moment, with no fixed tasks or data sets and no clear boundaries between tasks; * Presence of transfer (forward/backward) -- the model should be able to transfer from previously seen data or tasks to new ones, and new tasks may also help improve performance on older ones; * Resistance to catastrophic forgetting -- new learning does not destroy performance on previously seen data; * No direct access to previous experience -- while the model can remember a limited amount of experience, a continual learning algorithm should not have direct access to past tasks or be able to rewind the environment.
  • #7: But challenges remain in CL. We need a balance between adapting to recent data and retaining knowledge from old data. The authors state that too much plasticity leads to the catastrophic forgetting problem, and too much stability leads to an inability to adapt.
  • #8: One solution is IMM (incremental moment matching). The authors state that the moments of the posterior distributions are matched in an incremental way, and various transfer-learning techniques are introduced.
  • #9: Another solution is EWC (Elastic Weight Consolidation). EWC protects performance on task A by constraining the parameters to stay in a region of low error for task A, around φ*_A, trying to solve catastrophic forgetting by constraining important parameters to stay close to their old values.
  • #10: The authors of this paper came up with a solution fusing online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. Monte Carlo VI is an algorithm for learning neural networks with uncertainty on the weights; online variational inference deals with data arriving in an online fashion.
  • #11: Using this solution, all the expectations could be fulfilled.
  • #12: OK, so now let's see how variational continual learning works.
  • #13: The authors present two ideas for continual learning in this paper: (1) the combination of online variational inference and a sampling method, and (2) the use of a coreset to deal with the catastrophic forgetting problem. Both ideas have been investigated in the Bayesian literature, while the second idea has recently been investigated in continual learning. Therefore, the authors seem to be the first to investigate the effectiveness of the first idea for continual learning.
  • #14: For each task, the new coreset Ct is produced by selecting new data points from the current task and a selection from the old coreset Ct−1. Then we update the variational distribution for the non-coreset data points, meaning the dataset with the coreset points removed. Next, we compute the final variational distribution; here the distribution is computed in the presence of the coreset Ct. Lastly, we perform prediction on test inputs to see the effect of the coreset.
  • #15: To explain the other idea, the scenario here is that we have one task followed by another. We do some training for the first task, and then for the second task we want to train on information from the second task without revisiting all of the first task's data.
  • #16: We will figure out how to do that using Bayesian inference. Here I suppress all the inputs, so there is no input x on this slide; everything is implicitly conditioned on the input. I just want to focus on Y, a collection of outputs for our current task. Theta is the parameter vector of the network and the superscript indicates we are in task 1. Because this is Bayesian inference, we put a prior on our weight parameters (Monte Carlo VI), and alpha denotes some hyperparameters. So here is task 1 and its joint distribution. Approximating the joint distribution with the data fixed at the observed values is essentially the central goal of approximate inference.
  • #17: Why do I say that? Well, if I took the joint distribution and could analytically integrate it (shown in red because you cannot generally do this), I would get p(Y | alpha), which is what we need. And if I divide the joint distribution by that scalar quantity to normalize it, I get the posterior distribution. So if I approximate the joint distribution with q1* using some tractable family such as a Gaussian, I can integrate it over theta (tractable because it is Gaussian) to get an estimate of the normalizing constant, and then rescale by that to get an approximation of the posterior.
  • #18: So here is task 1; we have done our approximate inference for task 1, and notice that the prior is obviously tractable here. There are various options for the approximation, which I will talk about later. Now imagine we get to task 2. Here we write the joint distribution again, which involves the prior, the data from task 1 and the data from task 2: N data points in the first dataset and M data points in the second. Notice that this chunk of the new joint distribution is exactly what we had for the first task. So I can just plug the approximation q1* into that term of the joint distribution; it approximates what the first task and the prior told us about the parameters. This is really nice because it prevents us from needing to access the first task's data when we are at the second task: we only need the previous approximation and the current task's likelihood, so we can do incremental CL updates without revisiting the old data. The result is of course still intractable, so we do another approximate inference step to get q2*, on which we can recurse again. So this is a sort of online approximate inference, using the previous approximate posterior as the prior for the next task and recursing.
  • #19: OK, I will go through this very quickly, but suffice to say there are a bunch of different approximate inference schemes in the literature; the Laplace approximation, the variational free-energy method, moment matching and importance sampling are the four I picked out here. If we transfer them to the online setting, we can plug them into the equations on the previous slide, and we get the online Laplace approximation, online VB, assumed density filtering and sequential Monte Carlo when we apply exactly that operation to those four schemes. What's more, we can then apply these schemes to neural networks. Last year DeepMind came up with what they call EWC (Elastic Weight Consolidation), which is essentially applying the Laplace approximation to Bayesian neural networks. There were many works before, but the variational extension was missing. We know variational methods work reasonably well for Bayesian NNs; they are often better than Laplace. So it seems sensible to try applying online VI, where we use KL(q||p) to carry out the projection step shown on the previous slide, and see how it compares to the other versions. This is what we did. I should say that these two algorithms had not been applied in the online setting before; they were developed for the batch setting, although they use something that can be directly applied to the online setting.
  • #20: Now we jump into the experiments
  • #21: OK, we will quickly show results on two tasks. The first deals with a covariate-shift example. These are standard benchmark tasks for CL used by previous methods. This one is called the Permuted MNIST task: task one consists of just classifying MNIST; task two applies a fixed permutation to the pixels of every image, and you have to categorize them again as zero through nine, so the statistics have changed according to the permutation; task three applies another random permutation, and you do it again. In the end, you have to learn a single network with a single head that classifies ones and all scrambled versions of ones as 1, and so on; its statistics change over time. Here are the previous state-of-the-art methods, EWC (which I mentioned) and SI; both are considerably better than early stopping on a NN. Here is the variational version, VCL, and here is an enhancement where you are allowed to keep a few chosen data points as an extra memory that you propagate with you -- a small episodic memory combined with the approximation you propagate forward. Just to show that the memory alone does not do very much, here is what happens when keeping only a small number of data points as a memory: it performs at about 65%.
  • #22: The second benchmark is called Split MNIST. Here is how the benchmark works: we have a bunch of different tasks and a different head of our network for each task, with a common body shared by the network. In the first task we classify 0 vs 1 in MNIST, in task two we classify 2 vs 3, then 4 vs 5, and so on. The hope is that we can leverage the features learned in the first task to do better on later tasks. Again, here are EWC and SI; SI is really good on this task, which is interesting, while EWC does not perform well. VCL does pretty well, dropping off towards the end but still pretty good. If you add the memory you can close the gap a little, and this again shows that the memory does not work well by itself; you need to combine it with the propagation. I think two interesting things come out of this: (a) Laplace is not so great for these tasks and going variational helps; (b) maybe we can get to the bottom of the SI paper, find out what it is doing in terms of approximate inference, and perhaps there is a new approximate inference method in there that performs well in general.
  • #23: Now let me conclude this paper in one slide.
  • #24: First, Second, Finally,
  • #25: Here I have prepared some quizzes and discussion points. Alright!
  • #26: You would probably get better results; the reason could be that the model is capacity-limited, but we can hope that it learns better features with a fixed capacity.
  • #27: I think coreset VCL is equivalent to a message-passing implementation of variational inference. Increasing the size of the coreset can help the model learn better from previous tasks, but it can also make learning and the distribution update more difficult. I think a method that chooses the coreset more selectively would work better than the naïve selection methods in the paper (K-center and random). At large scale, the number of features and parameters is very large, so the approximation may not be effective => the method becomes more intensive on large-scale tasks.