Scalable and Order-robust Continual Learning
with Additive Parameter Decomposition
Jaehong Yoon¹, Saehoon Kim², Eunho Yang¹,², and Sung Ju Hwang¹,²
KAIST1, AITRICS2
Continual Learning of a Machine
Continual learning is often formulated as incremental or online multi-task learning,
where complex task-to-task relationships are modeled through the weights of a neural network.
[Figure: a learning model receives tasks ..., t-2, t-1, t one after another and accumulates learned knowledge.]
1) Tasks are received in a sequential order.
2) Knowledge is transferred from previously learned tasks.
3) New knowledge is stored for future use.
4) Existing knowledge is refined.
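As a rough illustration of this protocol, the loop below trains on tasks strictly in arrival order without revisiting earlier data; `model`, `tasks`, `train_one_task`, and `evaluate` are hypothetical placeholders, not code from the paper.

```python
def continual_learning(model, tasks, train_one_task, evaluate):
    """Minimal sketch of the sequential protocol: tasks arrive one by one,
    earlier data is not revisited, and all knowledge lives in the shared model."""
    results = []
    for dataset in tasks:                         # 1) tasks arrive in sequential order
        train_one_task(model, dataset)            # 2) + 4) transfer and refine knowledge in the shared weights
        results.append(evaluate(model, dataset))  # 3) newly acquired knowledge is kept for future tasks
    return results
```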
Challenges: Catastrophic Forgetting
Introducing new tasks can result in semantic drift, or catastrophic forgetting,
where the original meaning of the learned features changes as they are fit to later tasks.
Jaehong Yoon et al., “Scalable and Order-robust Continual Learning with Additive Parameter Decomposition”, ICLR 2020.
[Figure: layer weights W₁ and W₂ are overwritten as a new task is added.]
Challenges: Scalability
Even with well-defined regularizers, it is very hard to completely avoid catastrophic
forgetting, since in practice the model may encounter an unlimited number of tasks.
[Figure: from toy-sized continual learning to large-scale continual learning over many tasks.]
A continual learning model therefore needs to guarantee scalability to a large number of
tasks, in terms of both memory usage and training time.
Challenges: Task-order Sensitivity
[Figure: a disease classification model trained under two different task orders (Order A vs. Order B) produces different results for the same input.]
The order in which tasks are presented has a large impact on the resulting continual learning model,
because knowledge transfer is unidirectional, from earlier tasks to later ones.
Additive Parameter Decomposition (APD)
Conceptually, our model, APD, additively decomposes the model parameters
into task-shared parameters (σ) and highly sparse task-adaptive parameters (τ).
Further, we periodically regroup the task-adaptive parameters to obtain hierarchically
shared parameters, exploiting the varying degrees of knowledge sharing across tasks.
[Figure: the models ℳ₁:ₜ for tasks 1..t are built from the shared parameters σ plus the sparse task-adaptive parameters τ₁:ₜ.]
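The decomposition itself can be sketched in a few lines of PyTorch for a single fully connected layer; the layer shapes, the sigmoid mask over output units, and the variable names are illustrative assumptions rather than the paper's released implementation.

```python
import torch

out_dim, in_dim = 64, 32
sigma = torch.randn(out_dim, in_dim, requires_grad=True)   # task-shared parameters (one copy for all tasks)
tau_t = torch.zeros(out_dim, in_dim, requires_grad=True)   # sparse task-adaptive parameters for task t
v_t   = torch.zeros(out_dim, requires_grad=True)           # per-task mask logits

m_t = torch.sigmoid(v_t).unsqueeze(1)   # soft mask over output units, broadcast over the input dimension
theta_t = sigma * m_t + tau_t           # effective weights for task t: sigma ⊗ m_t + tau_t
```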
Task-order Robust (Reliable) Continual Learning
This is achieved by the following mechanisms:
• Decomposition of parameters into task-shared and task-adaptive parts.
• Sparsity-inducing regularization on the task-adaptive parameters.
The objective for task t is

$$\underset{\boldsymbol{\sigma},\ \boldsymbol{\tau}_{1:t},\ \boldsymbol{v}_{1:t}}{\text{minimize}}\ \ \mathcal{L}\big(\boldsymbol{\sigma}\otimes\boldsymbol{m}_t+\boldsymbol{\tau}_t;\ \mathcal{D}_t\big)+\lambda_1\sum_{i=1}^{t}\lVert\boldsymbol{\tau}_i\rVert_1+\lambda_2\sum_{i=1}^{t-1}\big\lVert\boldsymbol{\theta}_i^{*}-(\boldsymbol{\sigma}\otimes\boldsymbol{m}_i+\boldsymbol{\tau}_i)\big\rVert_2^2$$
where $\boldsymbol{\theta}_i^{*}$ denotes the approximated solution of previous task $i$, and the mask $\boldsymbol{m}_i$ is generated from the per-task parameters $\boldsymbol{v}_i$.
• The retroactive update of the previous task-adaptive parameters, which reflects changes in the
task-shared parameters, prevents earlier task solutions from drifting away from their original values.
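A hedged sketch of this objective for a single layer follows; the cross-entropy task loss, the λ values, and the container types (lists of per-task tensors) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def apd_objective(logits, targets, sigma, masks, taus, theta_star, t,
                  lam1=1e-4, lam2=1.0):
    """Sketch of the APD loss at task t (0-indexed).

    masks[i]: sigmoid mask m_i broadcast to sigma's shape.
    taus[i]:  sparse task-adaptive parameters of task i.
    theta_star[i]: stored approximate solution of previous task i.
    """
    task_loss = F.cross_entropy(logits, targets)
    # lambda_1 * sum_{i=1..t} ||tau_i||_1 : sparsity on task-adaptive parameters
    sparsity = sum(tau.abs().sum() for tau in taus[: t + 1])
    # lambda_2 * sum_{i=1..t-1} ||theta*_i - (sigma ⊗ m_i + tau_i)||_2^2 : retroactive drift penalty
    drift = sum(((theta_star[i] - (sigma * masks[i] + taus[i])) ** 2).sum()
                for i in range(t))
    return task_loss + lam1 * sparsity + lam2 * drift
```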
Experimental Results
APD variants outperform recent expansion-based continual learning baselines
with minimal capacity expansion and training time.
Experimental Results
APD shows remarkably superior and reliable performance in terms of task-order
fairness (robustness).
Large-scale Continual Learning (Number of Tasks)
We further validate the scalability of our model with large-scale continual learning
experiments on the Omniglot dataset, which consists of 100 tasks.
The plot shows that APD scales well, with logarithmic growth in network
capacity (the number of parameters), while PGN shows linear growth.
Models       Capacity    Accuracy
STL          10,000%     82.13 ± 0.08%
L2T           1,599%     64.65 ± 1.76%
EWC           1,599%     68.66 ± 1.92%
PGN-large     1,543%     79.35 ± 0.12%
PGN-small     1,045%     73.65 ± 0.27%
APD-large       943%     81.60 ± 0.53%
APD-small       649%     81.20 ± 0.62%
Preventing Catastrophic Forgetting
APD variants show no sign of catastrophic forgetting on earlier tasks, although
their performance changes marginally over the course of training.
Selective Task Forgetting
There is no performance degradation on non-target tasks, since dropping the
task-adaptive parameters of a specific task does not affect the remaining tasks.
[Plots: Forgetting (Training Step 3) and Forgetting (Training Step 5).]
This ability to selectively forget is another important advantage of our model that
makes it practical in lifelong learning scenarios.
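A minimal sketch of this operation, assuming the per-task parameters are stored in dictionaries keyed by task id (an assumption about bookkeeping, not the paper's code):

```python
def forget_task(taus: dict, mask_logits: dict, task_id) -> None:
    """Drop the task-adaptive parameters and mask logits of one task.
    The shared sigma and every other task's (tau_i, v_i) are untouched,
    so performance on the remaining tasks is unaffected."""
    taus.pop(task_id, None)
    mask_logits.pop(task_id, None)
```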
Conclusion
• We tackle practically important and novel problems in continual learning that have
been overlooked thus far, such as scalability and order-robustness.
• We introduce a novel CL framework based on decomposing the network parameters
into task-shared and sparse task-adaptive parameters.
• We perform extensive experimental validation of our model on multiple datasets
against recent continual learning methods. APD is significantly superior to them
in terms of accuracy, efficiency, scalability, and order-robustness.
Thanks