DEEP LEARNING
(22ISE74A)
VII SEMESTER
Module-3
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
Course Outcomes
After completion of the course, the student will be able to:
22ISE74A.1: Understand the fundamental issues and challenges of deep learning: data, model selection, model complexity, etc.
22ISE74A.2: Describe core deep learning concepts and algorithms.
22ISE74A.3: Apply CNN and RNN models to real-time applications.
22ISE74A.4: Identify various challenges involved in designing and implementing deep learning algorithms.
22ISE74A.5: Relate deep learning algorithms to the given types of learning tasks in varied domains.
Text Book:
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
Reference books:
1. Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009.
2. N. D. Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", January 2016.
3. Nikhil Buduma, "Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms", O'Reilly Publications.
4. Navin Kumar Manaswi, "Deep Learning with Applications Using Python: Chatbots and Face, Object, and Speech Recognition with TensorFlow and Keras", Apress, 2018.
Module-3: Optimization for Training Deep Models
• Optimization in deep learning refers to the methods used to adjust a neural network's parameters
(weights and biases) to minimize a loss function, thereby improving model accuracy and training
speed.
• This minimization aims to improve the model's performance by reducing the discrepancy between its
predictions and the actual target values in the training data.
How Learning Differs from Pure Optimization
• Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways: machine learning usually acts indirectly.
• In most machine learning scenarios, we care about some performance measure P that is defined with respect to the test set and may also be intractable.
• We therefore optimize P only indirectly.
• We reduce a different cost function J(θ) in the hope that doing so will improve P.
• This is in contrast to pure optimization, where minimizing J is a goal in and of itself.
• Optimization algorithms for training deep models also typically include some specialization on the
specific structure of machine learning objective functions.
The cost function can typically be written as an average over the training set, such as

J(θ) = 𝔼_{(x,y)∼p̂_data} L(f(x; θ), y)

• where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution.
• In the supervised learning case, y is the target output.
• We would prefer to minimize the corresponding objective function in which the expectation is taken across the data-generating distribution p_data rather than just over the finite training set:

J*(θ) = 𝔼_{(x,y)∼p_data} L(f(x; θ), y)        (1)
Empirical Risk Minimization:
• The goal of a machine learning algorithm is to reduce the expected generalization error given by equation (1). This quantity is known as the risk.
• If we knew the true distribution 𝑃𝑑𝑎𝑡𝑎 (x, y), risk minimization would be an optimization task solvable
by an optimization algorithm.
• The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set.
• This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set.
• We now minimize the empirical risk

𝔼_{(x,y)∼p̂_data}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i))
• where m is the number of training examples.
• The training process based on minimizing this average training error is known as empirical risk
minimization.
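As a minimal illustration of this average (a sketch; the linear model, squared-error loss, and synthetic data below are our own illustrative choices, not from the text):

```python
import numpy as np

def empirical_risk(f, theta, X, Y, loss):
    """Average per-example loss over the m training examples."""
    return np.mean([loss(f(x, theta), y) for x, y in zip(X, Y)])

# Illustrative setup: linear model f(x; theta) = theta . x, squared-error loss.
f = lambda x, theta: theta @ x
squared_error = lambda y_hat, y: 0.5 * (y_hat - y) ** 2

X = np.random.randn(100, 3)                 # m = 100 examples, 3 features
Y = X @ np.array([1.0, -2.0, 0.5])          # synthetic targets
print(empirical_risk(f, np.zeros(3), X, Y, squared_error))
```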
• The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as the 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere).
Surrogate Loss Functions and Early Stopping
• Surrogate loss functions are approximations of true, often hard-to-optimize, loss functions used in
machine learning to make model training more tractable.
• In many cases, the ideal, or true, objective is difficult or impossible to optimize directly because it
lacks mathematically convenient properties like differentiability.
• The ultimate goal in classification is to minimize the 0-1 loss, which is 0 for a correct prediction and 1 for an incorrect one.
• Because the 0-1 loss is not differentiable, a surrogate such as the negative log-likelihood of the correct class is minimized instead.
• The surrogate can even enable further learning: for example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate.
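A small sketch of why a surrogate helps (binary classification with a logistic score; the names are illustrative): the 0-1 loss gives a zero or undefined derivative, while the negative log-likelihood surrogate is smooth and keeps supplying a gradient:

```python
import numpy as np

def zero_one_loss(score, y):        # y in {+1, -1}
    return 0.0 if np.sign(score) == y else 1.0   # derivative: 0 or undefined

def nll_surrogate(score, y):        # logistic negative log-likelihood
    return np.log1p(np.exp(-y * score))          # smooth and differentiable

def nll_grad(score, y):             # gradient w.r.t. the score
    return -y / (1.0 + np.exp(y * score))

score, y = 0.3, 1
print(zero_one_loss(score, y), nll_surrogate(score, y), nll_grad(score, y))
```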
• Early stopping is a regularization technique in deep learning that prevents models from overfitting by
halting training when performance on a validation set stops improving.
• It involves monitoring a chosen metric, like validation loss, and saving the best model weights
encountered so far.
• It stops the model from memorizing noise and irrelevant patterns in the training data, leading to
better generalization.
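A minimal early-stopping loop, assuming hypothetical `train_step` and `validation_loss` routines of your own:

```python
import copy

def train_with_early_stopping(model, train_step, validation_loss,
                              patience=10, max_epochs=1000):
    """Stop when the validation metric fails to improve for `patience` epochs."""
    best_loss, best_model, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                        # one pass over the training set
        loss = validation_loss(model)            # monitored metric
        if loss < best_loss:
            best_loss = loss
            best_model = copy.deepcopy(model)    # save the best weights so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # validation stopped improving
    return best_model
```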
Batch and Minibatch Algorithms
In machine learning, particularly in the context of gradient descent optimization (a fundamental optimization algorithm that iteratively finds the set of parameters, such as weights and biases, that minimizes a cost function), batch and mini-batch algorithms refer to how training data is used to update model parameters.
• Optimization algorithms for machine learning typically compute each update to the parameters
based on an expected value of the cost function estimated using only a subset of the terms of the full
cost function.
• For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over the examples:

θ_ML = argmax_θ Σ_{i=1}^m log p_model(x^(i), y^(i); θ)
• Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

J(θ) = 𝔼_{(x,y)∼p̂_data} log p_model(x, y; θ)
• Most of the properties of the objective function J used by most of our optimization algorithms are
also expectations over the training set.
Minibatch sizes are generally driven by the following factors (a sampling sketch follows this list):
• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches. This motivates using
some absolute minimum batch size, below which there is no reduction in the time to process a
minibatch.
• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount
of memory scales with the batch size.
• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using graphics processing units (GPUs), it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
• Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning
process. Generalization error is often best for a batch size of 1. Training with such a small batch size
might require a small learning rate to maintain stability due to the high variance in the estimate of the
gradient. The total runtime can be very high due to the need to make more steps, both because of the
reduced learning rate and because it takes more steps to observe the entire training set.
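A sketch of epoch-wise minibatch sampling consistent with these factors (the power-of-2 default of 64 is an illustrative choice):

```python
import numpy as np

def minibatches(X, Y, batch_size=64, seed=0):
    """Yield shuffled minibatches; batch_size=64 is a typical power of 2."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    order = rng.permutation(m)          # sample without replacement per epoch
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

X = np.random.randn(1000, 10)
Y = np.random.randn(1000)
for xb, yb in minibatches(X, Y):
    pass   # one parameter update per minibatch would go here
```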
Challenges in Neural Network Optimization
• Optimization in general is an extremely difficult task.
• Traditionally, machine learning has avoided the difficulty of general optimization by carefully
designing the objective function and constraints to ensure that the optimization problem is convex.
• When training neural networks, we must confront the general non-convex case.
• Even convex optimization is not without its complications.
Ill-Conditioning:
▪ It refers to a problem where the Hessian matrix of the cost function (the matrix of second-order partial derivatives of the loss with respect to the network's weights and biases) has a very large ratio between its maximum and minimum eigenvalues, making the optimization process highly sensitive to small errors in the gradient.
▪ This causes the training to become unstable, slow, and potentially "stuck," as small parameter
updates can lead to disproportionately large changes in the cost function, often resulting in the loss
of numerical precision and difficulty in finding the true minimum.
▪ The ill-conditioning problem is generally believed to be present in neural network training problems.
▪ Ill-conditioning can manifest by causing stochastic gradient descent (SGD) to get “stuck” in the sense that even very small steps increase the cost function.
• A second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg will add

(1/2)ε² gᵀHg − ε gᵀg

to the cost.
• Ill-conditioning of the gradient becomes a problem when (1/2)ε² gᵀHg exceeds ε gᵀg.
• To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm gᵀg and the gᵀHg term. In many cases, the gradient norm does not shrink significantly throughout learning, but the gᵀHg term grows by more than an order of magnitude.
• The result is that learning becomes very slow despite the presence of a strong gradient because the
learning rate must be shrunk to compensate for even stronger curvature.
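A toy numerical check of the two Taylor-expansion terms above (the Hessian and gradient are made up to mimic ill-conditioning):

```python
import numpy as np

# Ill-conditioned quadratic: eigenvalues of H span three orders of magnitude.
H = np.diag([1.0, 1000.0])
g = np.array([1.0, 1.0])
eps = 0.01

curvature_term = 0.5 * eps**2 * g @ H @ g   # (1/2) eps^2 g^T H g
gradient_term = eps * g @ g                 # eps g^T g

# When the curvature term exceeds the gradient term, the step *increases* cost.
print(curvature_term, gradient_term, curvature_term > gradient_term)
```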
Figure 8.1: Gradient descent often does not arrive at a critical point of any kind.
Local Minima:
• It is a point in an error or loss landscape where the loss is lower than at all nearby points, but not necessarily the lowest possible loss overall.
• One of the most prominent features of a convex optimization problem is that it can be reduced to the
problem of finding a local minimum.
• During training, optimization algorithms like gradient descent can converge to these local minima, which are like small valleys on a hilly landscape, rather than to the lowest point on the entire landscape (the global minimum), which is the true optimal solution.
• For a convex function, any local minimum is guaranteed to be a global minimum.
• Some convex functions have a flat region at the bottom rather than a single global minimum point,
but any point within such a flat region is an acceptable solution.
• When optimizing a convex function, we know that we have reached a good solution if we find a
critical point of any kind.
• With non-convex functions, such as neural nets, it is possible to have many local minima.
• Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local
minima.
Plateaus, Saddle Points and Other Flat Regions
• A plateau is a region on the loss surface where the gradient becomes very small or nearly flat.
• Training slows down significantly as the gradient becomes too small to provide a strong direction for
improvement.
https://www.youtube.com/watch?v=DKyUY1tnRRM
• Saddle Points: A critical point on a surface where the gradient is zero, but the surface is flat in some
directions and steep in others.
• Standard gradient descent can stall at a saddle point because the gradient there is zero, even though the point is not a minimum.
General Flat Regions (including Plateaus):
• This refers to any region where the surface is nearly horizontal or very gently sloped.
• These flat regions can also cause optimization algorithms to stall, similar to plateaus, leading to
slower convergence or premature stopping of training.
• The presence of these flat regions further complicates the optimization process, making it harder to
find the true global minimum of the cost function.
Cliffs and Exploding Gradients:
• "Cliffs" refer to extremely steep regions of a neural network's loss surface, often alongside "valleys,"
which can make training unstable by causing large, potentially unstable updates to weights
during backpropagation.
• This is a manifestation of exploding gradients, where gradient values grow exponentially large,
leading to drastic weight changes that can make the optimizer overshoot a minimum, oscillate, or
even result in NaN (Not a Number) values and a training crash.
• These are sharp, cliff-like drops in the loss function, creating steep walls in the parameter space
where gradients become extremely large.
• Exploding Gradients:
▪ It is a problem in deep neural network training where the gradients of the loss function with respect to the network's weights become excessively large, growing exponentially during backpropagation.
▪ This leads to very large and unstable weight updates, causing the training process to diverge and the
model to be unable to learn effectively.
▪ Exploding gradients typically occur in deep or recurrent neural networks, often due to factors like inappropriate hyperparameters (e.g., a high learning rate), an unsuitable model architecture, or poor weight initialization.
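Gradient clipping is a standard remedy for cliffs and exploding gradients; a minimal norm-clipping sketch (the threshold of 1.0 is an illustrative choice):

```python
import numpy as np

def clip_gradient(g, threshold=1.0):
    """Rescale g so its norm is at most `threshold` (gradient-norm clipping)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)   # keep the direction, cap the step size
    return g

g = np.array([30.0, -40.0])          # an "exploded" gradient with norm 50
print(clip_gradient(g))              # rescaled to unit norm
```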
Basic Algorithms
Stochastic Gradient Descent (SGD):
• An iterative optimization algorithm that updates model parameters using a small batch or a single
data point at a time to minimize a loss function, offering a more computationally efficient alternative
to standard gradient descent.
• This "stochastic" approach makes SGD faster and more scalable for large datasets by providing noisy
but quicker updates, which can also help the model avoid getting stuck in poor local minima.
• https://www.youtube.com/watch?v=UmathvAKj80&t=26s
Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
----------------------------------------------------------------------------------------------------------------------
• Require: Learning rate ε_k.
• Require: Initial parameter θ.
• while stopping criterion not met do
• Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
• Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
• Apply update: θ ← θ − ε_k ĝ
• end while
----------------------------------------------------------------------------------------------------------------------
• A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε.
• In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the
learning rate at iteration k as 𝜖𝑘.
• A sufficient condition to guarantee convergence of SGD is that

Σ_{k=1}^∞ ε_k = ∞   and   Σ_{k=1}^∞ ε_k² < ∞
In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ,   with α = k/τ.

After iteration τ, it is common to leave ε constant.
• The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring
learning curves that plot the objective function as a function of time.
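A sketch of the linear decay schedule; setting ε_τ to roughly 1% of ε_0 is a common rule of thumb:

```python
def lr_schedule(k, eps0=0.1, tau=1000, eps_tau_fraction=0.01):
    """Linear decay to eps_tau = eps0 * eps_tau_fraction, then constant."""
    eps_tau = eps0 * eps_tau_fraction   # common rule of thumb: ~1% of eps0
    if k >= tau:
        return eps_tau                  # after iteration tau, keep eps constant
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

print(lr_schedule(0), lr_schedule(500), lr_schedule(5000))
```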
Momentum
• Momentum-based gradient optimizers are advanced techniques used to enhance the training of machine learning models.
• Unlike classic gradient descent, they incorporate a "momentum" term that helps the optimizer
navigate the loss surface more efficiently.
• The method of momentum is designed to accelerate learning, especially in the face of high
curvature, small but consistent gradients, or noisy gradients.
• The momentum algorithm accumulates an exponentially decaying moving average of past gradients
and continues to move in their direction.
• Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the
direction and speed at which the parameters move through parameter space.
• A hyperparameter 𝛼∈ [0, 1) determines how quickly the contributions of previous gradients
exponentially decay.
The update rule is given by:

v ← αv − ε ∇_θ [ (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)) ]   and   θ ← θ + v
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
----------------------------------------------------------------------------------------------------------------------
• Require: Learning rate ε, momentum parameter α.
• Require: Initial parameter θ, initial velocity v.
• while stopping criterion not met do
• Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
• Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
• Compute velocity update: v ← αv − εg
• Apply update: θ ← θ + v
• end while
----------------------------------------------------------------------------------------------------------------------
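A sketch of Algorithm 8.2 using the same `grad`, `X`, `Y` conventions as the SGD sketch above; the velocity accumulates gradients with decay α, so consistent gradients compound into larger steps:

```python
import numpy as np

def sgd_momentum(grad, theta, X, Y, lr=0.01, alpha=0.9,
                 batch_size=32, n_steps=1000, seed=0):
    """SGD with momentum (Algorithm 8.2): v is a decaying sum of past gradients."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)
    m = X.shape[0]
    for _ in range(n_steps):
        idx = rng.choice(m, size=batch_size, replace=False)
        g = grad(theta, X[idx], Y[idx])   # gradient at the current point
        v = alpha * v - lr * g            # velocity update: v <- alpha*v - eps*g
        theta = theta + v                 # parameter update: theta <- theta + v
    return theta
```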
Nesterov Momentum:
An optimization technique that improves upon standard gradient descent by evaluating the gradient at an extrapolated position, rather than the current one, in order to take more informed steps toward the minimum of a cost function.
It is used to improve training speed and stability in neural networks.
• The update rules in this case are given by:

v ← αv − ε ∇_θ [ (1/m) Σ_{i=1}^m L(f(x^(i); θ + αv), y^(i)) ]   and   θ ← θ + v

• where the parameters α and ε play a similar role as in the standard momentum method.
• The difference between Nesterov momentum and standard momentum is where the gradient is
evaluated.
• With Nesterov momentum the gradient is evaluated after the current velocity is applied.
• Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard
method of momentum.
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
----------------------------------------------------------------------------------------------------------------------
• Require: Learning rate ε, momentum parameter α.
• Require: Initial parameter θ, initial velocity v.
• while stopping criterion not met do
• Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
• Apply interim update: θ̃ ← θ + αv
• Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
• Compute velocity update: v ← αv − εg
• Apply update: θ ← θ + v
• end while
----------------------------------------------------------------------------------------------------------------------
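The same sketch with the Nesterov look-ahead; the only change from the momentum sketch above is where the gradient is evaluated:

```python
import numpy as np

def sgd_nesterov(grad, theta, X, Y, lr=0.01, alpha=0.9,
                 batch_size=32, n_steps=1000, seed=0):
    """SGD with Nesterov momentum (Algorithm 8.3)."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)
    m = X.shape[0]
    for _ in range(n_steps):
        idx = rng.choice(m, size=batch_size, replace=False)
        g = grad(theta + alpha * v, X[idx], Y[idx])  # gradient at interim point
        v = alpha * v - lr * g
        theta = theta + v
    return theta
```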
Parameter Initialization Strategies
Parameter initialization strategies set initial values for a deep learning model's weights and biases so as to improve training, helping to prevent vanishing/exploding gradients and to ensure convergence.
• Training algorithms for deep learning models are usually iterative in nature and thus require the user
to specify some initial point from which to begin the iterations.
• Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly
affected by the choice of initialization.
• The initial point can determine whether the algorithm converges at all, with some initial points being
so unstable that the algorithm encounters numerical difficulties and fails altogether.
• When learning does converge, the initial point can determine how quickly learning converges and
whether it converges to a point with high or low cost.
• Also, points of comparable cost can have wildly varying generalization error, and the initial point can
affect the generalization as well.
• Key methods include zero initialization (avoided due to symmetry), random initialization (a good
default), and sophisticated methods like Xavier (Glorot) initialization for sigmoid/tanh activation
functions and He (Kaiming) initialization for ReLU activations, which are tailored to preserve variance
across layers.
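Minimal sketches of the Glorot and He rules (the layer sizes below are illustrative):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=0):
    """Xavier/Glorot: variance scaled by fan_in + fan_out (sigmoid/tanh layers)."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, seed=0):
    """He/Kaiming: variance 2/fan_in, suited to ReLU activations."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = glorot_uniform(256, 128)   # a 256 -> 128 tanh layer
W2 = he_normal(128, 64)         # a 128 -> 64 ReLU layer
b1, b2 = np.zeros(128), np.zeros(64)   # biases are commonly initialized to zero
```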
Algorithms with Adaptive Learning Rates
https://www.youtube.com/watch?v=MaNVmSCJBG4&t=6s
Adaptive learning rate methods are essential for optimizing deep learning models, as they help in
automatically adjusting the learning rates during the training process.
These methods have gained popularity due to their ability to ease the burden of selecting appropriate
learning rates and initialization strategies for deep neural networks.
AdaGrad:
• AdaGrad (Adaptive Gradient) is a deep learning optimizer that adapts the learning rate for each
parameter individually by accumulating squared gradients over time, leading to larger updates for
infrequent parameters and smaller updates for frequent ones.
• It excels with sparse data, such as in natural language processing and image recognition, because it
dynamically scales the learning rate, making it faster to learn rare features and slowing down learning
for common ones.
• Individually adapts the learning rates of all model parameters by scaling them inversely proportional
to the square root of the sum of all of their historical squared values.
• The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease
in their learning rate, while parameters with small partial derivatives have a relatively small decrease
in their learning rate.
Algorithm 8.4 The AdaGrad algorithm
----------------------------------------------------------------------------------------------------------------------
• Require: Global learning rate ε
• Require: Initial parameter θ
• Require: Small constant δ, perhaps 10⁻⁷, for numerical stability
• Initialize gradient accumulation variable r = 0
• while stopping criterion not met do
• Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
• Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
• Accumulate squared gradient: r ← r + g ⊙ g
• Compute update: Δθ ← −(ε / (δ + √r)) ⊙ g   (division and square root applied element-wise)
• Apply update: θ ← θ + Δθ
• end while
----------------------------------------------------------------------------------------------------------------------
RMSProp:
• Root Mean Square Propagation (RMSProp) is an adaptive learning rate optimization algorithm in deep learning that adjusts the learning rate for each parameter individually based on the magnitude of recent gradients, helping to prevent exploding and vanishing gradients.
• Instead of a fixed learning rate, RMSProp uses an adaptive one that changes based on the gradients,
becoming smaller for large gradients and larger for small ones.
Algorithm 8.5 The RMSProp algorithm
----------------------------------------------------------------------------------------------------------------------
• Require: Global learning rate ε, decay rate ρ
• Require: Initial parameter θ
• Require: Small constant δ, usually 10⁻⁶, used to stabilize division by small numbers
• Initialize gradient accumulation variable r = 0
• while stopping criterion not met do
• Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
• Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
• Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
• Compute update: Δθ ← −(ε / √(δ + r)) ⊙ g   (ε/√(δ + r) applied element-wise)
• Apply update: θ ← θ + Δθ
• end while
----------------------------------------------------------------------------------------------------------------------
• RMSProp also incorporates an estimate of the (uncentered) second-order moment of the gradient; however, it lacks Adam's bias-correction factor.
• Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training.
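A sketch of Algorithm 8.5 under the same conventions; note the decaying accumulation replacing AdaGrad's running sum:

```python
import numpy as np

def rmsprop(grad, theta, X, Y, lr=0.001, rho=0.9, delta=1e-6,
            batch_size=32, n_steps=1000, seed=0):
    """RMSProp (Algorithm 8.5): exponentially decaying average of squared gradients."""
    rng = np.random.default_rng(seed)
    r = np.zeros_like(theta)
    m = X.shape[0]
    for _ in range(n_steps):
        idx = rng.choice(m, size=batch_size, replace=False)
        g = grad(theta, X[idx], Y[idx])
        r = rho * r + (1 - rho) * g * g      # decaying accumulation
        theta = theta - (lr / np.sqrt(delta + r)) * g
    return theta
```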
Choosing the Right Optimization Algorithm
• These methods each seek to address the challenge of optimizing deep models by adapting the learning rate for each model parameter.
• Currently, the most popular optimization algorithms actively in use include SGD, SGD with
momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam.
• The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity
with the algorithm (for ease of hyperparameter tuning).
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
----------------------------------------------------------------------------------------------------------------------
• Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
• Require: Initial parameter θ, initial velocity v.
• Require: Small constant δ, usually 10⁻⁶, used to stabilize division by small numbers.
• Initialize gradient accumulation variable r = 0
• while stopping criterion not met do
• Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
• Compute interim update: θ̃ ← θ + αv
• Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
• Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
• Compute velocity update: v ← αv − (ε / √r) ⊙ g   (1/√r applied element-wise)
• Apply update: θ ← θ + v
• end while
----------------------------------------------------------------------------------------------------------------------