2. ANN - IMITATES THE HUMAN BRAIN!!!
• ANNs are computational models inspired by
the human brain's structure and function.
• An ANN is organized into interconnected layers,
such as the input layer, one or more hidden
layers, and the output layer.
3. BIOLOGICAL NEURON
• Biological Neurons are of many different types
• Dendrites receive signals from adjacent neurons, the cell
body processes the input and
• Generates activations that are passed through the axon to
the other neurons.
• Cell can perform complex non-linear computations
• Synapses are not a single weight but a complex non-linear
dynamical system
4. PERCEPTRON
• A perceptron is the simplest form of an artificial
neuron and the fundamental building block of
neural networks.
• Introduced by Frank Rosenblatt in 1958, it’s a
mathematical model inspired by biological
neurons.
• It’s a binary classifier that makes a decision by
mapping input features to an output (0 or 1).
5. • Input Features: Multiple features representing
input data characteristics.
• Weights: Each feature is assigned a weight
determining its influence on output.
• Summation Function: Calculates weighted sum of
inputs, combining them with respective weights.
• Activation Function: Passes weighted sum through
Heaviside step function to produce binary output (0
or 1).
Basic Components of Perceptron
6. • Output: Determined by activation function,
often used for binary classification tasks.
• Bias: Helps make adjustments independent of
input, improving learning flexibility.
• Learning Algorithm: Adjusts weights and bias
using a learning algorithm like Perceptron
Learning Rule.
Basic Components of Perceptron
7. • An artificial neuron is connected to an input vector x = [x1 x2 x3 x4]
• The input is multiplied with a weight vector w = [w1 w2 w3 w4]
• The weighted input is summed up: weighted sum = x1·w1 + x2·w2 + x3·w3 + x4·w4
• A bias term b is added to form the net sum: net sum = x1·w1 + x2·w2 + x3·w3 + x4·w4 + b
• The net sum is finally passed through an activation function to produce
the neuron's final output
How does a Perceptron Work?
8. UNIT-STEP FUNCTION
A simple activation function that produces a binary output from a given input
n. With respect to a chosen threshold t, it is defined as:
f(n) = 0 if n < t, and f(n) = 1 otherwise.
9. HOW DOES A PERCEPTRON WORK?
Step 1: Compute the weighted sum: z = w·x + b
Step 2: Apply the activation (step function) to z to get the prediction ŷ.
Training the Perceptron:
• Initialize: start with weights w = 0 and bias b = 0
• For each data point: compute the prediction ŷ, then adjust the weights and
bias with the Perceptron Learning Rule: w ← w + η(y − ŷ)x, b ← b + η(y − ŷ)
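A minimal NumPy sketch of the perceptron described above (step activation, zero-initialized weights, and the Perceptron Learning Rule); the AND-gate data and learning rate are illustrative:

```python
import numpy as np

def step(n, t=0.0):
    # Unit-step activation: 1 if the net input reaches the threshold t, else 0
    return np.where(n >= t, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=10):
    # Initialize weights and bias to zero, as on the slide
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                  # for each data point
            y_hat = step(np.dot(w, xi) + b)       # weighted sum + bias, then step
            w += lr * (yi - y_hat) * xi           # perceptron learning rule
            b += lr * (yi - y_hat)
    return w, b

# Toy example: learn the logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(step(X @ w + b))   # expected: [0 0 0 1]
```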
11. ACTIVATION FUNCTIONS
Sigmoid: σ(x) = 1 / (1 + e^(−x)). Outputs 0 to 1; used for binary classification (e.g., spam detection).
Graph shows a smooth S-curve.
Softmax: softmax(x_i) = e^(x_i) / Σ_j e^(x_j). Outputs probabilities (summing to 1); used for multi-class
classification (e.g., image labeling).
Tanh: tanh(x) = 2 / (1 + e^(−2x)) − 1. Outputs −1 to 1; similar to Sigmoid but centered at 0, good for
balanced data. Graph shows a steeper S-curve.
ReLU: f(x) = max(0, x). Outputs the input value if it is positive, otherwise returns zero. In simple words,
ReLU only returns a positive output or zero.
Leaky ReLU: a variant of ReLU that addresses the dying-neurons issue. Leaky ReLU allows a non-zero
output even for negative values by introducing a small slope.
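A small NumPy sketch of these activation functions, matching the formulas above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))               # outputs in (0, 1)

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # outputs in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                     # positive part of the input

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)          # small slope for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))                     # shift for numerical stability
    return e / e.sum()                            # probabilities summing to 1

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), softmax(x), sep="\n")
```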
12. MULTILAYER PERCEPTRON
• The input X is fed into the first layer, which is a
multidimensional perceptron with a weight
matrix W1 and bias vector b1.
• The output of that layer is then fed into the
second layer, which is again a perceptron with
another weight matrix W2 and bias vector b2.
• This process continues for each of the L layers
until we reach the output layer. We refer to the
last layer as the output layer and to every other
layer as a hidden layer.
13. FORWARD PROPAGATION IN MLP
• An MLP with one hidden layer computes the function
  y = f2(W2 · f1(W1 · x + b1) + b2)
• An MLP with two hidden layers computes the function
  y = f3(W3 · f2(W2 · f1(W1 · x + b1) + b2) + b3)
• Like this, during forward propagation the activations/outputs from the
neurons of one layer are passed to the neurons of the next layer.
• Generally, an MLP with L − 1 hidden layers computes the function
  y = fL(WL · fL−1(... f1(W1 · x + b1) ...) + bL)
where fi denotes the activation function of layer i.
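A minimal NumPy sketch of forward propagation through an MLP; the layer sizes and the choice of ReLU hidden activations with a sigmoid output are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    # params is a list of (W, b) pairs, one per layer
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)              # activations passed to the next layer
    W_out, b_out = params[-1]
    z = W_out @ a + b_out
    return 1.0 / (1.0 + np.exp(-z))      # output layer (sigmoid)

# Example: 3 inputs -> hidden layer of 4 -> hidden layer of 4 -> 1 output
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(4, 4)), np.zeros(4)),
          (rng.normal(size=(1, 4)), np.zeros(1))]
print(forward(np.array([0.5, -1.0, 2.0]), params))
```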
14. BACKWARD PROPAGATION IN MLP
• During backward propagation (during training), the error of the loss function is calculated at
the output layer.
• We need to minimize the amount of error made by the neural network. This is done
through an iterative process known as training.
• The error of the loss function can be minimized by updating the network parameters
(weights and biases) of the connections between neurons.
• The gradients of the loss function are calculated w.r.t. the weights using the Backpropagation
algorithm.
• Applying Gradient Descent, the parameter update is performed (updating weights and
biases).
• These steps are repeated up to a fixed number of iterations or until the difference of errors
in two successive iterations is less than a predetermined threshold (meeting the
convergence criterion).
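A minimal sketch of this training loop on a one-weight linear model (y = w·x + b) with an MSE loss, so the gradients can be written in closed form; the data, learning rate, and stopping threshold are illustrative:

```python
import numpy as np

def train(X, y, lr=0.01, max_iters=10000, tol=1e-9):
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    for _ in range(max_iters):
        y_hat = w * X + b
        loss = np.mean((y_hat - y) ** 2)          # error of the loss function
        dw = np.mean(2 * (y_hat - y) * X)         # gradient w.r.t. the weight
        db = np.mean(2 * (y_hat - y))             # gradient w.r.t. the bias
        w -= lr * dw                              # gradient-descent update
        b -= lr * db
        if abs(prev_loss - loss) < tol:           # convergence criterion
            break
        prev_loss = loss
    return w, b

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * X + 1
print(train(X, y))   # approaches (2.0, 1.0)
```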
16. COMPUTING NET INPUT TO NEURONS AT HIDDEN LAYER
The image depicts a neural network layer
where biases b1, b2, b3 are added to
hidden nodes h1, h2, h3 alongside
weighted inputs from x1, x2, x3. These
biases shift the activation function, helping
the network better fit the data by adjusting
the output of each neuron.
17. COMPUTING NET INPUT TO NEURONS AT HIDDEN LAYER
Passes through an activation
function to generate the
output of the neurons.
18. WEIGHTS BETWEEN INPUT-HIDDEN LAYERS
The image shows the connections between the input
and hidden layers in a neural network, where inputs
x1, x2, x3 are linked to hidden nodes h1,
h2, h3, h4 via weights. The weight matrix
defines the strength of these
connections, crucial for transforming input
data in the network.
19. COMPUTING NET INPUT TO NEURONS AT OUTPUT LAYER
The image shows backward propagation in
a neural network, where the error between
predicted outputs o1, o2 and actual targets is
calculated and propagated backward from the
output layer to hidden layers and inputs x1,
x2, x3. This process adjusts weights and
biases to minimize the error, enabling the
network to learn effectively.
20. COMPUTING NET INPUT TO NEURONS AT OUTPUT LAYER
• The diagram shows an artificial Neuron
connected with an input vector x=[x1 x2 x3 x4]
• The input is multiplied with a weight vector
w=[w1 w2 w3 w4]
• The weighted input is summed up, weighted sum
= x1* w1 + x2* w2 + x3* w3 + x4* w4
• A bias term b is added to form the net sum: net
sum = x1·w1 + x2·w2 + x3·w3 + x4·w4 + b
• The net sum is finally passed through an
activation function to produce the neuron's
final output
21. IMPORTANT POINTS TO REMEMBER ABOUT
MLP
MLPs are connectionist computational models
• MLPs can solve classification and regression problems
• MLPs can compose Boolean functions
• MLPs can compose real-valued functions
• MLPs are Universal function approximators (Universal approximation Theorem)
• MLPs can represent any function if
> it is sufficiently wide (number of neurons in a hidden layer)
> it is sufficiently deep (number of hidden layers)
> depth can be traded off for (sometimes) exponential growth of the width of the network
• Optimal width and depth depend on the number of input variables and the complexity of the
function it is trying to model
22. ANN – REGRESSION AND CLASSIFICATION
In regression, ANNs aim to predict continuous numerical values, while in
classification, they aim to categorize data into discrete classes.
23. • Goal: To predict a continuous numerical output.
• Example: Predicting house prices, stock prices,
or temperature.
• Output Layer: A linear activation function is
typically used in the output layer.
• Loss Function: Mean Squared Error (MSE) is
commonly used as the loss function.
• Applications: Predicting quantities, estimating
values, and forecasting trends.
REGRESSION
• Goal: To categorize data into discrete classes or
categories.
• Example: Classifying emails as spam or not spam,
recognizing handwritten digits, or identifying
image objects.
• Output Layer: A Softmax activation function is
used to produce probabilities for each class.
• Loss Function: Cross-entropy loss is typically used
as the loss function.
• Applications: Pattern recognition, object
detection, and sentiment analysis.
CLASSIFICATION
24. LOSS FUNCTIONS
• Measures how good or bad model predictions are compared to actual
results
• Outputs a single number showing error magnitude — smaller is better
• Used to guide model training (e.g., via Gradient Descent)
• Helps evaluate model performance and influences learning behavior
Why are Loss Functions
Important?
• Guide the optimization of model parameters
• Measure difference between predicted and true values
• Different loss functions suit different tasks and affect model learning
25. REGRESSION LOSS FUNCTIONS
Used for predicting continuous values (e.g., price, age)
• Mean Squared Error (MSE) Loss: average of squared differences between
predicted and actual values; sensitive to outliers.
• Mean Absolute Error (MAE) Loss: average of absolute differences between
predicted and actual values; less sensitive to outliers but not differentiable at zero.
• Huber Loss: combines MSE and MAE benefits; less sensitive to outliers and
differentiable everywhere; requires tuning of the parameter δ.
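A short NumPy sketch of these three regression losses; the sample values and δ = 1.0 are illustrative:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                       # MSE-like near zero
    linear = delta * (np.abs(err) - 0.5 * delta)     # MAE-like for large errors
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```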
26. CLASSIFICATION LOSS FUNCTIONS
Used for evaluating how well predicted class labels match actual labels
• Binary Cross-Entropy Loss (Log Loss): for binary classification (0 or 1); measures the
difference between predicted probabilities and actual labels.
• Categorical Cross-Entropy Loss: for multi-class classification with one-hot encoded labels.
• Sparse Categorical Cross-Entropy Loss: similar to Categorical Cross-Entropy but uses
integer labels (not one-hot).
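A short NumPy sketch of these cross-entropy losses; the probability values are illustrative:

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    # y_true in {0, 1}; p = predicted probability of class 1
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs):
    # y_onehot: one-hot labels; probs: predicted class probabilities per row
    probs = np.clip(probs, 1e-7, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

def sparse_categorical_cross_entropy(y_int, probs):
    # y_int: integer class labels instead of one-hot vectors
    probs = np.clip(probs, 1e-7, 1.0)
    return -np.mean(np.log(probs[np.arange(len(y_int)), y_int]))

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
print(categorical_cross_entropy(np.eye(3)[[0, 1]], probs))
print(sparse_categorical_cross_entropy(np.array([0, 1]), probs))
```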
29. • Cost function and loss function are often used
interchangeably, but conceptually they are
slightly different.
• Loss/error is what we compute for a single
training example/input sample.
• A cost function, on the other hand, is the
average loss/error over the entire training
dataset (batch/mini-batch).
• The optimization algorithms aim at
minimizing the cost function.
CONTOUR PLOT OF THE LOSS/ERROR SURFACE
30. AVOIDING LOCAL MINIMA!!!
• If the learning rate is too large, we can overshoot the minimum.
• If the learning rate is too small, convergence is very slow.
31. RANDOM WEIGHT INITIALIZATION
• Random weight initialization ensures diverse
starting points on the loss surface.
• Training is empirical; multiple runs help reach
better local minima.
• Loss functions are often non-convex → multiple
local minima.
• In high-dimensional spaces, gradients rarely fall
to exactly zero → we approximate minima.
• Local optima often yield good results; searching
for the global minimum is unnecessary and
computationally expensive.
33. GRADIENT DESCENT AND ITS VARIANTS
GRADIENT DESCENT
• Batch Gradient Descent: the entire dataset is used for each update.
• Stochastic Gradient Descent: a single observation is used for each update.
• Mini-batch Gradient Descent: a subset of the data is used for each update.
34. BATCH, SGD & MINI-BATCH GD
• Batch Gradient Descent uses the entire dataset to
compute the gradient and update parameters,
providing a smooth but sometimes slow convergence
path.
• Stochastic Gradient Descent (SGD) updates
parameters using one data point at a time, resulting in
a noisy but faster and more frequent update, which
helps escape local minima.
• Mini-Batch Gradient Descent strikes a balance by
using small batches of data for each update, combining
efficiency and stable convergence.
35. BATCH, SGD & MINI-BATCH GD
• Gradient descent involves computing the average over all
n examples, which can be time-consuming when the
training set grows.
• Mini-batch Gradient Descent avoids this by sampling a
mini-batch M of n′ examples from the training set.
• The gradient estimate is computed from the examples in
the mini-batch M, not from the entire training set.
• This algorithm is known as Stochastic Gradient Descent
(SGD) when the mini-batch contains just one training example (n′ = 1).
36. BATCH, SGD & MINI-BATCH GD
• If the mini-batch size = 1: Stochastic Gradient Descent;
every example (row) is used as its own mini-batch, so we
lose the speed-up from vectorization.
• If the mini-batch size = m: Batch Gradient Descent;
the entire training set (X, Y) is used, so each
iteration takes too long.
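A minimal NumPy sketch of mini-batch gradient descent on a toy linear model; setting batch_size to 1 or to the dataset size recovers SGD and Batch GD respectively (data and learning rate are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    # batch_size = 1      -> Stochastic Gradient Descent
    # batch_size = len(X) -> Batch Gradient Descent
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))                   # shuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w + b - y[batch]           # linear model, MSE loss
            w -= lr * (X[batch].T @ err) / len(batch)   # gradient from the mini-batch
            b -= lr * err.mean()
    return w, b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(minibatch_gd(X, y, batch_size=2))   # approaches w ≈ [2], b ≈ 1
```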
37. ADAPTIVE MOMENT ESTIMATION (ADAM) OPTIMIZER
Optimization algorithm that combines Momentum and RMSProp to adaptively adjust
learning rates during training. It is efficient, requires minimal tuning, and performs
well on large and complex datasets.
38. EFFECT OF THE LR ON TRAINING
• The learning rate (LR) controls how much the
model's weights are updated during training.
• A high LR may cause the model to overshoot the
minimum, while a low LR can lead to slow
convergence or getting stuck in local minima.
39. PROJECT DEMO
Customer Purchase Prediction
This project predicts whether a user will generate revenue based on behavioral and technical
features using an Artificial Neural Network (ANN). The model is trained on a customer behavior
dataset.
Dataset:
• Dataset Source: UCI Online Shoppers Purchasing Intention Dataset
• Target Variable: Revenue (Yes / No)
• Total Samples: ~12,330 records
• Classes: 0 → No Revenue, 1 → Revenue Generated
• Train-Test Split: Training Samples: 80%, Testing Samples: 20%
Note: Dropped less relevant features, encoded categorical data, and applied StandardScaler for
normalization.
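A hypothetical preprocessing sketch for this dataset; the file name, encoding choices, and the decision not to drop specific columns are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# File name and encoding strategy are placeholders
df = pd.read_csv("online_shoppers_intention.csv")
df = pd.get_dummies(df, drop_first=True)        # encode categorical columns
y = df.pop("Revenue").astype(int)               # target: 0 = no revenue, 1 = revenue

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y)   # 80/20 split

scaler = StandardScaler()                       # normalization, as noted above
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```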
48. Model Architecture (ANN) - Output
The model has 5 dense layers with
ReLU activation and dropout layers
to reduce overfitting. The final dense
layer uses sigmoid for binary
classification. Total trainable
parameters: 86,785.
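A Keras sketch of the described architecture (five Dense layers with ReLU, Dropout between them, and a sigmoid output); the layer widths below are assumptions and will not necessarily reproduce the 86,785 parameters reported above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_ann(n_features):
    # Five Dense layers in total; the last one is the sigmoid output
    model = Sequential([
        Dense(256, activation="relu", input_shape=(n_features,)),
        Dropout(0.3),
        Dense(128, activation="relu"),
        Dropout(0.3),
        Dense(64, activation="relu"),
        Dropout(0.3),
        Dense(32, activation="relu"),
        Dense(1, activation="sigmoid"),   # binary output: Revenue yes/no
    ])
    return model

model = build_ann(n_features=30)   # feature count is a placeholder
model.summary()
```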
49. Model Compilation & Training
Compile and train the ANN using Adam
optimizer.
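A sketch of the compile-and-fit step, continuing the model and data sketches above; the epoch count and batch size are assumptions:

```python
# Compile with the Adam optimizer and binary cross-entropy for the 0/1 target
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_split=0.2,   # hold out part of training data
                    epochs=50,
                    batch_size=32)
```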
51. WHAT IS CNN?
Convolutional Neural Networks (CNNs) are deep learning models designed to
recognize patterns in images. They use special layers to automatically detect
features like edges and shapes, making them very effective for tasks like image
classification and object detection.
52. MOTIVATION BEHIND USING
CNN
1. Convolution leverages three important ideas to improve ML systems:
i) Sparse interactions
ii) Parameter sharing
iii) Equivariant representations
2. Convolution also allows for working with inputs of variable size
53. SPARSE INTERACTION
Sparse interaction means each neuron
in a layer connects only to a small
region of the previous layer (called the
local receptive field), instead of all
neurons. This helps CNNs focus on
important local features in images
without using too many connections.
(Figure: sparse connectivity vs. dense connectivity)
54. SPARSE INTERACTION
Key Features:
• Sparse Connectivity → Local receptive fields:
each neuron in layer s connects only to a few neighboring inputs (x).
Example: s3 connects only to x2, x3, and x4, not all inputs. This is
the local receptive field idea.
• Local Neighbourhood Processing:
because each neuron in s sees only a small region of the input
space, it is only looking at nearby pixels/features.
• Efficient Learning → Reduced Input Size:
sparse connections mean fewer computations and less memory,
which helps models learn faster and scale better.
55. SPARSE INTERACTION
This diagram shows sparse connectivity across
multiple layers, which is common in Convolutional
Neural Networks (CNNs). Each neuron is connected
only to a small, localized group of neurons from the
previous layer, forming local receptive fields. As we
move deeper through the layers (x → h → g), the
receptive field increases, allowing the network to
understand larger and more complex patterns while
keeping the number of connections and
computations efficient. This layered structure
enables CNNs to extract both local and global
features effectively.
56. PARAMETER SHARING
• Same parameters (weights) are reused across different
parts of the input.
• Common in Convolutional Neural Networks (CNNs).
• A kernel/filter slides across the image and applies the
same weights.
• Helps in detecting the same feature (like edges or eyes)
anywhere in the image.
• Reduces memory usage and improves efficiency.
• Fewer parameters to learn → faster training and better
generalization.
The same filter detects both eyes — even though
they’re in different places
57. PARAMETER SHARING
• In CNNs, the same filter (kernel) is applied to every position of the input
image.
• This means each weight in the filter is reused across the entire input.
Suppose:
Input size = m × n
Kernel size = k × k
In a fully connected layer, parameters = m × n
In a convolutional layer, parameters = k × k (independent of input size)
So, the parameter count is reduced from m × n to k × k.
• The runtime of forward propagation remains O(k × n) (efficient)
• But the storage requirement drops to just k × k parameters (much
smaller)
58. PARAMETER SHARING
In the figure, the black arrows show how a single
weight from a 3-element kernel is applied across
different locations in the input. In contrast, the
bottom part shows a fully connected model where
each weight is used only once for one connection.
Since fully connected layers do not share parameters,
they need many more weights compared to
convolutional layers.
59. EDGE DETECTION
• Edge detection is used to highlight the boundaries in an image — like where one object ends
and another begins.
• It is done using special filters (kernels) that detect changes in pixel intensity.
• Two common filters are:
Mx - for detecting vertical edges
My - for detecting horizontal edges
• CNNs have revolutionized edge detection by learning hierarchical features directly from data.
• Traditional methods like Sobel and Canny rely on handcrafted filters, while CNN-based
methods learn optimal filters during training.
60. EDGE DETECTION
• These filters are applied (convolved)
with the input image.
• The result is two output images
showing detected edges in
horizontal and vertical directions.
• These are useful in tasks like object
detection, segmentation, and
feature extraction in CNNs.
61. EDGE DETECTION
In Convolutional Neural Networks
(ConvNets), edge detection identifies
patterns by matching small image sections
with a filter. Blue and green boxes mark
where the filter aligns with the image,
spotting edges between black and white
squares. This helps the network extract key
features for tasks like image classification.
62. SOBEL FILTERS
Sobel filters are special 3x3 kernels used in image
processing to detect edges in an image. They work by
emphasizing differences in pixel values, which helps
identify boundaries between objects (edges). There
are two main types:
1.Horizontal Sobel Filter: Detects horizontal edges
by looking for changes in pixel intensity along the
vertical direction.
2. Vertical Sobel Filter: Detects vertical edges by
looking for changes in pixel intensity along the
horizontal direction.
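A short SciPy/NumPy sketch applying the standard Sobel kernels to a toy image with a vertical edge; the image values are illustrative:

```python
import numpy as np
from scipy.ndimage import convolve

# Standard Sobel kernels: Mx responds to vertical edges, My to horizontal edges
Mx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
My = Mx.T

image = np.zeros((8, 8))
image[:, 4:] = 1.0                      # a vertical edge down the middle

edges_vertical = convolve(image, Mx)    # strong response along the edge
edges_horizontal = convolve(image, My)  # near zero for this image
print(edges_vertical)
```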
63. EQUIVARIANCE OF CONVOLUTION TO TRANSLATION
Translational Equivariance or just
equivariance is a very important
property of the convolutional
neural networks where the
position of the object in the image
should not be fixed for it to be
detected by the CNN.
64. EQUIVARIANCE OF CONVOLUTION TO TRANSLATION
• Equivariant means that if the input changes, the output changes in the same way.
• A function f(x) is equivariant to a function g if f(g(x))=g(f (x))
• If g is a function that translates the input, i.e., that shifts it, then the convolution
function is equivariant to g. I(x,y) is image brightness at point (x,y)
• I’=g(I) is image function with I’(x,y)=I(x-1,y), i.e., shifts every pixel of I one unit to
the right
• If we apply g to I and then apply convolution, the output will be the same as if we
applied convolution to I’, then applied transformation g to the output.
66. CONVOLUTION: THE MATH BEHIND THE MATCH
• Line up the feature and the image patch.
• Multiply each image pixel by the corresponding feature/filter pixel
(element-wise product).
• Add them up.
• Divide by the total number of pixels in the feature.
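A minimal NumPy sketch of this match computation (element-wise product, sum, divide by the number of pixels); the 2×2 feature and patch values are illustrative:

```python
import numpy as np

def match_score(patch, feature):
    # Multiply pixel by pixel, add up, divide by the number of pixels
    return np.sum(patch * feature) / feature.size

feature = np.array([[ 1, -1],
                    [-1,  1]])
patch   = np.array([[ 1, -1],
                    [-1,  1]])
print(match_score(patch, feature))   # 1.0 for a perfect match
```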
67. FILTERING: THE MATH BEHIND THE
MATCH
A kernel (small grid) matches with a part of the
image by multiplying their values. The green 2x2
kernel (1, 1, -1, 1) overlaps with a yellow 2x2 image
patch (1, 1, 1, 1). Multiply each pair: 1×1 + 1×1 + (-1)×1
+ 1×1 = 1 + 1 - 1 + 1 = 2. This result (2) shows how well
the kernel matches the image patch, helping detect
features like edges.
68. FILTERING: THE MATH BEHIND THE
MATCH
After multiplying and adding the kernel
(1, 1, -1, 1) with the image patch (1, 1, 1,
1), the result is 2. This value goes into a
new grid called a feature map, shown in
the blue box (top-left corner). As the
kernel slides over the entire image, it
fills the feature map with values,
highlighting where patterns like edges
are found.
69. FILTERING: THE MATH BEHIND THE
MATCH
The kernel (1, 1, -1, 1) slides to the next
image patch (1, -1, 1, 1). Multiply and
add: 1×1 + 1×(-1) + (-1)×1 + 1×1 = 1 - 1 - 1
+ 1 = 0. This result (0) is placed in the
feature map, next to the previous value
(1). The kernel keeps sliding to fill the
feature map with values that show
where patterns match.
70. FILTERING: THE MATH BEHIND THE
MATCH
The kernel (1, 1, -1, 1) moves to the next
image patch (1, -1, 1, 1). Multiply and
add: 1×1 + 1×(-1) + (-1)×1 + 1×1 = 1 - 1 - 1
+ 1 = 0. This result (0) is added to the
feature map, following the previous
values (1, 1). The kernel keeps sliding
across the image to build the feature
map, showing where patterns match.
71. FILTERING: THE MATH BEHIND THE
MATCH
The kernel (1, 1, -1, 1) slides to the next
image patch (1, 1, 1, 1). Multiply and
add: 1×1 + 1×1 + (-1)×1 + 1×1 = 1 + 1 - 1 +
1 = 2. This result (2) is added to the
feature map, following the previous
values (1, 1, 1). The kernel keeps moving
across the image to complete the
feature map, revealing pattern matches.
72. FILTERING: THE MATH BEHIND THE
MATCH
The kernel (1, 1, -1, 1) slides to the next
image patch (1, 1, 1, 1). Multiply and
add: 1×1 + 1×1 + (-1)×1 + 1×1 = 1 + 1 - 1 +
1 = 2. This result (2) is added to the
feature map, following the previous
values (1, 1, 1, 1). The kernel keeps
sliding to fill the feature map, showing
where patterns match.
73. FILTERING: THE MATH BEHIND THE
MATCH
The kernel (1, 1, -1, 1) slides to the next
image patch (1, 1, 1, 1). Multiply and
add: 1×1 + 1×1 + (-1)×1 + 1×1 = 1 + 1 - 1 +
1 = 2. This result (2) is added to the
feature map, following the previous
values (1, 1, 1, 1, 1). The kernel keeps
sliding to fill the feature map, showing
where patterns match.
74. FILTERING: THE MATH BEHIND THE
MATCH
A 3x3 kernel (1, -1, -1, 1, 1, -1, 1, -1, 1)
overlaps with a 3x3 image patch (1, 1,
1, 1, 1, 1, 1, 1, 1). Multiply each pair
and add: 1×1 + (-1)×1 + (-1)×1 + 1×1 +
1×1 + (-1)×1 + 1×1 + (-1)×1 + 1×1 = 1 - 1
- 1 + 1 + 1 - 1 + 1 - 1 + 1 = 1. Divide by 9
(the total number of pixels in the kernel) to get
≈ 0.11, which goes into the feature map.
75. CONVOLUTION: APPLYING A SHARPEN
FILTER
A 3x3 sharpen filter (-1, -1, -1, -1, 8, -1, -1, -1, -1)
slides over a 5x5 image. For the green patch
(21, 19, 17, 71, 76, 73, 153, 164, 164), multiply
and add: (-1)×21 + (-1)×19 + (-1)×17 + (-1)×71 +
8×76 + (-1)×73 + (-1)×153 + (-1)×164 + (-1)×164 = -
21 - 19 - 17 - 71 + 608 - 73 - 153 - 164 - 164 = -74.
This result (-74) goes into the feature map,
highlighting edges in the image.
76. FILTERS AND CONVOLUTIONAL FEATURE
MAP
• Filters (or kernels) are small grids of numbers used
to find patterns in an image, like edges or textures.
They slide over the image in a process called
convolution, calculating new values at each step.
This creates a feature map that highlights the
patterns found.
• In this example, two Sobel filters are used: one
detects horizontal edges, and the other detects
vertical edges. These feature maps help in
understanding important parts of the image and are
widely used in computer vision and deep learning.
77. APPLYING A BANK OF FILTERS
A filter bank contains multiple filters that
detect different patterns in an image.
When applied through convolution, they
create feature maps that highlight edges,
textures, and shapes. These filters are
learned automatically during training to
help the model understand the image.
78. CONVOLUTION MULTIPLE
CHANNELS
Convolution works with multiple input channels, such as a color image with Red,
Green, and Blue (RGB) layers. Each filter is also made up of multiple layers, one for
each channel. The filter is applied across all channels of the input, and the results are
summed to produce a single value in the output feature map. This process is
repeated for each filter to create multiple output channels. This technique helps
capture more complex features by combining information from all color channels.
79. CONVOLUTION MULTIPLE
CHANNELS
This example shows convolution with multiple channels using a 3D
input volume of size 7×7×3 (height × width × channels).
Each filter (W0 and W1) is also 3D (3×3×3) to match the input depth.
1.The input has 3 channels (like an RGB image), shown in blue boxes.
2. Filter W0 (in red) is applied across all 3 input channels. For each
position:
• Each 3×3 slice from the input is multiplied with the matching filter
slice.
• The results from all 3 slices are summed.
• A bias (b0) is added to produce a single number.
3.This process is repeated across the image to produce one output
channel (green numbers).
4.Similarly, Filter W1 produces a second output channel using the
same steps.
5.The final output volume is 3×3×2, where 2 is the number of filters
used.
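A minimal NumPy sketch of multi-channel convolution: a 5×5×3 input is zero-padded to 7×7×3 (taking the 7×7×3 volume above to be the padded input) and convolved with two random 3×3×3 filters standing in for W0 and W1 at stride 2, giving a 3×3×2 output as described:

```python
import numpy as np

def conv_multichannel(x, filters, biases, stride=2, pad=1):
    # x: (H, W, C) input volume; filters: list of (k, k, C) kernels.
    # Each filter is applied across all channels, the products are summed,
    # and a bias is added, giving one output channel per filter.
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    k = filters[0].shape[0]
    out_size = (x.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size, len(filters)))
    for f, (w, b) in enumerate(zip(filters, biases)):
        for i in range(out_size):
            for j in range(out_size):
                patch = x[i*stride:i*stride+k, j*stride:j*stride+k, :]
                out[i, j, f] = np.sum(patch * w) + b
    return out

x = np.random.rand(5, 5, 3)                               # RGB input, padded to 7x7x3
filters = [np.random.randn(3, 3, 3) for _ in range(2)]    # stand-ins for W0 and W1
print(conv_multichannel(x, filters, biases=[1.0, 0.0]).shape)   # (3, 3, 2)
```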
80. CNN APPLYING AN ACTIVATION FUNCTION
For a neuron at position (p, q) in the hidden layer:
Output(p, q) = f( Σi Σj wij · x(i+p)(j+q) + b )
where:
• x(i+p)(j+q): input values from a 4×4 patch
• wij: weights from the filter
• b: bias term
• f: activation function (e.g., ReLU)
This equation describes:
1. Taking a weighted sum of the input patch
2. Adding a bias
3. Passing the result through an activation function
81. CONVOLUTION LAYER
This is how a convolution layer works with a 3D
input. The input image is of size 32×32×3 (e.g.,
RGB), and the filter size is 5×5×3. The filter always
matches the depth of the input. It slides over the
image spatially (width and height), computing dot
products at each position. This operation helps
extract meaningful features like edges and
textures from the input.
83. CONVOLUTION LAYER
A 5×5×3 filter is applied to a 32×32×3 input image
by taking a small 5×5×3 patch (75 values) and
computing a dot product with the filter weights.
After adding a bias, this gives one output value.
This operation is repeated across the entire
image.
Mathematically, it's represented as w^T x + b, where:
• w^T: transpose of the weight vector
• x: input vector
• b: bias term
84. CONVOLUTION LAYER
A 5×5×3 filter is slid over a 32×32×3 input volume
to compute dot products at each spatial location.
Each computation produces a single number, and
this operation is repeated across the entire input.
The result is a 28×28×1 activation map, where the
reduced spatial dimensions are due to the filter
size and no padding being applied. This activation
map captures local patterns from the input using
the learned filter.
85. CONVOLUTION LAYER
By using a second 5×5×3 filter, an additional
28×28 activation map is generated. Each filter
captures different features from the same input
volume. Stacking the results from both filters
creates multiple activation maps, allowing
deeper representation of the input. This helps
the network learn various patterns like edges,
textures, or colors.
86. CONVOLUTION LAYER
Using 6 filters of size 5×5, each spanning the
full depth of the input (3 channels), produces 6
distinct 28×28 activation maps. These maps are
stacked depth-wise to form a new volume of
size 28×28×6, which serves as the output of the
convolution layer. Each filter learns to detect
different features from the same input.
87. POOLING LAYER
Pooling layers in CNNs shrink the feature map's width and height while keeping important features. A
small window slides over the feature map, summarizing each region.
For a feature map with dimensions H × W, the output dimensions after a pooling layer with window
size f and stride s are:
(⌊(H − f)/s⌋ + 1) × (⌊(W − f)/s⌋ + 1)
Types of Pooling Layers:
• Max Pooling
• Average Pooling
88. SPATIAL DIMENSION - STRIDES
Stride refers to how the filter moves across the input during convolution: it is the
number of pixels the filter shifts at each step. It directly affects the size of the output
(feature map); larger strides result in smaller outputs.
89. SPATIAL DIMENSION - STRIDES
A 5×5×3 filter slides over a 32×32×3 input with a stride of 1. At each location, it computes a dot product,
resulting in one value in the output. This process continues across all spatial positions, generating a
28×28 activation map. The reduction in size is due to the filter not being applied on the border pixels,
which reduces the width and height by 4 (32 - 5 + 1 = 28).
90. MAX POOLING: SHRINKING THE
DIMENSION
• Max Pooling shrinks a feature map by sliding a
small window over it and taking the largest value
in each region. It reduces the size, keeps
important features, and makes the CNN faster
and more efficient.
• A 2x2 window slides over a 4x4 feature map with
a stride of 2. In each region, the largest value is
taken: [1, 1, 5, 6] becomes 6, [2, 4, 7, 8] becomes
8, [3, 2, 1, 2] becomes 3, and [1, 0, 3, 4] becomes
4. The result is a smaller 2x2 feature map [6, 8, 3,
4]. This reduces the size, keeps key features, and
makes the CNN more efficient.
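A small NumPy sketch of pooling that reproduces the max-pooling example above (the 4×4 map is laid out from the regions listed); the same function also covers average pooling:

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    out = np.zeros((fmap.shape[0] // stride, fmap.shape[1] // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
print(pool2d(fmap))             # max pooling -> [[6, 8], [3, 4]]
print(pool2d(fmap, mode="avg")) # average pooling of the same regions
```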
91. AVERAGE POOLING: SHRINKING THE
DIMENSION
• Average Pooling shrinks a feature map by sliding
a small window over it and taking the average
value of each region, reducing size while
summarizing features.
• A 2x2 window slides over a 4x4 feature map with
a stride of 2. In each region, the average value is
taken: [4, 3, 1, 3] becomes 2.8, [1, 5, 4, 8]
becomes 4.5, [4, 5, 6, 5] becomes 5.0, and [4, 3, 9,
4] becomes 5.0. The result is a smaller 2x2
feature map [2.8, 4.5, 5.0, 5.0]. This reduces the
size, summarizes features, and makes the CNN
more efficient.
92. CONVOLUTION – REDUCTION IN FEATURE
DIMENSION
A 3x3 filter slides over a 4x4 input with padding
to keep the output size the same. The input
(4x4) is padded with zeros around the edges,
making it effectively larger, and the filter
processes each region to produce a 4x4 output.
This maintains the size but extracts features,
preparing the data for further steps like pooling.
93. ‘VALID’ PADDING – REDUCTION IN FEATURE
DIMENSION
• Valid padding adds no extra zeros around the
edges, so the convolution reduces the output
size compared to the input.
• A 2x2 filter slides over a 4x4 input with valid
padding, meaning no extra zeros are added
around the edges. With a stride of 1, the
filter produces a 3x3 output: [1.25, 0.5, 0.5,
0.5, 0.75, 1.5, 0.25, 1.25, 1]. This reduces the
size of the feature map while extracting key
features for the CNN.
94. ‘SAME’ PADDING – NO REDUCTION IN FEATURE
DIMENSION
• Same padding adds zeros around the edges so
that the output keeps the same size as the
input.
• A 2x2 filter slides over a 4x4 input with same
padding, adding zeros around the edges to
keep the output size 4x4. With a stride of 1,
the filter produces a 4x4 output: [0.5, 0, 0.25,
0.25, 0, 1.25, 0.5, 0.5, 0, 0.5, 0.75, 1.5, 0.5,
0.25, 1.25, 1]. This preserves the size while
extracting features for the CNN.
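A quick Keras check of 'valid' versus 'same' padding on a 4×4 input; the single filter and random values are illustrative:

```python
import numpy as np
from tensorflow.keras.layers import Conv2D

x = np.random.rand(1, 4, 4, 1)    # one 4x4 single-channel image

valid = Conv2D(1, kernel_size=2, strides=1, padding="valid")(x)
same  = Conv2D(1, kernel_size=2, strides=1, padding="same")(x)
print(valid.shape)   # (1, 3, 3, 1): output shrinks
print(same.shape)    # (1, 4, 4, 1): output size preserved
```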
95. TYPICAL PROCESSING BLOCKS IN A CNN
A CNN takes input feature maps, applies filters in the convolution layer to create convolution
feature maps with features like edges, then uses the pooling layer to shrink them into smaller
pooling feature maps. These layers work together to process images efficiently.
97. CNN ARCHITECTURE
A Convolutional Neural Network (CNN) is made up of layers that automatically learn
features from images. It usually starts with convolution layers that apply filters to
detect patterns like edges or textures. Then, pooling layers reduce the size of the
data while keeping important information. After several layers of convolution and
pooling, the output is flattened into a vector and passed through fully connected
layers, which make the final prediction. CNNs are widely used in image classification
and other computer vision tasks because they can learn to focus on important parts
of the image.
98. FULLY-CONNECTED LAYERS
Fully-connected layers in a CNN are similar to those
in a traditional neural network (MLP). They connect
every neuron in one layer to every neuron in the next.
After convolution and pooling layers extract features,
the fully-connected layer takes these features,
flattens them into a 1D vector, and uses them for final
classification. This part of the network learns patterns
and makes predictions like digit labels or object
categories.
100. CLASSIFICATION WITH FC LAYERS
After the convolution and pooling layers, we add fully connected
layers at the end to enable the network to learn complex patterns from the
feature maps generated by the previous convolutional layers.
• The CNN layers act as a feature extractor.
• The FC layers act as a feature classifier.
102. DROPOUT LAYER
A Dropout layer helps prevent overfitting by
randomly turning off some neurons during
training. This forces the model to learn
more robust features, improving its ability
to generalize well on new, unseen data.
103. PROJECT DEMO
Vehicles Multiclass Classification
This project classifies images into predefined categories using a Convolutional Neural
Network (CNN) trained on the Vehicle Image Classification Dataset.
Dataset:
• The dataset consists of 7 classes: Auto Rickshaws, Bikes, Cars, Motorcycles, Planes, Ships,
Trains
• Total images taken: 5,590
• Split into Training: 3,906, Testing: 839 and Validation: 845
Note: Missing images may result from file corruption, naming conflicts, path errors, split
issues, or preprocessing losses.
105. Train/Validation/Test Split - Sample
Code
Images are split into train, validate, and test sets using train_test_split()
with fixed random_state. Each image is copied into class-wise folders
using shutil.copy2(), preserving structure for model training.
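A hypothetical sketch of this split step; the folder layout, split fractions, and the image-path/label lists are assumptions:

```python
import os, shutil
from sklearn.model_selection import train_test_split

def split_and_copy(image_paths, labels, out_dir="data"):
    # Two splits with a fixed random_state, stratified by class
    train_p, test_p, train_l, test_l = train_test_split(
        image_paths, labels, test_size=0.15, random_state=42, stratify=labels)
    train_p, val_p, train_l, val_l = train_test_split(
        train_p, train_l, test_size=0.18, random_state=42, stratify=train_l)

    for split, paths, labs in [("train", train_p, train_l),
                               ("validate", val_p, val_l),
                               ("test", test_p, test_l)]:
        for path, lab in zip(paths, labs):
            dest = os.path.join(out_dir, split, lab)   # class-wise folders
            os.makedirs(dest, exist_ok=True)
            shutil.copy2(path, dest)                   # copy, preserving metadata
```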
106. Pre-processing - Sample
Code
ImageDataGenerator loads images from specified directories (flow_from_directory),
applies rescale and target_size, and categorizes them into 7 classes as per class_mode.
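A sketch of the generators described above; the directory names, image size, and batch size are assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)   # rescale pixel values to [0, 1]

train_generator = datagen.flow_from_directory(
    "data/train", target_size=(128, 128),
    batch_size=32, class_mode="categorical")      # 7 classes, one-hot labels
validate_generator = datagen.flow_from_directory(
    "data/validate", target_size=(128, 128),
    batch_size=32, class_mode="categorical")
test_generator = datagen.flow_from_directory(
    "data/test", target_size=(128, 128),
    batch_size=32, class_mode="categorical",
    shuffle=False)                                # keep order for evaluation later
```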
107. CNN - Sample
Code
Conv2D layers extract features, MaxPooling2D downsizes, Flatten prepares for Dense layers, Dropout
reduces overfitting, and softmax outputs num_classes probabilities; Adam optimizes with
categorical_crossentropy for multi-class classification.
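A Keras sketch of the described CNN; the filter counts follow the summary on the next slide (32, 64, 128 conv filters and a 128-unit dense layer), but the exact parameter count is not guaranteed to match:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

num_classes = 7
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                                    # prepare for Dense layers
    Dense(128, activation="relu"),
    Dropout(0.5),                                 # reduce overfitting
    Dense(num_classes, activation="softmax"),     # class probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```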
108. CNN - Sample Code
Output
The Model: "sequential" has layers: conv2d (32, 64, 128 filters), max_pooling2d, flatten,
dense (128, 7 units), and dropout; total params: 4,829,319 (trainable params:
4,829,319).
109. Training the Model with Data Generators - Sample
Code
model.fit() runs training on train_generator data for 20 epochs, evaluating
performance on validation_data from validate_generator.
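A sketch of this training call, continuing the earlier generator and model sketches:

```python
# Train for 20 epochs, validating on the validation generator each epoch
history = model.fit(train_generator,
                    epochs=20,
                    validation_data=validate_generator)
```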
110. Model Evaluation - Sample
Code
model.evaluate() assesses the model on test_generator, computing loss with
categorical_crossentropy and accuracy as metrics, processing all batches (35/35).
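A sketch of the evaluation call on the held-out test generator:

```python
# Reports the categorical cross-entropy loss and accuracy over all test batches
test_loss, test_acc = model.evaluate(test_generator)
print(f"Test loss: {test_loss:.3f}, Test accuracy: {test_acc:.3f}")
```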
111. Plotting Training and Validation Metrics - Sample
Code
The code uses plt.figure(figsize=(12,
4)), plt.subplot(), and plt.plot() to
graph history.history['accuracy']
(Train Acc),
history.history['val_accuracy'] (Val
Acc), history.history['loss'] (Train
Loss), and history.history['val_loss']
(Val Loss) with plt.legend() and
plt.title().
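A sketch of this plotting code, using the history object from the training sketch:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history["accuracy"], label="Train Acc")
plt.plot(history.history["val_accuracy"], label="Val Acc")
plt.title("Accuracy")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history["loss"], label="Train Loss")
plt.plot(history.history["val_loss"], label="Val Loss")
plt.title("Loss")
plt.legend()

plt.show()
```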
112. Plotting Training and Validation Metrics - Sample Code
output
The Accuracy plot shows Train Acc rising to ~0.9, Val Acc fluctuating around 0.7-0.8, indicating
overfitting; the Loss plot shows Train Loss dropping to ~0.2, Val Loss decreasing but unstable at
~0.4, suggesting inconsistent validation performance.
113. Confusion Matrix - Sample
Code
confusion_matrix compares
true_classes and predicted_classes,
sns.heatmap visualizes errors with
class_labels on axes, and
classification_report provides
precision, recall, and F1-score per
class.
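A sketch of this evaluation step; it reuses model and test_generator from the earlier sketches (shuffle=False on the test generator keeps true_classes aligned with the predictions):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report

predictions = model.predict(test_generator)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes
class_labels = list(test_generator.class_indices.keys())

cm = confusion_matrix(true_classes, predicted_classes)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()

# Precision, recall, and F1-score per class
print(classification_report(true_classes, predicted_classes,
                            target_names=class_labels))
```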
114. Confusion Matrix - Sample Code
Output
The Confusion Matrix displays
prediction accuracy for 7
classes; high diagonal values
(e.g., 142 Auto Rickshaws, 159
Bikes) show correct predictions,
but errors occur, like 26 Planes
and 22 Trains misclassified as
Ships.
115. Classification Report - Sample Code
Output
The Classification Report shows Bikes
(0.97) and Motorcycles (0.91) with
highest f1-scores, Planes (0.81) and
Trains (0.83) lowest; overall accuracy is
0.87 for 1,116 samples.
116. Testing - Sample
Code
It processes a single image for
classification, rescaling it and predicting
its class as "Bikes" using the trained
model. The prediction is correct.
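A sketch of this single-image test; the file path is a placeholder and class_labels comes from the earlier sketch:

```python
import numpy as np
from tensorflow.keras.preprocessing import image

img = image.load_img("sample_bike.jpg", target_size=(128, 128))
x = image.img_to_array(img) / 255.0    # rescale like the generators
x = np.expand_dims(x, axis=0)          # add the batch dimension

probs = model.predict(x)[0]
print(class_labels[np.argmax(probs)])  # e.g. "Bikes"
```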