1. MACHINE LEARNING
(22ISE62)
Module - 4
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering
2. Course Outcomes
After Completion of the course, student will be able to:
22ISE62.1: Describe the machine learning techniques, their types and data analysis framework.
22ISE62.2: Apply mathematical concepts for feature engineering and perform dimensionality
reduction to enhance model performance.
22ISE62.3: Develop similarity-based learning models and regression models for solving
classification and prediction tasks.
22ISE62.4: Build probabilistic learning models and design neural network models using perceptron
and multilayer architectures.
22ISE62.5: Utilize clustering algorithms to identify patterns in data and implement reinforcement
learning techniques.
Text Book:
1. S. Sridhar and M. Vijayalakshmi, Machine Learning, Oxford University Press, 2021, First Edition.
2. M. N. Murty and V. S. Ananthanarayana, Machine Learning: Theory and Practice, Universities Press, 2024.
3. T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
4. Andriy Burkov, The Hundred-Page Machine Learning Book, Vol. 1, Quebec City, QC, Canada: Andriy Burkov, 2019.
3. Module 4: Bayesian Learning
• Bayesian Learning is a learning method that describes and represents knowledge in an uncertain
domain and provides a way to reason about this knowledge using probability measure.
• It uses Bayes theorem to infer the unknown parameters of a model.
• Bayesian inference is useful in many applications which involve reasoning and diagnosis such as game
theory, medicine, etc.
• Bayesian inference is much more powerful in handling missing data and for estimating any uncertainty
in predictions.
Introduction to Probability-based Learning
• Probability-based learning is one of the most important practical learning methods which combines
prior knowledge or prior probabilities with observed data.
• Probabilistic learning uses the concept of probability theory that describes how to model randomness,
uncertainty, and noise to predict future events.
• It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, predict and
learn from data.
• In a probabilistic model, randomness plays a major role and the solution is a probability distribution, whereas a deterministic model involves no randomness: for the same initial conditions, every run of the model produces the same single outcome.
4. Fundamentals of Bayes Theorem
Bayes Theorem - A formula that describes the probability of an event, given that another event has
already occurred.
In machine learning, it's crucial for updating probabilities based on new evidence and making predictions in
situations where there's uncertainty.
Prior Probability
It is the general probability of an uncertain event before an observation is seen, or some evidence is
collected. It is the initial probability that is believed before any new information is collected.
Likelihood Probability
• Likelihood probability is the relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis.
• It is stated as P (Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence
given the parameters.
Posterior Probability
• It is the updated or revised probability of an event taking into account the observations from the training
data. P (Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis,
given the evidence from the training data.
• Therefore, the posterior probability is the prior probability updated with the likelihood of the new evidence (posterior ∝ likelihood × prior).
5. Classification Using Bayes Model
• Bayes theorem calculates the conditional probability of an event A given that event B has occurred, P(A|B).
• Generally, Bayes theorem is used to select the most probable hypothesis from data, considering
both prior knowledge and posterior distributions.
• It is based on the calculation of the posterior probability and is stated as:
P (Hypothesis h | Evidence E)
• where, Hypothesis h is the target class to be classified and Evidence E is the given test instance.
• P (Hypothesis h| Evidence E) is calculated from the prior probability P (Hypothesis h), the
likelihood probability P (Evidence E |Hypothesis h) and the marginal probability P (Evidence E).
• It can be written as:
P (Hypothesis h | Evidence E) = [P (Evidence E | Hypothesis h) × P (Hypothesis h)] / P (Evidence E)
where, P (Hypothesis h) is the prior probability of the hypothesis h without observing the training
data or considering any evidence.
P (Evidence E | Hypothesis h) is the likelihood probability of Evidence E given Hypothesis h.
6. Conti..
Maximum A Posteriori (MAP) Hypothesis, ℎ𝑀𝐴𝑃 :
• Hypothesis is a proposed explanation or assumption about the relationship between input data
(features) and output predictions.
• It's a model or mapping that the algorithm uses to predict outcomes based on given inputs.
• This most probable hypothesis is called the Maximum A Posteriori Hypothesis ℎ𝑀𝐴𝑃. Bayes theorem Eq.
can be used to find the ℎ𝑀𝐴𝑃.
h_MAP = argmax_{h ∈ H} P (Hypothesis h | Evidence E) = argmax_{h ∈ H} P (Evidence E | Hypothesis h) P (Hypothesis h)
Maximum Likelihood (ML) Hypothesis, ℎ𝑀𝐿 :
• Given a set of candidate hypotheses, if every hypothesis is equally probable, only P (E | h) is used to find
the most probable hypothesis.
• The hypothesis that gives the maximum likelihood for P (E | h) is called the Maximum Likelihood (ML)
Hypothesis, ℎ𝑀𝐿.
h_ML = argmax_{h ∈ H} P (Evidence E | Hypothesis h)
7. Conti..
• Correctness of Bayes theorem: Consider two events A and B in a sample space S.
A: T F T T F T T F
B: F T T F T F T F
Solution:
P (A) = 5/8
P (B) = 4/8
P (A | B) = P (A ∩ B)/P (B) = 2/4
P (B | A) = P (A ∩ B)/P (A) = 2/5
Bayes theorem gives back the same values:
P (A | B) = P (B | A) P (A)/P (B) = (2/5 × 5/8)/(4/8) = 2/4
P (B | A) = P (A | B) P (B)/P (A) = (2/4 × 4/8)/(5/8) = 2/5
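The snippet below is a small illustrative Python check (not from the textbook) that encodes the T/F outcomes of A and B above and verifies that Bayes theorem reproduces the conditional probabilities.

# Sample space of 8 outcomes; True marks where each event occurs.
A = [True, False, True, True, False, True, True, False]
B = [False, True, True, False, True, False, True, False]

n = len(A)
p_a = sum(A) / n                                  # P(A) = 5/8
p_b = sum(B) / n                                  # P(B) = 4/8
p_ab = sum(a and b for a, b in zip(A, B)) / n     # P(A ∩ B) = 2/8

p_a_given_b = p_ab / p_b                          # 2/4
p_b_given_a = p_ab / p_a                          # 2/5

# Bayes theorem gives back the same conditional probabilities.
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
assert abs(p_b_given_a - p_a_given_b * p_b / p_a) < 1e-12
print(p_a_given_b, p_b_given_a)                   # 0.5 0.4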
8. Conti..
• Problem 2: Consider a boy who has a volleyball tournament on the next day, but today he feels
sick. Since he is a healthy boy, there is only a 40% chance that he would fall sick.
Now, find the probability of the boy participating in the tournament given that he is sick. The boy is very much
interested in volleyball, so there is a 90% probability that he would participate in tournaments
and a 20% probability that he falls sick given that he participates in the tournament.
• Solution:
P (Boy participating in the tournament) = 90%
P (He is sick | Boy participating in the tournament) = 20%
P (He is Sick) = 40%
The probability of the boy participating in the tournament given that he is sick is:
P (Boy participating in the tournament | He is sick) = P (Boy participating in the tournament) × P
(He is sick | Boy participating in the tournament)/P (He is Sick)
P (Boy participating in the tournament | He is sick) = (0.9 × 0.2)/0.4 = 0.45
Hence, 45% is the probability that the boy will participate in the tournament given that he is sick.
9. NAÏVE BAYES ALGORITHM
• It is a supervised classification algorithm for binary-class or multi-class problems that works on the principle of Bayes theorem.
• It's considered "naive" because it makes a strong, often unrealistic, assumption: that features are conditionally
independent given the class label.
• This means that the presence or absence of one feature doesn't affect the presence or absence of other
features, which simplifies the calculations.
• It works particularly well for large datasets and is very fast. It is one of the most effective and simple classification algorithms.
• The algorithm treats all features as independent of each other given the class, even though they may individually depend on the class being predicted.
• Each feature contributes a probability value independently during classification, and hence the algorithm is called naïve.
• Some important applications of these algorithms are text classification, recommendation system and face
recognition.
Algorithm: Naïve Bayes
1. Compute the prior probability for the target class.
2. Compute the frequency matrix and likelihood probability for each feature.
3. Use Bayes theorem Eq. (8.1) to calculate the posterior probability of all hypotheses.
4. Use the Maximum A Posteriori (MAP) hypothesis h_MAP to classify the test object (see the sketch below).
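A minimal Python sketch of these four steps for categorical features follows; the function names and data layout are illustrative, not from the textbook.

from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    # Steps 1-2: prior probabilities and per-feature likelihood tables.
    n = len(y)
    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    likelihoods = defaultdict(dict)   # (feature index, value) -> {class: P(value | class)}
    for c in priors:
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(len(X[0])):
            for value, cnt in Counter(row[i] for row in rows).items():
                likelihoods[(i, value)][c] = cnt / len(rows)
    return priors, likelihoods

def classify(test, priors, likelihoods):
    # Steps 3-4: posterior score (numerator of Bayes theorem) per class, then MAP.
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, value in enumerate(test):
            p *= likelihoods.get((i, value), {}).get(c, 0.0)
        scores[c] = p
    return max(scores, key=scores.get), scores

Applied to the Table 8.1 data with the test instance (CGPA ≥ 9, Interactiveness = Yes, Practical knowledge = Average, Communication skill = Good), this reproduces, up to rounding, the scores 0.0175 for 'Job Offer = Yes' and 0.0074 for 'Job Offer = No' worked out in Example 8.2 that follows.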
10. Conti..
Example 8.2: Assess a student’s performance using Naïve Bayes algorithm with the dataset
provided in Table 8.1. Predict whether a student gets a job offer or not in his final year of the
course.
Table 8.1: Training Dataset
Sl. No.  CGPA  Interactiveness  Practical knowledge  Communication skill  Job offer
1   ≥ 9  Yes  Very good  Good      Yes
2   ≥ 8  No   Good       Moderate  Yes
3   ≥ 9  No   Average    Poor      No
4   < 8  No   Average    Good      No
5   ≥ 8  Yes  Good       Moderate  Yes
6   ≥ 9  Yes  Good       Moderate  Yes
7   < 8  Yes  Good       Poor      No
8   ≥ 9  No   Very good  Good      Yes
9   ≥ 8  Yes  Good       Good      Yes
10  ≥ 8  Yes  Average    Good      Yes
11. Conti..
Solution:
Step 1: Compute the prior probability for the target feature ‘Job Offer’. The target feature ‘Job Offer’ has
two classes, ‘Yes’ and ‘No’. It is a binary classification problem. Given a student instance, we need to classify
whether ‘Job Offer = Yes’ or ‘Job Offer = No’.
From the training dataset, we observe that the frequency or the number of instances with ‘Job Offer = Yes’
is 7 and ‘Job Offer = No’ is 3.
The prior probability for the target feature is calculated by dividing the number of instances belonging to a
particular target class by the total number of instances.
Hence, the prior probability for ‘Job Offer = Yes’ is 7/10 and ‘Job Offer = No’ is 3/10 as shown in Table 8.2.
Table 8.2: Frequency Matrix and Prior Probability of Job Offer
Step 2: Compute Frequency matrix and Likelihood Probability for each of the feature.
Step 2(a): Feature – CGPA
Job offer classes No. of instances Probability
Yes 7 P(Job offer = Yes) = 7/10
No 3 P(Job offer = No) = 3/10
12. Conti..
Table 8.3 shows the frequency matrix for the feature CGPA.
Table 8.4 shows how the likelihood probability is calculated for CGPA using conditional probability.
Table 8.4: Likelihood Probability of CGPA.
From the Table 8.3 Frequency Matrix of CGPA, number of instances with ‘CGPA ≥9’ and ‘Job Offer = Yes’ is 3.
The total number of instances with ‘Job Offer = Yes’ is 7.
Hence, P (CGPA ≥9 | Job Offer = Yes) = 3/7.
CGPA Job offer = Yes Job offer = No
≥ 9 3 1
≥ 8 4 0
<8 0 2
Total 7 3
CGPA P(Job offer = Yes) P(Job offer = No)
≥ 9 P (CGPA ≥9 | Job Offer = Yes) = 3/7 P (CGPA ≥9 | Job Offer = No) = 1/3
≥ 8 P (CGPA ≥8 | Job Offer = Yes) = 4/7 P (CGPA ≥8 | Job Offer = No) = 0/3
<8 P (CGPA <8 | Job Offer = Yes) = 0/7 P (CGPA <8 | Job Offer = No) = 2/3
13. Conti..
• Step 2(b): Feature – Interactiveness
Table 8.5 shows the frequency matrix for the feature Interactiveness.
Table 8.5: Frequency Matrix of Interactiveness
Table 8.6 shows how the likelihood probability is calculated for Interactiveness using conditional probability.
Table 8.6: Likelihood Probability of Interactiveness
Interactiveness Job offer = Yes Job offer = No
Yes 5 1
No 2 2
Total 7 3
Interactiveness  P(Job offer = Yes)  P(Job offer = No)
Yes  P (Interactiveness = Yes | Job Offer = Yes) = 5/7   P (Interactiveness = Yes | Job Offer = No) = 1/3
No   P (Interactiveness = No | Job Offer = Yes) = 2/7    P (Interactiveness = No | Job Offer = No) = 2/3
14. Conti..
Step 2(c): Feature – Practical Knowledge
Table 8.7 shows the frequency matrix for the feature Practical Knowledge.
Table 8.7: Frequency Matrix of Practical Knowledge
Table 8.8: Likelihood Probability of Practical Knowledge
Practical knowledge Job offer = Yes Job offer = No
Very good 2 0
Average 1 2
Good 4 1
Total 7 3
Practical knowledge  P(Job offer = Yes)  P(Job offer = No)
Very good  P (Practical Knowledge = Very good | Job Offer = Yes) = 2/7   P (Practical Knowledge = Very good | Job Offer = No) = 0/3
Average    P (Practical Knowledge = Average | Job Offer = Yes) = 1/7     P (Practical Knowledge = Average | Job Offer = No) = 2/3
Good       P (Practical Knowledge = Good | Job Offer = Yes) = 4/7        P (Practical Knowledge = Good | Job Offer = No) = 1/3
15. Conti..
Step 2(d): Feature – Communication Skills
Table 8.9 shows the frequency matrix for the feature Communication Skills.
Table 8.9: Frequency Matrix of Communication Skills
Table 8.10: Likelihood Probability of Communication Skills
Communication skill Job offer = Yes Job offer = No
Good 4 1
Moderate 3 0
Poor 0 2
Total 7 3
Communication skill  P(Job offer = Yes)  P(Job offer = No)
Good      P (Communication Skills = Good | Job Offer = Yes) = 4/7      P (Communication Skills = Good | Job Offer = No) = 1/3
Moderate  P (Communication Skills = Moderate | Job Offer = Yes) = 3/7  P (Communication Skills = Moderate | Job Offer = No) = 0/3
Poor      P (Communication Skills = Poor | Job Offer = Yes) = 0/7      P (Communication Skills = Poor | Job Offer = No) = 2/3
16. Conti..
Step 3: Use Bayes theorem,
P (Hypothesis h | Evidence E) = P (Evidence E | Hypothesis h) P (Hypothesis h) / P (Evidence E),
to calculate the probability of all hypotheses.
Given the test data = (CGPA ≥ 9, Interactiveness = Yes, Practical knowledge = Average, Communication Skills = Good):
P (Job Offer = Yes | Test data) = P (CGPA ≥ 9 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Practical knowledge = Average | Job Offer = Yes) × P (Communication Skills = Good | Job Offer = Yes) × P (Job Offer = Yes) / P (Test data)
Ignoring the common denominator P (Test data), the numerator is
= 3/7 × 5/7 × 1/7 × 4/7 × 7/10 = 0.0175
Similarly, for the other case 'Job Offer = No':
P (Job Offer = No | Test data) = P (CGPA ≥ 9 | Job Offer = No) × P (Interactiveness = Yes | Job Offer = No) × P (Practical knowledge = Average | Job Offer = No) × P (Communication Skills = Good | Job Offer = No) × P (Job Offer = No) / P (Test data)
Ignoring the common denominator, the numerator is
= 1/3 × 1/3 × 2/3 × 1/3 × 3/10 = 0.0074
Step 4: Use Maximum A Posteriori (MAP) Hypothesis, ℎ𝑀𝐴𝑃 to classify the test object to the hypothesis with the highest
probability.
Since P (Job Offer = Yes | Test data) has the higher probability value (0.0175 > 0.0074), the test data is classified as 'Job Offer = Yes'.
17. Conti…
Zero Probability Error
• In the previous problem data set, consider the test data to be (CGPA ≥8, Interactiveness = Yes, Practical knowledge =
Average, Communication Skills = Good)
When computing the posterior probability for 'Job Offer = Yes':
P (Job Offer = Yes | Test data) = P (CGPA ≥ 8 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Practical knowledge = Average | Job Offer = Yes) × P (Communication Skills = Good | Job Offer = Yes) × P (Job Offer = Yes)
= 4/7 × 5/7 × 1/7 × 4/7 × 7/10 = 0.0233
Similarly, for the other case ‘Job Offer = No’,
When we compute the probability:
P (Job Offer = No | Test data) = P (CGPA ≥ 8 | Job Offer = No) × P (Interactiveness = Yes | Job Offer = No) × P (Practical knowledge = Average | Job Offer = No) × P (Communication Skills = Good | Job Offer = No) × P (Job Offer = No)
= 0/3 × 1/3 × 2/3 × 1/3 × 3/10
= 0
Since the probability value is zero, the model fails to predict, and this is called the zero-probability error.
This problem arises because there are no instances in the given Table 8.1 for the attribute value CGPA ≥8 and Job Offer = No
and hence the probability value of this case is zero.
18. Conti..
This zero-probability error can be solved by applying a smoothing technique called Laplace correction. For example, given 1000 data instances in the training dataset, if there are zero instances for a particular value of a feature, we can add one instance for each attribute-value pair of that feature. This makes little difference for 1000 data instances, and the overall probability no longer becomes zero.
Table 8.11: Scaled Values to 1000 without Laplace Correction
Now, add 1 instance for each CGPA-value pair for 'Job Offer = No'. Then,
P (CGPA ≥ 9 | Job Offer = No) = 101/303 = 0.333
P (CGPA ≥ 8 | Job Offer = No) = 1/303 = 0.0033
P (CGPA < 8 | Job Offer = No) = 201/303 = 0.6634
With values scaled to 1003 data instances, we get
P (Job Offer = Yes | Test data) = P (CGPA ≥ 8 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Practical knowledge = Average | Job Offer = Yes) × P (Communication Skills = Good | Job Offer = Yes) × P (Job Offer = Yes)
= 400/700 × 500/700 × 100/700 × 400/700 × 700/1003 = 0.02325
CGPA  P (Job Offer = Yes)  P (Job Offer = No)
≥9 P (CGPA ≥9 | Job Offer = Yes) = 300/700 P (CGPA ≥9 | Job Offer = No) = 100/300
≥8 P (CGPA ≥8 | Job Offer = Yes) = 400/700 P (CGPA ≥8 | Job Offer = No) = 0/300
<8 P (CGPA <8 | Job Offer = Yes) = 0/700 P (CGPA <8 | Job Offer = No) = 200/300
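As an alternative to scaling the counts to 1000, the sketch below shows the usual add-one form of Laplace correction applied directly to the CGPA counts for 'Job Offer = No'; the helper function and names are illustrative.

def laplace_likelihood(count, class_total, n_values, alpha=1):
    # Add-one (Laplace) smoothed estimate of P(value | class):
    # count is the raw frequency, class_total the class size,
    # n_values the number of distinct values the feature can take.
    return (count + alpha) / (class_total + alpha * n_values)

# CGPA counts for Job Offer = No (Table 8.3): >= 9 -> 1, >= 8 -> 0, < 8 -> 2
for value, count in [(">= 9", 1), (">= 8", 0), ("< 8", 2)]:
    print(value, laplace_likelihood(count, class_total=3, n_values=3))
# >= 9: 0.333..., >= 8: 0.1667, < 8: 0.5 -- no likelihood is exactly zero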
19. Problem 1: Apply the naïve Bayes classifier to a concept learning problem, classifying days according to whether someone will play tennis. Classify the new instance {Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong}.
Day Outlook Temperature Humidity Wind Play_Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
20. Cont…
Problem 2: Estimate the conditional probabilities of each attributes {color, legs, height, smelly} for the species classes
{M,H} using the data set given in the table. Using these probabilities estimate the probability values for the new instance
{color=green, legs=2, height=tall and smelly=No}.
No Color Legs Height Smelly Species
1 White 3 Short Yes M
2 Green 2 Tall No M
3 Green 3 Short Yes M
4 White 3 Short Yes M
5 Green 2 Short No H
6 White 2 Tall No H
7 White 2 Tall No H
8 White 2 Short Yes H
21. Bayes Optimal Classifier
• The Bayes optimal classifier is a probabilistic model which uses Bayes theorem to find the most probable classification for a new instance, given the training data, by combining the predictions of all hypotheses weighted by their posterior probabilities.
• This is different from the Maximum A Posteriori (MAP) hypothesis h_MAP, which chooses only the single most probable hypothesis.
• Here, a new instance is assigned the classification value
C_i = argmax_{C_i} Σ_{h_i ∈ H} P(C_i | h_i) P(h_i | T)
22. Conti..
• Example 8.3: Given a hypothesis space with 4 hypotheses h1, h2, h3 and h4, determine whether the patient is diagnosed as COVID positive or COVID negative using the Bayes optimal classifier.
Table 8.12: Posterior Probability Values
Solution: ℎ𝑀𝐴𝑃 chooses ℎ1 which has the maximum probability value 0.3 as the solution and gives the result
that the patient is COVID negative. But Bayes Optimal classifier combines the predictions of ℎ2, ℎ3 and ℎ4
which is 0.4 and gives the result that the patient is COVID positive.
Σ_{h_i ∈ H} P(COVID Negative | h_i) P(h_i | T) = 0.3 × 1 = 0.3
Σ_{h_i ∈ H} P(COVID Positive | h_i) P(h_i | T) = 0.1 × 1 + 0.2 × 1 + 0.1 × 1 = 0.4
Therefore, argmax_{C_i ∈ {COVID Positive, COVID Negative}} Σ_{h_i ∈ H} P(C_i | h_i) P(h_i | T) = COVID Positive
Thus, this algorithm diagnoses the new instance to be COVID positive.
P(h_i | T)  P(COVID Positive | h_i)  P(COVID Negative | h_i)
0.3  0  1
0.1  1  0
0.2  1  0
0.1  1  0
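A small Python sketch of this combination, using the values of Table 8.12, is given below as an illustration (not textbook code).

# Posterior P(h_i | T) and per-hypothesis class predictions from Table 8.12.
posterior  = [0.3, 0.1, 0.2, 0.1]
p_positive = [0.0, 1.0, 1.0, 1.0]    # P(COVID Positive | h_i)
p_negative = [1.0, 0.0, 0.0, 0.0]    # P(COVID Negative | h_i)

score_pos = sum(p * q for p, q in zip(p_positive, posterior))   # 0.4
score_neg = sum(p * q for p, q in zip(p_negative, posterior))   # 0.3
print("COVID Positive" if score_pos > score_neg else "COVID Negative")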
23. NAÏVE BAYES ALGORITHM FOR CONTINUOUS ATTRIBUTES
1. Gaussian Naive Bayes:
• This approach assumes that the probability distribution of each continuous attribute is a
Gaussian (normal) distribution.
• It calculates the probability of a given feature value belonging to a specific class based on the
Gaussian probability density function (PDF).
• The mean (μ) and standard deviation (σ) of the Gaussian distribution are estimated from the
training data for each class.
• The likelihood of a feature value x given a class y is calculated using the Gaussian formula:
P(x | y) = (1 / (σ √(2π))) × exp( −(x − μ)² / (2σ²) )
Where:
P(x|y) is the likelihood of feature value x given class y
μ is the mean of the Gaussian distribution for class y
σ is the standard deviation of the Gaussian distribution for class y
exp is the exponential function
π = 3.14159
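The sketch below implements this density in Python; the numeric values in the usage comment come from the continuous-attribute example later in this module, and the function name is only illustrative.

import math

def gaussian_likelihood(x, mu, sigma):
    # P(x | y) under a normal distribution with class mean mu and
    # class standard deviation sigma.
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# With mu = 8.814 and sigma = 0.581 (class 'Job Offer = Yes' in Table 8.13),
# a CGPA of 8.5 gives a likelihood of about 0.594.
print(round(gaussian_likelihood(8.5, 8.814, 0.581), 3))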
24. Conti..
2. Discretization:
• Continuous attributes can be converted into discrete categories by creating intervals or bins.
• For example, a temperature value could be classified into categories like "low", "medium", or "high".
• Different discretization methods can be used, such as:
• Equal-width binning: Dividing the range of the attribute into equal-sized intervals.
• Equal-frequency binning: Dividing the range into intervals such that each interval contains the same
number of data points.
• Quartiles: Dividing the data into four groups based on percentiles (25th, 50th, 75th).
• Once discretized, the attributes can be treated as discrete variables in the Naïve Bayes algorithm, as in the sketch below.
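A minimal equal-width binning sketch with NumPy follows; the temperature values and bin labels are illustrative assumptions.

import numpy as np

temps = np.array([12.0, 18.5, 21.0, 24.5, 29.0, 33.5, 37.0])

# Equal-width binning: split the observed range into 3 equal intervals.
edges = np.linspace(temps.min(), temps.max(), num=4)   # 4 edges -> 3 bins
labels = np.array(["low", "medium", "high"])
bins = np.digitize(temps, edges[1:-1])                  # bin index 0, 1 or 2 per value
print(list(zip(temps.tolist(), labels[bins].tolist())))

Equal-frequency binning could instead use np.quantile to choose the edges so that each bin holds roughly the same number of points.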
25. Conti..
Problem 1: Assess a student’s performance using Naïve Bayes algorithm for the continuous attribute.
Predict whether a student gets a job offer or not in his final year of the course. The training dataset T
consists of 10 data instances with attributes such as ‘CGPA’ and ‘Interactiveness’ as shown in Table 8.13. The
target variable is Job Offer which is classified as Yes or No for a candidate student.
Table 8.13: Training Dataset with Continuous Attribute
Sl. No.  CGPA  Interactiveness  Job offer
1 9.5 Yes Yes
2 8.2 No Yes
3 9.3 No No
4 7.6 No No
5 8.4 Yes Yes
6 9.1 Yes Yes
7 7.5 Yes No
8 9.6 No Yes
9 8.6 Yes Yes
10 8.3 Yes Yes
26. Conti..
• Solution: Step 1: Compute the prior probability for the target feature ‘Job Offer’.
• Table 8.14: Prior Probability of the Target Class
• Step 2: Compute Frequency matrix and Likelihood Probability for each of the feature.
• Gaussian distribution for continuous feature is calculated using the given formula,
P(X_i = x_k | C_j) = g(x_k, μ_ij, σ_ij)
where, 𝑋𝑖is the 𝑖𝑡ℎ continuous attribute in the given dataset and 𝑥𝑘 is a value of the attribute.
𝐶𝑗 denotes the j th class of the target feature. 𝜇𝑖𝑗 denotes the mean of the values of that continuous attribute 𝑋𝑖
with respect to the class j of the target feature. 𝜎𝑖𝑗 denotes the standard deviation of the values of that
continuous attribute 𝑋𝑖 with respect to the class j of the target feature. Hence, the normal distribution formula is
given as:
P(X_i = x_k | C_j) = (1 / (σ_ij √(2π))) × exp( −(x_k − μ_ij)² / (2σ_ij²) )
Job offer classes No. of instances Probability value
Yes 7 P (Job Offer = Yes) = 7/10
No 3 P (Job Offer = No) = 3/10
27. Conti..
Step 2(a): Consider the feature CGPA
To calculate the likelihood probability for this continuous attribute, first compute the mean and standard
deviation for CGPA with respect to the target class ‘Job Offer’.
Here, 𝑋𝑖 = CGPA
C_j = 'Job Offer = Yes'. Mean and standard deviation for class 'Job Offer = Yes' are given as:
Mean µ_ij = µ_CGPA−YES = (9.5 + 8.2 + 8.4 + 9.1 + 9.6 + 8.6 + 8.3)/7 = 8.814286
σ_ij = σ_CGPA−YES = √( Σ (x_i − µ)² / (N − 1) ) = 0.58146
Mean and Standard Deviation for class ‘Job Offer = No’ are given as:
Cj = ‘Job Offer = No’
µij = µCGPA − NO = 8.13333
σij = σCGPA − NO = 1.011599
Once Mean and Standard Deviation are computed, the likelihood probability for any test value using Gaussian
distribution formula can be calculated.
28. Conti..
Step 2(b): Consider the feature Interactiveness
Table 8.15: Frequency Matrix of Interactiveness
Table 8.16 shows how the likelihood probability is calculated for Interactiveness using conditional
probability.
Table 8.16: Likelihood Probability of Interactiveness
Interactiveness Job offer = Yes Job offer = No
Yes 5 1
No 2 2
Total 7 3
Interactiveness  P(Job offer = Yes)  P(Job offer = No)
Yes  P (Interactiveness = Yes | Job Offer = Yes) = 5/7   P (Interactiveness = Yes | Job Offer = No) = 1/3
No   P (Interactiveness = No | Job Offer = Yes) = 2/7    P (Interactiveness = No | Job Offer = No) = 2/3
29. Conti..
Step 3: Use Bayes theorem to calculate the probability of all hypotheses.
Consider the test data to be (CGPA = 8.5, Interactiveness = Yes).
For the hypothesis ‘Job Offer = Yes’:
P (Job Offer = Yes | Test data) = P (CGPA = 8.5 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Job Offer = Yes)
To compute P (CGPA = 8.5 | Job Offer = Yes) use Gaussian distribution formula:
P(X_i = x_k | C_j) = g(x_k, μ_ij, σ_ij)
P(X_CGPA = 8.5 | C_{Job offer = Yes}) = (1 / (σ_CGPA−YES √(2π))) × exp( −(8.5 − μ_CGPA−YES)² / (2σ_CGPA−YES²) )
P(CGPA = 8.5 | Job offer = Yes) = g(x_k = 8.5, μ_ij = 8.814, σ_ij = 0.581) = (1 / (0.581 √(2π))) × exp( −(8.5 − 8.814)² / (2 × 0.581²) ) = 0.594
P (Interactiveness = Yes|Job Offer = Yes) = 5/7
P (Job Offer = Yes) = 7/10
Hence: P (Job Offer = Yes | Test data) = P (CGPA = 8.5 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Job Offer = Yes) = 0.594 × 5/7 × 7/10 = 0.297
30. Conti..
Similarly, for the hypothesis ‘Job Offer = No’:
P (Job Offer = No | Test data) = P (CGPA = 8.5 | Job Offer = No) × P (Interactiveness = Yes | Job Offer = No) ×
P (Job Offer = No)
P(CGPA = 8.5 | Job offer = No) = g(x_k = 8.5, μ_ij = 8.133, σ_ij = 1.0116)
P(X_CGPA = 8.5 | C_{Job offer = No}) = (1 / (σ_CGPA−NO √(2π))) × exp( −(8.5 − μ_CGPA−NO)² / (2σ_CGPA−NO²) )
= (1 / (1.0116 √(2π))) × exp( −(8.5 − 8.133)² / (2 × 1.0116²) ) = 0.369
P (Interactiveness = Yes | Job Offer = No) = 1/3
P (Job Offer = No) = 3/10
Hence,
P (Job Offer = No | Test data) = P (CGPA = 8.5 | Job Offer = No) P (Interactiveness = Yes | Job Offer = No) × P
(Job Offer = No) = 0.369 × 1/3 × 3/10 = 0.0369
Step 4: Use Maximum A Posteriori (MAP) Hypothesis, ℎ𝑀𝐴𝑃 to classify the test object to the hypothesis with
the highest probability.
Since P (Job Offer = Yes | Test data) has the higher probability value of 0.297, the test data is classified as 'Job Offer = Yes'.
31. Conti..
Problem 2: Take a real-time example of predicting the result of a student using Naïve Bayes algorithm. The
training dataset T consists of 8 data instances with attributes such as ‘Assessment’, ‘Assignment’, ‘Project’ and
‘Seminar’ as shown in Table 8.17. The target variable is Result which is classified as Pass or Fail for a candidate
student. Given a test data to be (Assessment = Average, Assignment = Yes, Project = No and Seminar = Good),
predict the result of the student.
Table 8.17: Training Dataset
P(Pass | data) ∝ 5/8 × 1/5 × 2/5 × 1/5 × 3/5 ≈ 0.006
P(Fail | data) ∝ 3/8 × 3/3 × 2/3 × 2/3 × 1/3 ≈ 0.056
• Prediction: Fail
Sl. No. Assessment Assignment Project Seminar Result
1 Good Yes Yes Good Pass
2 Average Yes No Poor Fail
3 Good No Yes Good Pass
4 Average No No Poor Fail
5 Average No Yes Good Pass
6 Good No No Poor Pass
7 Average Yes Yes Good Fail
8 Good Yes Yes Poor Pass
32. Conti..
Problem 3: Take a real-time example of predicting whether a car is stolen using the Naïve Bayes algorithm. The training dataset T consists of 10 data instances with attributes 'Color', 'Type' and 'Origin' as shown in the table. The target variable is Stolen, which is classified as Yes or No.
Table 4.12: Dataset
Example No. Color Type Origin Stolen
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
33. Artificial Neural Networks
• A neural network is a type of machine learning algorithm inspired by the human brain.
• It's a powerful tool that excels at solving complex problems that are difficult for traditional computer algorithms to handle, such as image recognition and natural language processing.
• Artificial Neural Networks (ANNs) are computational models inspired by the biological neural networks of the
human brain.
• They are used in machine learning to analyze data and make predictions by processing information through
interconnected nodes, or "neurons".
• The human brain consists of a mass of neurons that are all connected as a network, which is effectively a directed graph.
• These neurons are the processing units which receive information, process it and then transmit it to other neurons, allowing humans to learn almost any task.
• ANN is a learning mechanism that models a human brain to solve any non-linear and complex problem. Each
neuron is modelled as a computing unit, or simply called as a node in ANN, that is capable of doing complex
calculations.
• ANN is a system that consists of many such computing units operating in parallel that can learn from
observations.
• Some typical applications of ANN in the field of computer science are Natural Language Processing (NLP),
pattern recognition, face recognition, speech recognition, character recognition, text processing, stock
prediction, computer vision, etc.
• ANNs also have been considerably used in other engineering fields such as Chemical industry, Medicine,
Robotics, Communications, Banking, and Marketing.
34. Conti..
• The human nervous system has billions of neurons; these are the processing units which make humans perceive things, hear, see and smell.
• It makes us remember, recognize and correlate things around us. It is a learning system that consists of functional units called nerve cells, typically called neurons.
• The human nervous system is divided into two sections called the Central Nervous System (CNS)
and the Peripheral Nervous System (PNS).
• The brain and the spinal cord constitute the CNS and the neurons inside and outside the CNS
constitute the PNS. The neurons are basically classified into three types called sensory neurons,
motor neurons and interneurons.
• Sensory neurons get information from different parts of the body and bring it into the CNS,
whereas motor neurons receive information from other neurons and transmit commands to the
body parts.
• The CNS consists of only interneurons which connect one neuron to another neuron by
receiving information from one neuron and transmitting it to another.
• The basic functionality of a neuron is to receive information, process it and then transmit it to
another neuron or to a body part.
35. Biological Neurons
A typical biological neuron has four parts called dendrites, soma, axon and synapse.
The body of the neuron is called the soma.
• Dendrites accept the input information and process it in the cell body called soma.
• A single neuron is connected by axons to around 10,000 neurons and through these axons the processed
information is passed from one neuron to another neuron.
• A neuron gets fired if the input information crosses a threshold value, and it transmits signals to another neuron through a synapse.
• A synapse fires with electrical impulses called spikes, which are transmitted to another neuron.
• A single neuron can receive synaptic inputs from one neuron or from multiple neurons.
• These neurons form a network structure which processes input information and gives out a response.
Figure 10.1: A Biological Neuron
36. Artificial Neurons
• Artificial neurons are modelled on biological neurons and are called nodes.
• A node or a neuron can receive one or more input information and process it.
• Artificial neurons or nodes are connected by connection links to one another.
• Each connection link is associated with a synaptic weight.
Figure 10.2: Artificial Neurons
37. Simple Model of an Artificial Neuron
The first mathematical model of a biological neuron was designed by McCulloch & Pitts in 1943. It includes
two steps:
1. It receives weighted inputs from other neurons
2. It operates with a threshold function or activation function
• The received inputs are computed as a weighted sum which is given to the activation function and if the
sum exceeds the threshold value the neuron gets fired.
• The neuron is the basic processing unit that receives a set of inputs 𝑥1, 𝑥2, 𝑥3,… 𝑥𝑛and their associated
weights 𝑤1, 𝑤2, 𝑤3,…. 𝑤𝑛.
• The Summation function ‘Net-sum’ Eq. (10.1) computes the weighted sum of the inputs received by the
neuron.
Net-sum = Σ_{i=1}^{n} x_i w_i
The activation function is a binary step function which outputs a value 1 if the Net-sum is above the threshold value θ, and a 0 if the Net-sum is below the threshold value θ.
Therefore, the activation function is applied to the Net-sum as shown in the equation below.
f(x) = Activation function (Net-sum)
Then, the output of the neuron is y = 1 if f(x) ≥ θ, and y = 0 if f(x) < θ.
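The sketch below is an illustrative Python version of this two-step model; the weight and threshold values are assumptions chosen so the neuron behaves like a Boolean AND gate.

def mcculloch_pitts_neuron(inputs, weights, theta):
    # Step 1: weighted sum of the inputs; Step 2: binary step activation.
    net_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net_sum >= theta else 0

# With weights 0.5, 0.5 and threshold 0.8 the neuron fires only when
# both inputs are 1, i.e. it realizes the AND function.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mcculloch_pitts_neuron([x1, x2], [0.5, 0.5], 0.8))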
38. Artificial Neural Network Structure
• Artificial Neural Network (ANN) imitates a human brain, which exhibits some intelligence.
• It has a network structure represented as a directed graph with a set of neuron nodes and connection links or edges
connecting the nodes.
• The nodes in the graph are arrayed in a layered manner and can process information in parallel. The network given in
the figure has three layers called input layer, hidden layer and output layer. The input layer receives the input
information (𝑥1, 𝑥2, 𝑥3,… 𝑥𝑛) and passes it to the nodes in the hidden layer.
• The edges connecting the nodes from the input layer to the hidden layer are associated with synaptic weights called
as connection weights.
• These computing nodes or neurons perform some computations based on the input information (𝑥1, 𝑥2, 𝑥3,… 𝑥𝑛)
received and if the weighted sum of the inputs to a neuron is above the threshold or the activation level of the
neuron, then the neuron fires.
• Each neuron employs an activation function that determines the output of the neuron.
Figure 10.4: Artificial Neural Network Structure
39. Activation Function
Activation functions are mathematical functions associated with each neuron in the neural network that map input
signals to output signals.
It decides whether to fire a neuron or not based on the input signals the neuron receives.
These functions normalize the output value of each neuron either between 0 and 1 or between -1 and +1.
Below are some of the activation functions used in ANNs:
1. Identity Function or Linear Function f(x) = x ∀x
The value of f(x) increases linearly or proportionally with the value of x. This function is useful when we do not want to
apply any threshold. The output would be just the weighted sum of input values. The output value ranges between -∞
and +∞.
2. Binary Step Function:
f(x) = 1 if x ≥ θ; 0 if x < θ
The output value is binary, i.e., 0 or 1, based on the threshold value θ. If x is greater than or equal to θ, it outputs 1, or else it outputs 0.
3. Bipolar Step Function:
f(x) = +1 if x ≥ θ; −1 if x < θ
The output value is bipolar, i.e., +1 or −1, based on the threshold value θ. If x is greater than or equal to θ, it outputs +1, or else it outputs −1.
40. Conti..
4. Sigmoidal Function or Logistic Function
σ(x) = 1 / (1 + e^(−x))
It is a widely used non-linear activation function which produces an S-shaped curve and the output values
are in the range of 0 and 1.
5. Bipolar Sigmoid Function
σ(x) = (1 − e^(−x)) / (1 + e^(−x))
It outputs values between -1 and +1.
6. Ramp Functions
f(x) = 1 if x > 1; x if 0 ≤ x ≤ 1; 0 if x < 0
It is a linear function whose upper and lower limits are fixed.
7. Tanh – Hyperbolic Tangent Function
The Tanh function is a scaled version of the sigmoid function which is also non-linear. It also suffers from the
vanishing gradient problem. The output values range between -1 and 1.
tanh(x) = 2 / (1 + e^(−2x)) − 1
41. Conti..
8. ReLu – Rectified Linear Unit Function
This activation function is a typical function generally used in deep learning neural network models in the
hidden layers. It avoids or reduces the vanishing gradient problem. This function outputs a value of 0 for
negative input values and works like a linear function if the input values are positive.
r(x) = max(0, x), i.e., f(x) = x if x ≥ 0; 0 if x < 0
9. Softmax Function
This is a non-linear function used in the output layer that can handle multiple classes.
It calculates the probability of each target class which ranges between 0 and 1.
The probability of the input belonging to a particular class is computed by dividing the exponential of the
given input value by the sum of the exponential values of all the inputs.
S(x_i) = e^(x_i) / Σ_{j=1}^{k} e^(x_j)
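For reference, the sketch below gives plain-Python versions of several of these functions; it is an illustration only, and in practice library implementations (e.g., NumPy or PyTorch) would be used.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))                      # output in (0, 1)

def bipolar_sigmoid(x):
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))     # output in (-1, 1)

def tanh_act(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0          # same as math.tanh(x)

def relu(x):
    return max(0.0, x)                                     # 0 for negative inputs

def softmax(xs):
    exps = [math.exp(v) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]                       # probabilities summing to 1

print(sigmoid(0.0), relu(-2.0), relu(3.0))                 # 0.5 0.0 3.0
print(softmax([1.0, 2.0, 3.0]))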
42. PERCEPTRON AND LEARNING THEORY
• A perceptron is a fundamental unit in neural networks, essentially a model of a biological neuron.
• It's a binary classifier that takes multiple inputs, applies weights and a bias, and then uses an activation
function to produce a single output, typically 0 or 1.
• The perceptron algorithm learns by adjusting the weights to minimize the error between its prediction
and the desired output.
The perceptron model consists of 4 steps:
1. Inputs from other neurons
2. Weights and bias
3. Net sum
4. Activation function
The summation function ‘Net-sum’ Eq.
computes the weighted sum of the inputs received by the neuron.
Net-sum = Σ_{i=1}^{n} x_i w_i
43. Conti..
After computing the ‘Net-sum’, bias value is added to it and inserted in the activation function as shown below:
f(x) = Activation function (Net-sum + bias)
The activation function is a binary step function which outputs a value 1 if its input (Net-sum + bias) is at or above the threshold value θ, and a 0 if it is below θ.
Then, the output of the neuron is: f(x) = 1 if x ≥ θ, 0 if x < θ.
Set initial weights 𝑤1, 𝑤2, 𝑤3, … . . 𝑤𝑛 and bias 𝜃 to a random value in the range [-0.5, 0.5].
For each Epoch,
1. Compute the weighted sum by multiplying the inputs with the weights and add the products.
2. Apply the activation function on the weighted sum:
Y = Step((x₁w₁ + x₂w₂) − θ)
3. If the sum is above the threshold value, output the value as positive else output the value as negative.
4. Calculate the error by subtracting the estimated output Y_estimated from the desired output Y_desired:
Error e(t) = Y_desired − Y_estimated
5. Update the weights if there is an error: Δw_i = α × e(t) × x_i,
where x_i is the input value, e(t) is the error at step t, α is the learning rate and Δw_i is the change in weight to be added to w_i.
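A compact Python sketch of this training loop for a two-input perceptron follows; the initial weights, threshold and learning rate match the AND example below, but the code itself is only an illustration.

def step(value):
    return 1 if value >= 0 else 0

def train_perceptron(data, alpha=0.2, theta=0.4, w1=0.3, w2=-0.2, max_epochs=10):
    # data is a list of (x1, x2, desired_output) tuples.
    for _ in range(max_epochs):
        errors = 0
        for x1, x2, desired in data:
            estimated = step(x1 * w1 + x2 * w2 - theta)
            e = desired - estimated
            if e != 0:                   # update weights only when there is an error
                w1 += alpha * e * x1
                w2 += alpha * e * x2
                errors += 1
        if errors == 0:                  # converged: every pattern classified correctly
            break
    return w1, w2

and_data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
print(train_perceptron(and_data))        # converges near (0.3, 0.2), as in the worked example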
44. Conti..
• Problem 1: Consider a perceptron to represent the Boolean function AND with the initial weights w₁ = 0.3, w₂ = −0.2, learning rate α = 0.2 and threshold θ = 0.4 as shown in Figure 4.5. The activation function used here is the step function f(x), which gives a binary output, i.e., 0 or 1: if the value of f(x) is greater than or equal to 0, it outputs 1, or else it outputs 0. Design a perceptron that performs the Boolean function AND and update the weights until the Boolean function gives the desired output.
Figure 4.5: Perceptron for Boolean function AND.
48. Conti..
For input (1, 0) the weights are updated as follows:
Δw₁ = α × e(t) × x₁ = 0.2 × −1 × 1 = −0.2
w₁ = w₁ + Δw₁ = 0.5 − 0.2 = 0.3
Δw₂ = α × e(t) × x₂ = 0.2 × −1 × 0 = 0
w₂ = w₂ + Δw₂ = 0.2 + 0 = 0.2
Table 10.5: Epoch 4
It is observed that with 4 Epochs, the perceptron learns and the weights are updated to 0.3 and 0.2 with
which the perceptron gives the desired output of a Boolean AND function.
Epoch 𝑥1 𝑥2 𝑌𝑑𝑒𝑠 𝑌𝑒𝑠𝑡 Error 𝑤1 𝑤2 Status
4 0 0 0 Step ((0 × 0.3 + 0 × 0.2) – 0.4) = 0 0 0.3 0.2 No change
0 1 0 Step ((0 × 0.3 + 1 × 0.2) – 0.4) = 0 0 0.3 0.2 No change
1 0 0 Step ((1 × 0.3 + 0 × 0.2) – 0.4) = 0 0 0.3 0.2 No change
1 1 1 Step ((1 × 0.3 + 1 × 0.2) – 0.4) = 1 0 0.3 0.2 No change
49. Problem
Problem 1: Assume 𝑤1 = 0.6 𝑎𝑛𝑑 𝑤2 = 0.6, 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 1 𝑎𝑛𝑑 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 ƞ=0.5.
Compute OR gate using perceptron training rule.
Solution :
1. A=0, B=0 and target=0
Σ w_i x_i = w₁x₁ + w₂x₂ = 0.6 × 0 + 0.6 × 0 = 0
This is not greater than the threshold value of 1, so the output = 0.
2. A = 0, B = 1 and target = 1
Σ w_i x_i = 0.6 × 0 + 0.6 × 1 = 0.6
This is not greater than the threshold value of 1, so the output = 0.
w_i = w_i + ƞ(t − o) x_i
w₁ = 0.6 + 0.5(1 − 0)(0) = 0.6
w₂ = 0.6 + 0.5(1 − 0)(1) = 1.1
Now w₁ = 0.6, w₂ = 1.1, threshold = 1 and learning rate ƞ = 0.5
A B Y=A+B
(Target)
0 0 0
0 1 1
1 0 1
1 1 1
50. Problem
• Problem 4: Consider NAND gate, compute Perceptron training rule with W1=1.2,
W2=0.6 threshold =-1 and learning rate=1.5.
• Solution:
A B Y=𝐴. 𝐵
0 0 1
0 1 1
1 0 1
1 1 0
51. Delta Learning Rule and Gradient Descent
• Generally, learning in neural networks is performed by adjusting the network weights in order to
minimize the difference between the desired and estimated outputs.
• This delta difference is measured as an error function or also called as cost function.
• The cost function, being continuous, is differentiable.
• This way of learning, called the delta rule (also known as the Widrow-Hoff rule or Adaline rule), is a type of back propagation applied for training the network.
• The training error of a hypothesis is half the squared difference between the desired target
output and actual output and is given as follows:
Training Error = (1/2) Σ_{d ∈ T} (O_desired − O_estimated)²
where, T is the training dataset, 𝑂𝑑𝑒𝑠𝑖𝑟𝑒𝑑 and 𝑂𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝑑 are the desired target output and
estimated actual output, respectively, for a training instance d.
The principle of gradient descent is to minimize this error function by adjusting the weights in the direction of the negative gradient of the error (a minimal sketch follows below).
• Gradient descent learning is the foundation of the back propagation algorithm used in MLP.
• Before we study an MLP, let us first understand the different types of neural networks that differ in their structure, activation function and learning mechanism.
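The sketch below illustrates the delta-rule update for a single linear unit in Python; the toy data and learning rate are assumptions for the example, not values from the textbook.

def delta_rule(samples, alpha=0.05, epochs=200):
    # Delta (Widrow-Hoff) rule for one linear unit: w <- w + alpha * (d - o) * x,
    # which performs gradient descent on the half squared error above.
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, desired in samples:
            estimated = sum(wi * xi for wi, xi in zip(w, x))   # linear output
            error = desired - estimated
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# Toy data generated by y = 2*x1 - 1*x2; the learned weights approach (2, -1)
# as the training error is driven toward zero.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
print(delta_rule(data))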
52. TYPES OF ARTIFICIAL NEURAL NETWORKS
• ANNs consist of multiple neurons arranged in layers. There are different types of ANNs that differ by the network structure, the activation function involved and the learning rules used.
• In an ANN, there are three layers called input layer, hidden layer and output layer.
• Any general ANN would consist of one input layer, one output layer and zero or more hidden layers.
1. Feed Forward Neural Network
• This is the simplest neural network that consists of neurons which are arranged in layers and the information is
propagated only in the forward direction.
• This model may or may not contain a hidden layer and there is no back propagation.
• Based on the number of hidden layers they are further classified into single-layered and multi-layered feed
forward networks.
• These ANNs are simple to design and easy to maintain.
• They are fast but cannot be used for complex learning.
• They are used for simple classification and simple image
processing, etc.
Figure 10.7: Model of a Feed Forward Neural Network
53. Conti..
2 Fully Connected Neural Network
• A fully connected neural network, also known as a dense or feedforward neural network, is an artificial
neural network where each neuron in one layer is connected to every neuron in the subsequent layer.
• Information flows unidirectionally, from input to output, without any loops or feedback connections.
Figure 8: Model of a Fully Connected Neural Network
54. Conti..
3. Multi-Layer Perceptron (MLP)
• This ANN consists of multiple layers with one input layer, one output layer and one or more hidden layers.
• Every neuron in a layer is connected to all neurons in the next layer and thus they are fully connected.
• The information flows in both directions. In the forward direction, the inputs are multiplied by the weights of the neurons, passed through the activation function, and the output is forwarded to the next layer.
• If the output is incorrect, then in the backward direction, error is back propagated to adjust the weights
and biases to get correct output.
• Thus, the network learns with the training data.
• This type of ANN is used in deep learning for complex classification, speech recognition, medical diagnosis, forecasting, etc.
Figure 10.9: Model of a Multi-Layer Perceptron
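The sketch below shows one forward pass through a tiny 2-2-1 MLP with sigmoid units; the weights and biases are made-up illustrative values, and in training the backward pass would adjust them from the output error.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # Input -> hidden layer (sigmoid) -> single sigmoid output unit.
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)) + b_out)

output = forward([1.0, 0.0],
                 w_hidden=[[0.4, -0.3], [0.2, 0.6]],
                 b_hidden=[0.1, -0.2],
                 w_out=[0.7, -0.5],
                 b_out=0.05)
print(output)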
55. Feedback Neural Network
• Feedback neural networks have feedback connections between neurons that allow information flow in
both directions in the network.
• The output signals can be sent back to the neurons in the same layer or to the neurons in the preceding
layers.
• Hence, this network is more dynamic during training.
• It allows the network to learn from its previous outputs and adapt to dynamic environments.
• This iterative process, where outputs are reused as inputs, enables the network to refine its performance
over time and improve its ability to handle complex tasks or changing data.
Figure 10: Model of a Feedback Neural Network
56. POPULAR APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS
ANN learning mechanisms are used in many complex applications that involve modelling of non-
linear processes.
ANN is a useful model that can handle even noisy and incomplete data.
They are used to model complex patterns, recognize patterns and solve prediction problems like
humans in many areas such as:
1. Real-time applications: Face recognition, emotion detection, self-driving cars, navigation
systems, routing systems, target tracking, vehicle scheduling, etc.
2. Business applications: Stock trading, sales forecasting, customer behaviour modelling, Market
research and analysis, etc.
3. Banking and Finance: Credit and loan forecasting, fraud and risk evaluation, currency price
prediction, real-estate appraisal, etc.
4. Education: Adaptive learning software, student performance modelling, etc.
5. Healthcare: Medical diagnosis or mapping symptoms to a medical case, image interpretation
and pattern recognition, drug discovery, etc.
6. Other Engineering Applications: Robotics, aerospace, electronics, manufacturing,
communications, chemical analysis, food research, etc.
57. ADVANTAGES AND DISADVANTAGES OF ANN
Advantages of ANN
1. ANN can solve complex problems involving non-linear processes.
2. ANNs can learn and recognize complex patterns and solve problems as humans solve a
problem.
3. ANNs have a parallel processing capability and can predict in less time.
4. They have an ability to work with inadequate knowledge and can even handle incomplete and noisy data.
5. They can scale well to larger datasets and outperform other learning mechanisms.
Limitations of ANN
1. An ANN requires processors with parallel processing capability to train the network over many epochs. Each node requires CPU capability, which is difficult to provide for very large networks with a large amount of data.
2. They work like a ‘black box’ and it is exceedingly difficult to understand their working in inner
layers. Moreover, it is hard to understand the relationship between the representations learned at
each layer.
58. CHALLENGES OF ARTIFICIAL NEURAL NETWORKS
The major challenges while modelling a real-time application with ANNs are:
1. Training a neural network is the most challenging part of using this technique. Overfitting or
underfitting issues may arise if datasets used for training are not correct. It is also hard to
generalize to real-world data when trained with simulated data. Moreover, neural network models normally need a lot of training data to be robust and usable for a real-time application.
2. Finding suitable weight and bias parameters for the network is difficult.