Regulated Software Research Centre (RSRC)
Dundalk Institute of Technology
Artificial Neural Networks
Muhammad Adil Raja
July 6, 2023
Content
Introduction
Connectionist Models
2 ANN examples: ALVINN driving system, face recognition
The Perceptron
The Linear Unit
The Sigmoid Unit
The Gradient Descent Learning Rule
Multilayer Networks of Sigmoid Units
The Backpropagation Algorithm
Hidden Layer Representations
Overfitting in NNs
Advanced Topics
Alternative Error Functions
Recurrent Neural Networks
Connectionist Models
Consider humans:
▶ Neuron switching time: ~0.001 second
▶ Number of neurons: ~10^10
▶ Connections per neuron: ~10^4 to 10^5
▶ Scene recognition time: ~0.1 second
▶ 100 inference steps doesn't seem like enough → much parallel computation
Properties of artificial neural networks (ANNs):
▶ Many neuron-like threshold switching units.
▶ Many weighted interconnections among units.
▶ Highly parallel, distributed processing.
▶ Emphasis on tuning weights automatically.
A First ANN Example:
ALVINN Drives at 70 mph on Highways
Figure: The ALVINN network architecture: a 30×32 sensor input retina feeding 4 hidden units and 30 output units.
A Second ANN Example:
ANNs for Face Recognition
Figure: 30×32 inputs.
Figure: Results: 90% accurate at learning head pose, and recognizing 1-of-20 faces.
Learned Weights
When to Consider ANNs
▶ Input is high-dimensional discrete or real-valued (e.g. raw sensor
input)
▶ Output is discrete or real-valued
▶ Output is a vector of values
▶ Possibly noisy data
▶ Form of the target function is unknown
▶ Human readability of result is unimportant
Examples:
▶ Speech phoneme recognition [Waibel].
▶ Image classification [Kanade, Baluja, Rowley].
▶ Financial prediction.
A Timeless Meme
Figure: A very relevant meme.
The Perceptron
The Linear Unit [Rosenblatt, 1962]
Figure: The single-layer perceptron: inputs x1, …, xn (with x0 = 1) are weighted by w0, w1, …, wn and summed to net = Σ_{i=0}^{n} w_i x_i; the output is 1 if the sum is greater than 0 and −1 otherwise.
$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

Sometimes we'll use the simpler vector notation:

$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$$
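To make the thresholded output concrete, here is a minimal NumPy sketch (not from the slides; the function and variable names are illustrative). It also shows one choice of weights that realizes AND over {0, 1} inputs, anticipating the question on the next slide.

```python
import numpy as np

def perceptron_output(w, x):
    """o(x) = 1 if w . x > 0, else -1; x[0] is fixed to 1 so that w[0] plays the role of w0."""
    return 1 if np.dot(w, x) > 0 else -1

# One set of weights that realizes AND(x1, x2) over inputs in {0, 1}:
w = np.array([-0.8, 0.5, 0.5])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), perceptron_output(w, np.array([1.0, x1, x2])))   # -1, -1, -1, 1
```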
The Decision Surface of a Perceptron
Figure: The decision surface of a two-input perceptron: (a) a set of positive and negative examples that is linearly separable by the perceptron's decision line; (b) a set of examples (such as XOR) that is not linearly separable.
Represents some useful functions
▶ What weights represent g(x1, x2) = AND(x1, x2)?
But some functions not representable
▶ e.g., not linearly separable
▶ Therefore, we’ll want networks of these...
Perceptron Training Rule
wi ← wi + ∆wi
where
∆wi = η(t − o)xi
Where:
▶ t = c(x⃗) is the target value
▶ o is the perceptron output
▶ η is a small constant (e.g., 0.1) called the learning rate
Can prove it will converge [Minsky and Papert, 1969]:
▶ if the training data is linearly separable
▶ and η is sufficiently small
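A hedged sketch of the training rule in NumPy (train_perceptron and its defaults are illustrative, not part of the slides): each weight is nudged by η(t − o)xi after every example.

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=100):
    """examples: list of (x, t) pairs with x[0] = 1 and t in {-1, +1}."""
    w = np.zeros(len(examples[0][0]))          # could also start from small random values
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if np.dot(w, x) > 0 else -1  # current perceptron output
            w += eta * (t - o) * x             # Delta w_i = eta * (t - o) * x_i
    return w
```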
The Gradient Descent Learning Rule
Figure: The error surface E[w⃗] as a function of the weights w0 and w1 for a linear unit; gradient descent moves downhill on this surface toward the minimum-error weights.
Gradient:

$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$$

Training rule:

$$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$$

i.e.,

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
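The training rule above applies to any differentiable error function. A small illustrative sketch in NumPy (grad_E and the toy error surface are assumptions, not from the slides):

```python
import numpy as np

def gradient_descent_step(w, grad_E, eta=0.05):
    """One update: w <- w - eta * grad E[w]."""
    return w - eta * grad_E(w)

# Toy error surface E(w) = (w0 - 1)^2 + (w1 + 2)^2, so grad E = [2(w0 - 1), 2(w1 + 2)].
grad_E = lambda w: np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])
w = np.array([3.0, 3.0])
for _ in range(200):
    w = gradient_descent_step(w, grad_E)
print(w)   # approaches the minimum at (1, -2)
```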
The Gradient Descent Rule for the Linear Unit I
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
\frac{\partial E}{\partial w_i} &= \sum_d (t_d - o_d)(-x_{i,d})
\end{aligned}$$
The Gradient Descent Rule for the Linear Unit II
Each training example is a pair of the form ⟨x⃗, t⟩, where x⃗ is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
▶ Initialize each wi to some small random value.
▶ Until the termination condition is met, Do
  ▶ Initialize each ∆wi to zero.
  ▶ For each ⟨x⃗, t⟩ in training_examples, Do
    ▶ Input the instance x⃗ to the unit and compute the output o.
    ▶ For each linear unit weight wi, Do: ∆wi ← ∆wi + η(t − o)xi
  ▶ For each linear unit weight wi, Do: wi ← wi + ∆wi
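The procedure above translates almost line for line into code. A minimal NumPy sketch (the function name and defaults are illustrative, not part of the slides):

```python
import numpy as np

def gradient_descent_linear_unit(training_examples, eta=0.05, iterations=1000):
    """Batch gradient descent for a linear unit o = w . x.
    training_examples: list of (x, t) with x a NumPy vector whose first entry is 1."""
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n)        # small random initial weights
    for _ in range(iterations):                  # termination condition
        delta_w = np.zeros(n)                    # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = np.dot(w, x)                     # linear unit output
            delta_w += eta * (t - o) * x         # Delta w_i <- Delta w_i + eta (t - o) x_i
        w += delta_w                             # w_i <- w_i + Delta w_i
    return w
```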
Convergence
[Hertz et al., 1991]
The gradient descent training rule used by the linear unit is
guaranteed to converge to a hypothesis with minimum squared error
▶ given a sufficiently small learning rate η
▶ even when the training data contains noise
▶ even when the training data is not separable by H.
Note: If η is too large, the gradient descent search runs the risk of
overstepping the minimum in the error surface rather than settling into
it. For this reason, one common modification of the algorithm is to
gradually reduce the value of η as the number of gradient descent
steps grows.
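The note on shrinking η can be realized with a simple decay schedule; the formula below is only one hypothetical choice, not prescribed by the slides.

```python
def decayed_eta(eta0, step, decay=1e-3):
    """Example schedule: eta_t = eta0 / (1 + decay * t), so eta shrinks as steps grow."""
    return eta0 / (1.0 + decay * step)
```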
Incremental (Stochastic) Gradient Descent
Batch mode gradient descent: Do until satisfied:
1. Compute the gradient ∇E_D[w⃗]
2. w⃗ ← w⃗ − η ∇E_D[w⃗]

Incremental mode gradient descent: Do until satisfied, for each training example d in D:
1. Compute the gradient ∇E_d[w⃗]
2. w⃗ ← w⃗ − η ∇E_d[w⃗]

where

$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad\qquad E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
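The two modes differ only in when the weights are updated. A hedged sketch for the linear unit (function names are illustrative): the batch step uses the gradient summed over all of D, while the incremental step updates after every example.

```python
import numpy as np

def batch_update(w, D, eta):
    """One batch step: w <- w - eta * grad E_D[w], summing the gradient over all of D."""
    grad = np.zeros_like(w)
    for x, t in D:
        grad += -(t - np.dot(w, x)) * x        # contribution of example d to grad E_D
    return w - eta * grad

def incremental_update(w, D, eta):
    """One pass of incremental (stochastic) steps: update after each example d."""
    for x, t in D:
        grad_d = -(t - np.dot(w, x)) * x       # grad E_d[w] for this single example
        w = w - eta * grad_d
    return w
```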
The Sigmoid Unit I
Figure: The sigmoid unit: inputs x1, …, xn (with x0 = 1) and weights w0, w1, …, wn are combined into net = Σ_{i=0}^{n} w_i x_i, and the output is o = σ(net) = 1/(1 + e^(−net)).
The Sigmoid Unit II
σ(x) is the sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Nice property: $\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$

We can derive gradient descent rules to train:
▶ One sigmoid unit.
▶ Multilayer networks of sigmoid units → Backpropagation.
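The sigmoid and its derivative in NumPy, for reference (a straightforward sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```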
Error Gradient of the Sigmoid Unit I
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\end{aligned}$$
Error Gradient of the Sigmoid Unit II
But we know:

$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d) \qquad\qquad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$

So:

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}$$
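Putting the pieces together, a minimal sketch of the gradient for a single sigmoid unit (names and the usage comment are illustrative, not from the slides):

```python
import numpy as np

def sigmoid_unit_gradient(w, examples):
    """dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d} for a single sigmoid unit."""
    grad = np.zeros_like(w)
    for x, t in examples:
        o = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # o_d = sigma(w . x_d)
        grad += -(t - o) * o * (1.0 - o) * x
    return grad

# A gradient descent step for the unit would then be:
#   w = w - eta * sigmoid_unit_gradient(w, examples)
```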
Multilayer Networks of Sigmoid Units I
This network was trained to
recognize 1 of 10 vowel sounds
occurring in the context h...d (e.g.
head, hid).
The inputs have been obtained
from a spectral analysis of sound.
The 10 network outputs
correspond to the 10 possible
vowel sounds.
The network prediction is the
output whose value is the highest.
Figure: The network: two inputs (the spectral parameters F1 and F2), a hidden layer, and ten outputs labeled with the vowel classes (head, hid, who'd, hood, …).
Multilayer Networks of Sigmoid Units II
This plot illustrates the highly
non-linear decision surface
represented by the learned
network.
Points shown on the plot are test
examples distinct from the
examples used to train the
network.
The Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do:
▶ For each training example, Do:
1. Input the training example to the network and compute the network outputs.
2. For each output unit k: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
3. For each hidden unit h: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k}\, \delta_k$
4. Update each network weight wi,j: $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$
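A compact sketch of the algorithm for one hidden layer of sigmoid units (shapes, names, and defaults are assumptions; the slides do not fix a particular implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(examples, n_in, n_hidden, n_out, eta=0.3, epochs=5000, seed=0):
    """Stochastic backpropagation for a single hidden layer of sigmoid units.
    examples: list of (x, t) with x of length n_in and t of length n_out."""
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))   # hidden weights (+ bias column)
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))  # output weights (+ bias column)
    for _ in range(epochs):                                 # "until satisfied"
        for x, t in examples:
            xb = np.concatenate(([1.0], x))                 # prepend bias input x0 = 1
            h = sigmoid(W_h @ xb)                           # hidden activations
            hb = np.concatenate(([1.0], h))
            o = sigmoid(W_o @ hb)                           # network outputs
            delta_o = o * (1 - o) * (t - o)                       # output-unit errors delta_k
            delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)      # hidden-unit errors delta_h
            W_o += eta * np.outer(delta_o, hb)              # w_ij <- w_ij + eta * delta_j * x_ij
            W_h += eta * np.outer(delta_h, xb)
    return W_h, W_o
```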
More on Backpropagation
▶ Gradient descent over entire network weight vector
▶ Easily generalized to arbitrary directed graphs
▶ Will find a local, not necessarily global error minimum
▶ In practice, often works well (can run multiple times)
▶ Often include a weight momentum term α (sketched after this list):
∆wi,j(n) = η δj xi,j + α ∆wi,j(n − 1)
▶ Minimizes error over training examples
▶ Will it generalize well to subsequent examples?
▶ Training can take thousands of iterations → slow!
▶ Using network after training is very fast
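The momentum term is added by remembering the previous weight change; a brief sketch (names are illustrative, not from the slides):

```python
def momentum_update(w, grad_term, delta_w_prev, eta=0.3, alpha=0.9):
    """Delta w(n) = eta * delta_j * x_ij + alpha * Delta w(n - 1); grad_term stands for delta_j * x_ij."""
    delta_w = eta * grad_term + alpha * delta_w_prev
    return w + delta_w, delta_w        # keep delta_w so it can feed the next step's momentum
```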
Learning Hidden Layer Representations I
Figure: an 8 × 3 × 8 network (eight inputs, three hidden units, eight outputs).
A target function:
Input Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Can this be learned??
Learned hidden layer representation:
Learning Hidden Layer Representations II
Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
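As a usage example, the 8 × 3 × 8 identity task can be trained with the backprop_train sketch from the backpropagation slide (that sketch, its names, and the hyperparameters are assumptions, not the slides' own code); printing the hidden activations recovers a table like the one above.

```python
import numpy as np

# One-hot identity task: each of the eight inputs should be reproduced at the output.
I = np.eye(8)
examples = [(I[i], I[i]) for i in range(8)]

W_h, W_o = backprop_train(examples, n_in=8, n_hidden=3, n_out=8, eta=0.3, epochs=5000)

# Inspect the learned hidden encoding for each one-hot input.
for x, _ in examples:
    h = 1.0 / (1.0 + np.exp(-(W_h @ np.concatenate(([1.0], x)))))
    print(x.argmax(), np.round(h, 2))
```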
Training I
Figure: Sum of squared errors for each output unit versus training iteration (0 to 2500).
Training II
Figure: Hidden unit encoding for the input 01000000 versus training iteration (0 to 2500).
Training III
Figure: Weights from the inputs to one hidden unit versus training iteration (0 to 2500).
Convergence of Backpropagation
Gradient descent to some local minimum
▶ Perhaps not global minimum...
▶ Add momentum
▶ Stochastic gradient descent
▶ Train multiple nets with different initial weights
Nature of convergence
▶ Initialize weights near zero
▶ Therefore, initial networks near-linear
▶ Increasingly non-linear functions possible as training progresses
Expressive Capabilities of ANNs
Boolean functions:
▶ Every Boolean function can be represented by a network with a single hidden layer,
▶ but it might require a number of hidden units that is exponential in the number of inputs.
Continuous functions:
▶ Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
▶ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Alternative Error Functions
Penalize large weights:

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$

Train on target slopes as well as values:

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_j^d} - \frac{\partial o_{kd}}{\partial x_j^d} \right)^{\!2} \right]$$

Tie together weights:
▶ e.g., in the phoneme recognition network
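A small sketch of the weight-decay variant (the first formula above); the function and parameter names are illustrative, not from the slides.

```python
import numpy as np

def error_with_weight_decay(outputs, targets, weight_matrices, gamma=1e-4):
    """E(w) = 1/2 * sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_ji w_ji^2."""
    sse = 0.5 * np.sum((targets - outputs) ** 2)                    # usual sum of squared errors
    penalty = gamma * sum(np.sum(W ** 2) for W in weight_matrices)  # penalize large weights
    return sse + penalty
```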
Recurrent Networks
Figure: (a) A feedforward network computing y(t + 1) from x(t); (b) a recurrent network in which the context c(t) is fed back as an additional input; (c) the recurrent network unfolded in time over x(t), x(t − 1), x(t − 2).
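One common reading of panel (b) is an Elman-style network, where the hidden activation is fed back as the next step's context; a hedged sketch of a single time step (all names and shapes are illustrative assumptions):

```python
import numpy as np

def recurrent_step(x_t, c_t, W_in, W_ctx, W_out):
    """One time step: hidden state from input x(t) and context c(t); output y(t + 1)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W_in @ x_t + W_ctx @ c_t)   # hidden activation from input and context
    y = sigmoid(W_out @ h)                  # network output y(t + 1)
    return y, h                             # h is fed back as the context c(t + 1)
```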