Neural Networks: Universal Function Approximators
- Prakhar Mishra
Agenda
● Machine Learning Refresher
○ An Example
○ Hierarchical Division
○ Split Ratio
○ Evaluation Metric
● Neural Networks
○ Inspiration
○ Computation Graph
○ Architecture
○ Hyperparameters
○ Regularization
○ Backpropagation
Machine Learning - Quick Refresher
● Feature Engineering
● Figure out yourself
● Split Ratio: 70%-80% / 30%-20% (train / test)
Machine Learning - Evaluation Metrics
● Confusion Matrix
○ Evaluates the performance of a classification model
● Accuracy = (TP + TN) / total samples
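As a minimal sketch of the accuracy formula above, the snippet below counts the four confusion-matrix cells for binary labels and then applies (TP + TN) / total samples. The function names and the toy labels are illustrative assumptions, not part of the slides.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total samples."""
    return (tp + tn) / (tp + tn + fp + fn)


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))
```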
● Root Mean Squared Error
○ Spread of the predicted y-values about the original y-values.
○ RMSE = √( (1/N) x ∑i=1..N (Ŷi - Yi)² )
○ N = Total Samples, Ŷi = Predicted, Yi = Actual
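A short sketch of root mean squared error, assuming the standard definition (square root of the mean squared difference between predicted and actual values). The function name and sample values are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error over N paired samples."""
    n = len(actual)
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / n)


# Each prediction is off by 2, so the RMSE is 2.
print(rmse([2.0, 2.0], [0.0, 0.0]))
```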
Rise of Neural Nets
● Scale drives Deep Learning
● Learning from Data: Structured, Unstructured
Neural Nets - Supervised
Input                | Output           | Application
Home Features        | Cost             | Real Estate
Ad, User Information | Click on Ad?     | Online Advertising
Image                | Class (1...1000) | Photo Tagging
Audio                | Text             | Speech Recognition
English              | Chinese          | Machine Translation
Computation Graph
J(a, b, c) = 3(a + bc)
U = bc
V = a + U
J = 3V
Substitution
Graph: (b, c) → U = b*c; (a, U) → V = a + U; V → J = 3V
Input
a = 5, b = 3, c = 2
Forward →
U = 6, V = 11, J = 33
Backward ←
How does J change if we change V a bit?
How does J change if we change a a bit?
a→V→J
∂J/∂a = (∂J/∂V) x (∂V/∂a)
How does J change if we change b a bit?
b→U→V→J
∂J/∂b = (∂J/∂V) x (∂V/∂U) x (∂U/∂b)
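The forward and backward passes over this graph can be sketched as follows. The forward pass evaluates U, V and J left to right; the backward pass multiplies local derivatives along each path, as in the chain-rule expressions above. Function and key names are illustrative.

```python
def forward(a, b, c):
    """Evaluate the graph left to right: U = b*c, V = a + U, J = 3V."""
    u = b * c
    v = a + u
    j = 3 * v
    return u, v, j


def backward(a, b, c):
    """Chain local derivatives right to left to get dJ/d(each node)."""
    dj_dv = 3.0   # J = 3V
    dv_da = 1.0   # V = a + U
    dv_du = 1.0   # V = a + U
    du_db = c     # U = b*c
    du_dc = b     # U = b*c
    return {
        "V": dj_dv,
        "a": dj_dv * dv_da,
        "b": dj_dv * dv_du * du_db,
        "c": dj_dv * dv_du * du_dc,
    }


print(forward(5, 3, 2))   # values of U, V, J
print(backward(5, 3, 2))  # how J changes per unit change of V, a, b, c
```

With a = 5, b = 3, c = 2, changing V a bit changes J three times as much, and the same factor carries through to a; for b the extra factor ∂U/∂b = c doubles it.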
Architecture
[Diagram: inputs i1, i2, ..., in with weights w1, ..., wn feed a neuron that computes X and applies F, producing outputs o1, ..., on]
X = w1*i1 + w2*i2 + ... + wn*in + b
F = Activation Function
3 Layer NN
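A single unit of this architecture can be sketched as a weighted sum plus bias, passed through an activation function F. Sigmoid is used as the default F here purely as an assumption for illustration; the names are not from the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute F(X) where X = w1*i1 + w2*i2 + ... + wn*in + b."""
    x = sum(w * i for w, i in zip(weights, inputs)) + bias
    return activation(x)


# With the identity as F the unit just returns X = 3*1 + 4*2 + 5.
print(neuron([1.0, 2.0], [3.0, 4.0], 5.0, activation=lambda x: x))
```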
Hyperparameters
● There are a number of parameters that can be tuned while building your
neural network.
○ Number of Hidden Layers
○ Epochs
○ Loss Function
○ Optimization Function
○ Weight Initialization
○ Activation Functions
○ Batch Size
○ Learning Rate
Weight Initialization
● If the weights in a network start too small, then the signal shrinks as it
passes through each layer until it’s too tiny to be useful.
● If the weights in a network start too large, then the signal grows as it
passes through each layer until it’s too massive to be useful.
● Xavier Initialization
Weight Initialization
Wi = √(2 / ni)
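One way to read the scale above is as the standard deviation of randomly drawn weights. The sketch below assumes ni is the number of inputs to the layer and that weights are drawn from a standard normal distribution before scaling; both are assumptions, since the slide gives only the factor. Note that the √(2 / n) factor is usually attributed to He initialization (common with ReLU), while Xavier/Glorot initialization is often quoted with √(1 / n).

```python
import math
import random


def init_weights(n_in, n_out, seed=0):
    """Return an n_out x n_in weight matrix scaled by sqrt(2 / n_in).

    Assumption: entries are standard-normal draws times the scale factor.
    """
    rng = random.Random(seed)
    scale = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]


weights = init_weights(n_in=100, n_out=200)
flat = [w for row in weights for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(variance)  # should be close to 2 / n_in = 0.02
```

Keeping the variance tied to the layer width is what stops the signal from shrinking or growing as it passes through successive layers.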
Loss Functions
● Binary Cross Entropy
● Categorical Cross Entropy
● Root Mean Squared Error
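The two cross-entropy losses listed above can be sketched as below, assuming the standard definitions averaged over samples. The clipping constant eps is an implementation assumption to avoid log(0).

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def categorical_cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    """Mean of -sum_k t_k*log(p_k) over samples, with one-hot targets."""
    total = 0.0
    for t_row, p_row in zip(y_true_onehot, y_prob):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true_onehot)


# A 50% confident prediction of the true class costs log(2) in both cases.
print(binary_cross_entropy([1], [0.5]))
print(categorical_cross_entropy([[0, 1, 0]], [[0.25, 0.5, 0.25]]))
```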
Optimization Functions
● Adagrad Optimizer
● Gradient Descent Optimizer
● Adam Optimizer
● Stochastic Gradient Descent Optimizer
● RMSProp Optimizer
Optimization Functions - Adam
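The slide content for Adam did not survive extraction, so the following is a sketch of the commonly published Adam update rule (moving averages of the gradient and its square, with bias correction), applied to a one-dimensional toy problem. The default values and the toy objective are assumptions for illustration.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using Adam updates."""
    m = 0.0  # moving average of gradients (first moment)
    v = 0.0  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2*(theta - 3).
theta_min = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_min)
```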
Learning Rate
● Decaying the learning rate over time is seen to speed up the learning
process/convergence.
Learning Rate- Intuition
Learning Rate- Formula
Alpha1 = Alpha0 x 1 / (1 + decay x learning_rate)
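A hedged sketch of a decay schedule of this shape. It assumes the common time-based form in which the denominator grows with the epoch number, so the rate shrinks each epoch; the parameter names and the sample values are illustrative and may not match the slide's exact intent.

```python
def decayed_learning_rate(alpha0, decay, epoch):
    """Time-based decay: alpha = alpha0 / (1 + decay * epoch)."""
    return alpha0 / (1.0 + decay * epoch)


# Starting at 0.2 with decay 1.0, the rate shrinks every epoch.
rates = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
print(rates)
```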
Learning Rate- Special Case
Wi = Wi-1 - Alpha x Slope
Pseudo self-adaptive on a convex curve: the slope shrinks near the minimum, so the step shrinks with it.
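The self-adapting behaviour on a convex curve can be seen in a few steps of plain gradient descent. The sketch assumes the update moves against the slope and uses f(w) = w² (slope 2w) as a toy convex curve; Alpha stays fixed while the step size still shrinks.

```python
def gradient_descent(slope_fn, w, alpha=0.1, steps=5):
    """Repeat w = w - alpha * slope(w), recording each (w, step size)."""
    history = []
    for _ in range(steps):
        step = alpha * slope_fn(w)
        w = w - step
        history.append((w, abs(step)))
    return history


# Toy convex curve f(w) = w^2, whose slope is 2w.
history = gradient_descent(lambda w: 2.0 * w, w=10.0)
for w, step in history:
    print(round(w, 4), round(step, 4))
```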
Activation Functions
Biologically inspired by activity of our brain, where different neurons are
activated by different stimuli.
Activation Functions - Sigmoid
Activation Functions - Tanh
Activation Functions - ReLU
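The three activation functions named above were shown as plots; their standard definitions can be sketched as follows.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes x into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Passes positive values through and clips negatives to 0."""
    return max(0.0, x)


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value))
```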
Activation Functions - Standards
● In practice, Tanh outperforms Sigmoid for internal layers.
○ Tanh outputs are centered around 0.
○ Sigmoid outputs are centered around 0.5.
○ In ML, we tend to center our data to avoid any kind of bias behaviour.
● Rule of thumb: ReLU for hidden layers generally performs well.
● Avoid Sigmoid for hidden layers.
● Sigmoid is a good candidate for the output layer in binary classification problems.
● Identity function for hidden layers makes no sense.
Activation Functions - ReLU or Tanh?
ReLU > Tanh
● Avoids Vanishing Gradient
● Is it the best? [No]
Activation Functions - Why?
● Because fLinear ∘ fLinear = fLinear: a composition of linear functions is itself linear, so (N) layers collapse to (N-X) layers.
● Only trivial functions are learned.
● More Advanced Functions - Nonlinear.
● Should be Differentiable - for Backpropagation.
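The collapse of stacked linear layers can be checked directly: applying two weight matrices in sequence gives the same result as applying their product once. The matrices and input below are made up for illustration, and biases are omitted for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(l * r for l, r in zip(row, col)) for col in columns]
            for row in left]


W1 = [[1.0, 2.0], [0.0, -1.0]]  # first "layer", no activation
W2 = [[0.5, 1.0], [2.0, 0.0]]   # second "layer", no activation
x = [3.0, 4.0]

two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)
print(two_layers, one_layer)
```

Both calls produce the same vector, which is why a nonlinear activation between layers is needed for depth to add expressive power.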
Batch Size
● The batch size is the number of samples passed through the network at a time.
● Advantages
○ Your machine might not be able to fit all the data in memory at once.
○ You want your model to generalize quickly.
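Splitting a dataset into batches can be sketched with a small generator, so that only one batch needs to be held for a forward/backward pass at a time. The helper name and the toy dataset are illustrative.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten toy samples with a batch size of 4: the last batch is smaller.
batches = list(iterate_batches(list(range(10)), batch_size=4))
print(batches)
```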
Training - Pre:1 Derivative
Training - Pre:2 Partial Derivative
Training - Pre:3 Chain Rule
Training - Example
Input Layer: X1 = 0.05, X2 = 0.10, X3 = 0.02
Hidden Layer: H1, H2, H3 (weights from the inputs to H1: w1 = 0.15, w2 = 0.20, w3 = 0.30)
Output Layer: O1 (weight from H1 to O1: 0.33), Output Y
Training - Forward Propagation
H1 = ∑i=1..3 wi*xi (Compact Representation)
H1 = w1*x1 + w2*x2 + w3*x3 (Expanded Representation)
H1 = 0.15*0.05 + 0.20*0.10 + 0.30*0.02 = 0.0335
Hσ1 = σ(H1) = σ(0.0335) ≈ 0.5084
O1 = ∑i Hσi*wi (Compact Representation)
O1 = Hσ1*0.33 ≈ 0.5084*0.33 ≈ 0.1678
Oσ1 = σ(O1) ≈ 0.5418
σ(H) = 1 / (1 + e^(-H))
Error = |Y - Yi|
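The forward pass for the H1 path can be sketched with the example's inputs and weights. The sketch assumes the activated hidden value σ(H1) is what feeds the output unit, and that only H1 contributes to O1 (the example gives no weights for H2 and H3). Variable names are illustrative.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


x = [0.05, 0.10, 0.02]  # inputs X1, X2, X3
w = [0.15, 0.20, 0.30]  # weights from the inputs into H1
w_out = 0.33            # weight from H1 to O1

h1 = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum into H1
h1_act = sigmoid(h1)                       # activated hidden value
o1 = h1_act * w_out                        # weighted sum into O1 (H1 path only)
o1_act = sigmoid(o1)                       # network output

print(h1, h1_act, o1, o1_act)
```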
Training - Backward Propagation
The goal is to update each of the weights in the network so that the
actual output moves closer to the target output.
∂Error/∂wi = Partial derivative of Error w.r.t. wi
Error = |Y - Yi|
w4 → O1 → Oσ1 → Error
∂Error/∂w4 = (∂Error/∂Oσ1) x (∂Oσ1/∂O1) x (∂O1/∂w4)
w1 → H1 → Hσ1 → Error
∂Error/∂w1 = (∂Error/∂Hσ1) x (∂Hσ1/∂H1) x (∂H1/∂w1)
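The two chain-rule products can be sketched numerically and checked against finite differences. The sketch assumes a single H1 → O1 path, that σ(H1) feeds the output unit, that w4 is the 0.33 weight, and a hypothetical target Y = 1.0 (the example does not state a target). All names are illustrative.

```python
import math

X = (0.05, 0.10, 0.02)        # inputs X1, X2, X3
W1, W2, W3 = 0.15, 0.20, 0.30  # weights into H1
W4 = 0.33                      # weight from H1 to O1 (assumed to be w4)
Y_TARGET = 1.0                 # hypothetical target, not from the slides


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def forward(w1, w4):
    h1 = w1 * X[0] + W2 * X[1] + W3 * X[2]
    h1_act = sigmoid(h1)
    o1_act = sigmoid(h1_act * w4)
    return h1_act, o1_act


def error(w1, w4, y):
    return abs(y - forward(w1, w4)[1])


def gradients(w1, w4, y):
    """Return (dError/dw1, dError/dw4) via the chain rule."""
    h1_act, o1_act = forward(w1, w4)
    d_err_d_o1act = -1.0 if y > o1_act else 1.0  # derivative of |y - o|
    d_o1act_d_o1 = o1_act * (1.0 - o1_act)       # sigmoid derivative
    grad_w4 = d_err_d_o1act * d_o1act_d_o1 * h1_act
    d_err_d_h1act = d_err_d_o1act * d_o1act_d_o1 * w4
    d_h1act_d_h1 = h1_act * (1.0 - h1_act)       # sigmoid derivative
    grad_w1 = d_err_d_h1act * d_h1act_d_h1 * X[0]
    return grad_w1, grad_w4


grad_w1, grad_w4 = gradients(W1, W4, Y_TARGET)
print(grad_w1, grad_w4)
```

Both gradients come out negative here, meaning that increasing either weight would reduce the error for this assumed target.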
[Diagram: H1 feeds two output units through weights 0.33 and 0.04, giving errors Eo1 and Eo2]
Etotal = Eo1 + Eo2
∂Etotal/∂Hσ1 = ∂Eo1/∂Hσ1 + ∂Eo2/∂Hσ1
∂Eo1/∂Hσ1 = (∂Eo1/∂O1) x (∂O1/∂Hσ1)
∂Eo2/∂Hσ1 = (∂Eo2/∂O2) x (∂O2/∂Hσ1)
∂Etotal/∂w1 = (∂Etotal/∂Hσ1) x (∂Hσ1/∂H1) x (∂H1/∂w1)
Works perfectly on training data?
Regularization
Technique for preventing overfitting
Regularization reduces overfitting by adding a penalty to the loss function
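As one possible instance of adding a penalty to the loss, the sketch below uses an L2 penalty (the sum of squared weights times a strength factor). The slides do not name a specific penalty, so both the choice of L2 and the parameter name lam are assumptions.

```python
def l2_penalized_loss(data_loss, weights, lam=0.01):
    """Return data_loss + lam * sum(w^2) over all weights."""
    penalty = lam * sum(w * w for w in weights)
    return data_loss + penalty


# Larger weights raise the total loss even when the data loss is unchanged.
print(l2_penalized_loss(1.0, [1.0, 2.0], lam=0.1))
print(l2_penalized_loss(1.0, [0.1, 0.2], lam=0.1))
```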
Regularization- Dropout
● Dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase.
● Avoids co-dependency amongst neurons during training.
● Units are dropped with a given probability (20%-50%) in each weight update cycle.
● Dropout at each layer of the network has shown good results.
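A minimal sketch of dropout applied to one layer's activations during training. It assumes the "inverted dropout" variant, in which surviving activations are scaled by 1 / keep probability so that their expected value is unchanged; the slides do not specify a variant, and the names are illustrative.

```python
import random


def dropout(activations, drop_prob=0.5, rng=None):
    """Zero each activation with probability drop_prob; rescale the rest."""
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


layer_output = [0.2, 0.9, 0.5, 0.7]
dropped = dropout(layer_output, drop_prob=0.5, rng=random.Random(0))
print(dropped)
```

At inference time no units are dropped, which corresponds to calling the function with drop_prob=0.0.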
