Introduction to Optimizing ML Models for the Edge

Kumaran Ponnambalam
Principal Engineer - AI
Cisco Systems, Emerging Tech & Incubation
© 2023 Cisco and/or its affiliates. All rights reserved.
Agenda
• Deploying deep learning models at the edge
• Model compression techniques
• Quantization
• Pruning
• Low-rank approximation
• Knowledge distillation
• Leveraging edge hardware
• Model optimization best practices
Deep learning models at the edge
Edge AI: Growth & challenges
• Exponential growth in Edge AI applications
• Logistics, smart homes, transportation, security, etc.
• Computer vision, NLP, time series
• Challenges with using cloud-based models
• Latency
• Reliable network connectivity
• Security & privacy
• Challenges deploying deep learning models at the edge
• Huge model footprint (> available memory)
• Limited processing capacity
Deep learning models at the edge
Deep learning models need to be optimized for efficient and effective inference at the edge.
Edge AI: Goals for optimization
• Maintain model quality thresholds (accuracy, F1, recall, etc.)
• Reduce model size (compression)
• Improve runtime performance
• Latency
• FLOPS
• Power usage
• Leverage edge hardware capabilities
• Edge CPUs / GPUs
• Hardware accelerators
Model compression techniques
Model compression benefits
• Smaller memory footprint
• Reduced CPU/GPU time
• Lower latency
• Improved scaling per deployment
• Negligible loss of accuracy in most cases
• Easier packaging, transport and deployment
Quantization
• Reduce the storage size of parameters
• 32-bit float to 8-bit integer (4X reduction)
• Lower memory requirements
• Lower compute (FP vs. INT operations)
• Energy savings
• Possible loss of accuracy (depends on the model)
• Popular ML frameworks support quantization techniques
FP32                        INT8 (scale ≈ 57.7 = 127 / 2.20)
 0.76  -0.10   1.45           44    -6    84
-2.20   0.92  -0.89         -127    53   -51
-0.01   2.14   1.78           -1   124   103
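To make the mapping concrete, here is a minimal NumPy sketch of symmetric post-training quantization, with the scale chosen so the largest weight magnitude maps to ±127 (the function and variable names are illustrative, not from any specific framework):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization: map floats to int8 with one scale."""
    scale = 127.0 / np.max(np.abs(w))            # largest magnitude maps to +/-127
    q = np.clip(np.round(w * scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights, e.g. for accuracy checks."""
    return q.astype(np.float32) / scale

w = np.array([[0.76, -0.10, 1.45],
              [-2.20, 0.92, -0.89],
              [-0.01, 2.14, 1.78]], dtype=np.float32)
q, scale = quantize_int8(w)
err = np.max(np.abs(w - dequantize(q, scale)))   # worst-case rounding error
```

Dequantizing and comparing against the original weights gives a quick bound on the rounding error introduced: it can never exceed half an integer step, i.e. 0.5 / scale.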
Types of quantization
Post-training quantization
• Performed on a trained float model
• Convert weights, biases and activations to integers
• Simple to implement
• Some loss of accuracy

Quantization-aware training
• Performed during training
• The impact of quantization is validated and adjusted for as the model trains
• Post-training quantization of this model results in little or no loss of accuracy
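One common way quantization-aware training is implemented is "fake quantization": weights are rounded to the integer grid and immediately dequantized inside the forward pass, so training already sees the rounding error the deployed INT8 model will have. A minimal NumPy sketch, assuming a single linear layer (real frameworks also quantize activations and pass gradients through the rounding with a straight-through estimator):

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize-dequantize in one step: the rounding error the final INT8
    model will see is already present during training."""
    qmax = 2 ** (num_bits - 1) - 1         # 127 for int8
    scale = qmax / np.max(np.abs(w))
    return np.round(w * scale) / scale     # back to float, but on the integer grid

# Toy forward pass through one linear layer with fake-quantized weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3)).astype(np.float32)
x = rng.normal(size=(1, 4)).astype(np.float32)
y = x @ fake_quantize(w)   # training sees quantized behavior on every step
```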
Quantization performance
Retrieved from: https://www.softserveinc.com/en-us/blog/deep-learning-model-compression-and-optimization
Model pruning
• Eliminate model elements with low impact on outcomes
• Nodes
• Connections
• Layers
• Prune iteratively (increasing sparsity), testing performance at each step
• Size vs. accuracy trade-off
• Effectiveness depends on the nature of the data
• Popular ML frameworks support pruning techniques
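The iterative loop above can be sketched with simple magnitude-based pruning, where the lowest-magnitude connections are zeroed first (function names and the sparsity schedule are illustrative):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.
    (Ties in magnitude may zero slightly more; fine for a sketch.)"""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.array([[0.76, -0.10, 1.45],
              [-2.20, 0.92, -0.89],
              [-0.01, 2.14, 1.78]])

# Iteratively increase sparsity, re-testing model quality after each step.
for sparsity in (0.3, 0.5, 0.7):
    pruned = magnitude_prune(w, sparsity)
    # ...evaluate the pruned model here; stop when quality drops below threshold
```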
Types of pruning
Unstructured pruning
• Remove individual elements
• Connections
• Nodes
• Random removal with validation
• Can achieve higher size reductions, depending on the amount of pruning

Structured pruning
• Remove parts of the network
• Layers
• Channels
• Filters
• Easier process
• Benefits depend on the model
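As a sketch of structured pruning, assuming the rows of a weight matrix correspond to output channels, whole channels can be dropped by L2 norm, which actually shrinks the matrix instead of just making it sparse:

```python
import numpy as np

def prune_channels(w, keep_ratio):
    """Structured pruning sketch: drop whole output channels (rows) with the
    smallest L2 norm, returning a smaller matrix plus the kept indices."""
    norms = np.linalg.norm(w, axis=1)
    n_keep = max(1, int(round(w.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of strongest channels
    return w[keep], keep

w = np.array([[0.05, -0.02, 0.01],   # weak channel -> removed
              [1.20, -0.80, 0.64],
              [0.90, 1.10, -0.70],
              [0.03, 0.04, -0.02]])  # weak channel -> removed
smaller, kept = prune_channels(w, keep_ratio=0.5)
```

In a real network the following layer's input dimension must be shrunk to match the kept channels, which is why structured pruning is usually done layer by layer with validation in between.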
Pruning performance
Retrieved from: https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html
Low-rank approximation
• Reduce the number of parameters needed to represent the model
• Create a matrix of lower rank
• Eliminate redundant data
• Measure performance with the low-rank matrix
• Benefits vary based on the use case
• Popular ML frameworks have out-of-the-box support
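A common realization of this idea is a truncated SVD: replace an m × n weight matrix W with two thin factors, which saves parameters whenever rank × (m + n) < m × n. A minimal NumPy sketch (the matrix here is synthetic, constructed to be low-rank so the approximation is near-exact):

```python
import numpy as np

def low_rank_approx(w, rank):
    """Replace W (m x n) with U_r @ V_r: two thin factors instead of one
    dense matrix. Storage drops from m*n to rank*(m + n) parameters."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    v_r = vt[:rank, :]
    return u_r, v_r

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)) @ rng.normal(size=(32, 128))  # rank <= 32 by construction
u_r, v_r = low_rank_approx(w, rank=32)

params_before = w.size                 # 64 * 128 = 8192
params_after = u_r.size + v_r.size     # 32 * (64 + 128) = 6144
error = np.linalg.norm(w - u_r @ v_r) / np.linalg.norm(w)
```

For real trained weights the spectrum is not exactly low-rank, so the chosen rank trades reconstruction error (and hence accuracy) against parameter savings.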
Low-rank approximation performance
Retrieved from: https://www.researchgate.net/figure/Post-training-results-of-low-rank-approximation-no-fine-tuning-fine-tuning-with_tbl1_362859051
Knowledge distillation
• Train a small student model to mimic the outputs of a large teacher model
• The distillation process compares the two models' outputs for the same inputs and adjusts the student's parameters
• Smaller model footprint for the student
• Comparable accuracy / performance
• The training dataset can be use-case specific
[Diagram: a large teacher model and a small student model connected by the distillation process]
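A typical distillation objective, as a sketch: the KL divergence between temperature-softened teacher and student output distributions (the logits below are made up for illustration; real setups usually combine this with the ordinary task loss on labels):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student outputs; the
    student is trained to match the teacher's full output distribution."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

teacher = np.array([4.0, 1.0, 0.5])   # hypothetical logits for one input
student = np.array([3.5, 1.2, 0.4])
loss = distillation_loss(student, teacher)
```

The loss is zero only when the student reproduces the teacher's distribution exactly, and it grows as the two diverge, which is what drives the student's parameters toward the teacher's behavior.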
Comparison of techniques
                    Quantization   Pruning   Low-rank approx.   Knowledge distillation
Cost                Low            Low       Medium             High
During training     Yes            Yes       Yes                Yes
Post-training       Yes            Yes       Yes                Yes
Pretrained models   Yes            Yes       Yes                No
Compression process
• Create a baseline of the original model
• Parameters, training data, test results
• Set threshold levels for compression expectations
• Expected minimum accuracy, maximum resource usage
• Use an iterative approach
• Try model compression in stages
• Test with baseline training data
• Compare with baseline test results and thresholds
• Try different techniques to identify best approach
• Combining approaches is possible (e.g., quantization and pruning)
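The staged process above can be sketched as a loop. Note that `compress` and `evaluate` below are hypothetical stand-ins, not real APIs: in practice they would wrap a framework's quantization or pruning pass and an evaluation against the baseline test data:

```python
import numpy as np

# Hypothetical stand-ins, stubbed with toy arithmetic purely for illustration.
def compress(model, stage):
    return model * (1.0 - 0.1 * stage)      # pretend each stage compresses more

def evaluate(model):
    return 0.95 - 0.02 * float(np.mean(1.0 - model))   # pretend accuracy metric

baseline_model = np.ones(4)
baseline_accuracy = evaluate(baseline_model)   # baseline of the original model
min_accuracy = baseline_accuracy - 0.01        # threshold set up front

best = baseline_model
for stage in range(1, 6):                      # try compression in stages
    candidate = compress(best, stage)
    if evaluate(candidate) < min_accuracy:     # compare against the threshold
        break                                  # keep the last model that passed
    best = candidate
```

The point of the structure is that every candidate is judged against the baseline threshold, and the process keeps the most-compressed model that still meets it.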
Leveraging edge hardware
Edge specialized infrastructure
• Edge-optimized hardware
• Delivers the best performance in edge-constrained environments
• Low-end processors: microcontroller units (MCUs), neural processing units (NPUs)
• High-end processors: Google Edge TPU, NVIDIA Jetson
• Application-specific AI accelerators
• Edge frameworks
• Compile models to optimize for edge-specific hardware
• Leverage hardware-specific capabilities
• Create deployable packages for models
• E.g., NVIDIA TensorRT, Apache TVM, ONNX Runtime
Edge frameworks - benefits
• Optimize execution graph for hardware
• Reduce memory requirements
• Remove unwanted steps/instructions
• Fuse steps/instructions
• Choose best values for configuration options
• Evaluate multiple execution strategies and choose the best one
• Create an optimized executable for inference
• Package model for ready deployment
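One representative fusion these frameworks perform is folding batch normalization into the preceding linear or convolution layer, removing an entire step at inference. A NumPy sketch for a linear layer (the fold works because batch norm is just an affine transform at inference time):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((x @ w + b) - mean) / sqrt(var + eps) + beta
    into a single linear layer y = x @ w' + b'."""
    scale = gamma / np.sqrt(var + eps)   # one factor per output feature
    return w * scale, (b - mean) * scale + beta

rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 3)), rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mean, var = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)
x = rng.normal(size=(2, 4))

# Two ops (linear, then batch norm) ...
unfused = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-5) + beta
# ... become one op with folded weights.
w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
fused = x @ w_f + b_f
```

The fused layer produces identical outputs with one matrix multiply instead of two steps, which is exactly the kind of graph rewrite an edge compiler applies automatically.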
Edge compilation process
• Choose the right framework based on the deployment hardware
• E.g., TensorRT is best suited for NVIDIA processors
• Use the trained, validated and compressed model as input
• Compile the model
• Plan for multiple iterations
• Try available options for optimization/adaptation
• Validate model performance
• Use same benchmarks as compression
• Test on hardware specific development kits
• Create deployable artifact
Model optimization best practices
Best practices for optimization - 1
• Performance baselines and goals need to be established and validated
throughout the process
• Accuracy, latency, FLOPS, etc., based on the use case
• Helps ensure that the model performs as desired while going through optimizations
• Track results against baseline for all model training iterations over time
• Choose hardware / frameworks when beginning model training
• Deployment infrastructure may impact model architecture and optimizations needed
• Dependency / overlaps can be understood ahead of time
• Multiple deployment options may need to be supported
Best practices for optimization - 2
• Include edge hardware development kits/emulators as part of the training
lifecycle
• Automate optimization, compilation and testing
• Use similar hardware configurations as deployment
• Include collaborating edge applications in end-to-end testing as well
• Automate validating results and model promotion
• Monitor deployment performance
• Some optimizations may have a negative impact when deployed on actual hardware
• Monitor performance and validate against set baseline
• Improve models based on experience
Thank You
