Introduction to Optimizing ML Models for the Edge

Kumaran Ponnambalam
Principal Engineer - AI
Cisco Systems, Emerging Tech & Incubation
© 2023 Cisco and/or its affiliates. All rights reserved.
Agenda
• Deploying deep learning models at the edge
• Model compression techniques
• Quantization
• Pruning
• Low-rank approximation
• Knowledge distillation
• Leveraging edge hardware
• Model optimization best practices
Deep learning models at the edge
Edge AI: Growth & challenges
• Exponential growth in Edge AI applications
• Logistics, smart homes, transportation, security, etc.
• Computer vision, NLP, time series
• Challenges with using cloud-based models
• Latency
• Reliable network connectivity
• Security & privacy
• Challenges deploying deep learning models at the edge
• Huge model footprint (> available memory)
• Limited processing capacity
Deep learning models at the edge
Deep learning models need to be optimized for efficient and effective inference at the edge.
Edge AI: Goals for optimization
• Maintain model quality thresholds (accuracy, F1, recall, etc.)
• Reduce model size (compression)
• Improve runtime performance
• Latency
• FLOPS
• Power usage
• Leverage edge hardware capabilities
• Edge CPUs / GPUs
• Hardware accelerators
Model compression techniques
Model compression benefits
• Smaller memory footprint
• Reduced CPU/GPU time
• Lower latency
• Improved scaling per deployment
• Negligible loss of accuracy in most cases
• Easier packaging, transport and deployment
Quantization
• Reduce the storage size of parameters
• 32-bit float to 8-bit integer (4X reduction)
• Lower memory requirements
• Lower compute (FP vs. INT operations)
• Energy savings
• Possible loss of accuracy (depends on the model)
• Popular ML frameworks support quantization techniques
FP32                        INT8 (scale ≈ 57.7 = 127 / 2.20)
 0.76  -0.10   1.45           44    -6    84
-2.20   0.92  -0.89         -127    53   -51
-0.01   2.14   1.78           -1   124   103
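To make the mapping concrete, here is a minimal NumPy sketch of symmetric post-training quantization, with the scale chosen so the largest weight magnitude maps to ±127 (the function and variable names are illustrative, not from any specific framework):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization: map floats to int8 with one scale."""
    scale = 127.0 / np.max(np.abs(w))            # largest magnitude maps to +/-127
    q = np.clip(np.round(w * scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights, e.g. for accuracy checks."""
    return q.astype(np.float32) / scale

w = np.array([[0.76, -0.10, 1.45],
              [-2.20, 0.92, -0.89],
              [-0.01, 2.14, 1.78]], dtype=np.float32)
q, scale = quantize_int8(w)
err = np.max(np.abs(w - dequantize(q, scale)))   # worst-case rounding error
```

Dequantizing and comparing against the original weights gives a quick bound on the rounding error introduced: it can never exceed half an integer step, i.e. 0.5 / scale.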
Types of quantization
Post-training quantization
• Performed on a trained float model
• Convert weights, biases and activations to integers
• Simple to implement
• Some loss of accuracy

Quantization-aware training
• Performed during training
• The impact of quantization is validated and adjusted for as the model trains
• Post-training quantization of this model results in little or no loss of accuracy
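One common way quantization-aware training is implemented is "fake quantization": weights are rounded to the integer grid and immediately dequantized inside the forward pass, so training already sees the rounding error the deployed INT8 model will have. A minimal NumPy sketch, assuming a single linear layer (real frameworks also quantize activations and pass gradients through the rounding with a straight-through estimator):

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize-dequantize in one step: the rounding error the final INT8
    model will see is already present during training."""
    qmax = 2 ** (num_bits - 1) - 1         # 127 for int8
    scale = qmax / np.max(np.abs(w))
    return np.round(w * scale) / scale     # back to float, but on the integer grid

# Toy forward pass through one linear layer with fake-quantized weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3)).astype(np.float32)
x = rng.normal(size=(1, 4)).astype(np.float32)
y = x @ fake_quantize(w)   # training sees quantized behavior on every step
```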
Quantization performance
Retrieved from: https://www.softserveinc.com/en-us/blog/deep-learning-model-compression-and-optimization
Model pruning
• Eliminate model elements with low impact on outcomes
• Nodes
• Connections
• Layers
• Prune iteratively (increasing sparsity), testing performance at each step
• Size vs. accuracy trade-off
• Effectiveness depends on the nature of the data
• Popular ML frameworks support pruning techniques
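The iterative loop above can be sketched with simple magnitude-based pruning, where the lowest-magnitude connections are zeroed first (function names and the sparsity schedule are illustrative):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.
    (Ties in magnitude may zero slightly more; fine for a sketch.)"""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.array([[0.76, -0.10, 1.45],
              [-2.20, 0.92, -0.89],
              [-0.01, 2.14, 1.78]])

# Iteratively increase sparsity, re-testing model quality after each step.
for sparsity in (0.3, 0.5, 0.7):
    pruned = magnitude_prune(w, sparsity)
    # ...evaluate the pruned model here; stop when quality drops below threshold
```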
Types of pruning
Unstructured pruning
• Remove individual elements
• Connections
• Nodes
• Random removal with validation
• Can achieve higher size reductions, depending on the amount of pruning

Structured pruning
• Remove parts of the network
• Layers
• Channels
• Filters
• Easier process
• Benefits depend on the model
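As a sketch of structured pruning, assuming the rows of a weight matrix correspond to output channels, whole channels can be dropped by L2 norm, which actually shrinks the matrix instead of just making it sparse:

```python
import numpy as np

def prune_channels(w, keep_ratio):
    """Structured pruning sketch: drop whole output channels (rows) with the
    smallest L2 norm, returning a smaller matrix plus the kept indices."""
    norms = np.linalg.norm(w, axis=1)
    n_keep = max(1, int(round(w.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of strongest channels
    return w[keep], keep

w = np.array([[0.05, -0.02, 0.01],   # weak channel -> removed
              [1.20, -0.80, 0.64],
              [0.90, 1.10, -0.70],
              [0.03, 0.04, -0.02]])  # weak channel -> removed
smaller, kept = prune_channels(w, keep_ratio=0.5)
```

In a real network the following layer's input dimension must be shrunk to match the kept channels, which is why structured pruning is usually done layer by layer with validation in between.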
Pruning performance
Retrieved from: https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html
Low-rank approximation
• Reduce the number of parameters needed to represent the model
• Create a matrix of lower rank
• Eliminate redundant data
• Measure performance with the low-rank matrix
• Benefits vary based on the use case
• Popular ML frameworks have out-of-the-box support
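A common realization of this idea is a truncated SVD: replace an m × n weight matrix W with two thin factors, which saves parameters whenever rank × (m + n) < m × n. A minimal NumPy sketch (the matrix here is synthetic, constructed to be low-rank so the approximation is near-exact):

```python
import numpy as np

def low_rank_approx(w, rank):
    """Replace W (m x n) with U_r @ V_r: two thin factors instead of one
    dense matrix. Storage drops from m*n to rank*(m + n) parameters."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    v_r = vt[:rank, :]
    return u_r, v_r

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)) @ rng.normal(size=(32, 128))  # rank <= 32 by construction
u_r, v_r = low_rank_approx(w, rank=32)

params_before = w.size                 # 64 * 128 = 8192
params_after = u_r.size + v_r.size     # 32 * (64 + 128) = 6144
error = np.linalg.norm(w - u_r @ v_r) / np.linalg.norm(w)
```

For real trained weights the spectrum is not exactly low-rank, so the chosen rank trades reconstruction error (and hence accuracy) against parameter savings.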
Low-rank approximation performance
Retrieved from: https://www.researchgate.net/figure/Post-training-results-of-low-rank-approximation-no-fine-tuning-fine-tuning-with_tbl1_362859051
Knowledge distillation
• Train a small student model to mimic the outputs of a large teacher model
• The distillation process compares the two models' outputs for the same inputs and adjusts the student's parameters
• Smaller model footprint for the student
• Comparable accuracy / performance
• The training dataset can be use-case specific
[Diagram: a large teacher model and a small student model connected by the distillation process]
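A typical distillation objective, as a sketch: the KL divergence between temperature-softened teacher and student output distributions (the logits below are made up for illustration; real setups usually combine this with the ordinary task loss on labels):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student outputs; the
    student is trained to match the teacher's full output distribution."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

teacher = np.array([4.0, 1.0, 0.5])   # hypothetical logits for one input
student = np.array([3.5, 1.2, 0.4])
loss = distillation_loss(student, teacher)
```

The loss is zero only when the student reproduces the teacher's distribution exactly, and it grows as the two diverge, which is what drives the student's parameters toward the teacher's behavior.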
Comparison of techniques
                    Quantization   Pruning   Low-rank approx.   Knowledge distillation
Cost                Low            Low       Medium             High
During training     Yes            Yes       Yes                Yes
Post-training       Yes            Yes       Yes                Yes
Pretrained models   Yes            Yes       Yes                No
Compression process
• Create a baseline of the original model
• Parameters, training data, test results
• Set threshold levels for compression expectations
• Expected minimum accuracy, maximum resource usage
• Use an iterative approach
• Try model compression in stages
• Test with baseline training data
• Compare with baseline test results and thresholds
• Try different techniques to identify best approach
• Combining approaches is possible (e.g., quantization and pruning)
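The staged process above can be sketched as a loop. Note that `compress` and `evaluate` below are hypothetical stand-ins, not real APIs: in practice they would wrap a framework's quantization or pruning pass and an evaluation against the baseline test data:

```python
import numpy as np

# Hypothetical stand-ins, stubbed with toy arithmetic purely for illustration.
def compress(model, stage):
    return model * (1.0 - 0.1 * stage)      # pretend each stage compresses more

def evaluate(model):
    return 0.95 - 0.02 * float(np.mean(1.0 - model))   # pretend accuracy metric

baseline_model = np.ones(4)
baseline_accuracy = evaluate(baseline_model)   # baseline of the original model
min_accuracy = baseline_accuracy - 0.01        # threshold set up front

best = baseline_model
for stage in range(1, 6):                      # try compression in stages
    candidate = compress(best, stage)
    if evaluate(candidate) < min_accuracy:     # compare against the threshold
        break                                  # keep the last model that passed
    best = candidate
```

The point of the structure is that every candidate is judged against the baseline threshold, and the process keeps the most-compressed model that still meets it.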
Leveraging edge hardware
Edge specialized infrastructure
• Edge-optimized hardware
• Delivers the best performance in edge-constrained environments
• Low-end processors: microcontroller units (MCUs), neural processing units (NPUs)
• High-end processors: Google Edge TPU, NVIDIA Jetson
• Application-specific AI accelerators
• Edge frameworks
• Compile models to optimize for edge-specific hardware
• Leverage hardware-specific capabilities
• Create deployable packages for models
• E.g., NVIDIA TensorRT, Apache TVM, ONNX Runtime
Edge frameworks - benefits
• Optimize execution graph for hardware
• Reduce memory requirements
• Remove unwanted steps/instructions
• Fuse steps/instructions
• Choose best values for configuration options
• Evaluate multiple execution strategies and choose the best one
• Create an optimized executable for inference
• Package model for ready deployment
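One representative fusion these frameworks perform is folding batch normalization into the preceding linear or convolution layer, removing an entire step at inference. A NumPy sketch for a linear layer (the fold works because batch norm is just an affine transform at inference time):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((x @ w + b) - mean) / sqrt(var + eps) + beta
    into a single linear layer y = x @ w' + b'."""
    scale = gamma / np.sqrt(var + eps)   # one factor per output feature
    return w * scale, (b - mean) * scale + beta

rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 3)), rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mean, var = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)
x = rng.normal(size=(2, 4))

# Two ops (linear, then batch norm) ...
unfused = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-5) + beta
# ... become one op with folded weights.
w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
fused = x @ w_f + b_f
```

The fused layer produces identical outputs with one matrix multiply instead of two steps, which is exactly the kind of graph rewrite an edge compiler applies automatically.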
Edge compilation process
• Choose the right framework based on the deployment hardware
• E.g., TensorRT is best suited for NVIDIA processors
• Use the trained, validated and compressed model as input
• Compile the model
• Plan for multiple iterations
• Try available options for optimization/adaptation
• Validate model performance
• Use same benchmarks as compression
• Test on hardware specific development kits
• Create deployable artifact
Model optimization best practices
Best practices for optimization - 1
• Performance baselines and goals need to be established and validated
throughout the process
• Accuracy, latency, FLOPS, etc., based on the use case
• Helps ensure that the model performs as desired while going through optimizations
• Track results against baseline for all model training iterations over time
• Choose hardware / frameworks when beginning model training
• Deployment infrastructure may impact model architecture and optimizations needed
• Dependency / overlaps can be understood ahead of time
• Multiple deployment options may need to be supported
Best practices for optimization - 2
• Include edge hardware development kits/emulators as part of the training
lifecycle
• Automate optimization, compilation and testing
• Use similar hardware configurations as deployment
• Include collaborating edge applications in end-to-end testing as well
• Automate validating results and model promotion
• Monitor deployment performance
• Some optimizations may have a negative impact when deployed on actual hardware
• Monitor performance and validate against set baseline
• Improve models based on experience
Thank You
