Tuning For Deep Learning Inference with Intel® Processor Graphics | SIGGRAPH 2018 Tech Session
Salil Tambe, Adobe
Manuj Sabharwal, Intel
Agenda
• Why Deep Learning inference on the client
– Advantages of running intelligence on the client
– Inference use cases on edge platforms
• Tools to get the best from Intel® Processor Graphics
• Case Study
• Emerging Deep Learning Technologies
• Summary
Advantages of running intelligence on the client
• Trust & privacy
• Network bandwidth
• Responsiveness
• Service cost
Intelligence use cases on the client – Examples
Cyberlink* PowerDirector – Style Transfer2
Unity* ML Agents - Bringing intelligence to game on client4
1https://theblog.adobe.com/premiere-pro-updates-spring-2018/
2https://www.cyberlink.com/products/creative-design-packs/ai_en_US.html
3 https://github.com/msmsajjadi/frvsr
4https://github.com/Unity-Technologies/ml-agents
1Color Match powered by Adobe® Sensei™
3State of the Art – Frame-Recurrent Video Super-Resolution – Sajjadi et al.
*Other names and brands may be claimed as the property of others.
Baseline Performance with Use Cases – (What you may have heard)
• None of the typical topologies hits real-time inference
• Standard framework (e.g., Caffe/Caffe2) with Intel® MKL BLAS, using Intel® AVX2 instructions
• Optimization opportunities are available
[Chart: Baseline CPU perf (FPS) with MKL BLAS integrated in a standard framework — typical topologies (ResNet-152, SqueezeNet-1.0, VGG19, GoogleNet-v4) and 3rd-party ISV projects (ISV-P1, ISV-P2, ISV-P3), all below ~20 FPS.]
Standard image input used as described in the “state of the art” algorithm.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the
performance of that product when combined with other products. Configurations: - Intel® 6th Generation Core i7-6700 CPU Processor with Intel HD Graphics. OS : Windows* 10
Benchmark results were obtained post implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown.” Implementation of these updates may make these results inapplicable to your device or system.
Performance falls short of the requirements for real-time use cases
*Other names and brands may be claimed as the property of others.
OpenVINO™ Toolkit
Consistent workflow to enable inference on all IA
[Workflow diagram]
• Trained models: Caffe*, TensorFlow*, MxNet*, ONNX (.onnx)
• Model Optimizer: convert & optimize to fit all targets → produces IR (.xml topology + .bin weights)
• Inference Engine Common API (C++): load, infer
• Plugins (available today): CPU Plugin, GPU Plugin, FPGA Plugin, Myriad-2 Plugin, GNA Plugin, Hetero plugin
• Extendibility: C++, OpenCL
• Flow: Convert → Generic optimization → Target optimization → Compile → Execute
 Enable our customers to deploy a trained model on all Intel architecture:
 CPU, GPU, Intel® Movidius™ Myriad™ 2, and other Intel accelerators
 Optimize for best execution
 Enable users to validate and tune
 Easy-to-use runtime API across all devices
*Other names and brands may be claimed as the property of others.
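To make the IR hand-off concrete: a minimal sketch (not from the deck; the file names and wrapper function are placeholders) of loading the Model Optimizer's .xml/.bin output with the Inference Engine's CNNNetReader:

#include <inference_engine.hpp>

// Minimal sketch: load an IR pair produced by the Model Optimizer.
// "model.xml" / "model.bin" are hypothetical file names.
InferenceEngine::CNNNetwork loadIR() {
    InferenceEngine::CNNNetReader reader;
    reader.ReadNetwork("model.xml");   // .xml — network topology
    reader.ReadWeights("model.bin");   // .bin — weights
    return reader.getNetwork();
}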
Why Intel® Processor Graphics?
https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
• On millions of SoCs
• Free for consumers  no need to buy special HW
• Better battery life, depending on the use case
• Better performance, depending on the use case
Using the Inference Engine API
(this is working code!)
IRSerializer reader;
reader.Open("GoogleNet.xml");
reader.ReadWeights("GoogleNet.bin");
Network net = reader.GetNetwork();
InfEng engine(TargetDevice::GPU);
auto exec = engine.Compile(net);
// optionally compilation can be saved:
exec.save("GoogleNet.nnx");
auto inData = make_shared_blob<float>({ 0.1f,0.2f,0.3f,0.4f,0.5f });
auto outData = exec.Infer(inData);
const LockedMemory<const float> pfOut = outData->readOnly();
// read output …
1. Read network from IR
2. Select Target
3. Compile & Load
4. Execute…
Advantages of using OpenVINO™
• Easy installation on Windows* and Linux*
• Visual Studio* integration
• C++ Support
• Extensibility support
• Developers can add new primitives to support new operators
• One API support across different hardware
• Size overhead: < 10 MB of additional files to enable the Intel GPU path
• The Model Optimizer can fuse layers, giving a significant improvement in file size
*Other names and brands may be claimed as the property of others.
Performance using clDNN kernels through OpenVINO™
Performance increase due to Intel GPU (FPS; higher is better):

Topology          Baseline   clDNN (FP16)
ResNet-152           3.8        17.5
SqueezeNet-1.0      19.0       173.7
VGG19                2.3        18.1
GoogleNet-v4         2.3        13.1
ISV-P1               3.0        27.0
ISV-P2               2.0        20.5
ISV-P3               1.0         9.9
5-10x performance increase with an optimized software stack
Optimized Software + Hardware == Better Performance
Intel GPU provides a 5-10x performance increase vs. baseline, with no extra hardware cost and low power
Standard image input used as described in the “state of the art” algorithm.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. Configurations: - Intel® 6th Generation Core i7-6700 CPU Processor with Intel HD Graphics. OS : Windows* 10
Benchmark results were obtained post implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown.” Implementation of these updates may make these results inapplicable to your device or system.
*Other names and brands may be claimed as the property of others.
Adobe® Deep Matte: Proof-of-concept optimizations on Intel GPU using OpenVINO™
*Other names and brands may be claimed as the property of others.
Select & Mask in Photoshop*
*Other names and brands may be claimed as the property of others.
Select & Mask in Photoshop*
*Other names and brands may be claimed as the property of others.
Matting in Photoshop*
*Other names and brands may be claimed as the property of others.
Deep Matting
[Figure: image input and output]
Xu et al. Deep Image Matting. CVPR 2017.
[Figure: input image and trimap, with Photoshop matte vs. Deep Matte — Deep Matte shows correct matting for hair, while in the Photoshop matte the hair region should be white but is not.]
Tech Transfer Challenges
• Resolution (320 x 320)
• Model size (80 MB)
• Memory
• Runtime performance
• Cross platform support
Image Credits: Deep Image Matting [Xu et al.; CVPR 2017]
Inference per Tile
Encoder: Conv1 [320 x 320 x 64] → Conv2 [160 x 160 x 128] → Conv3 [80 x 80 x 256] → Conv4 [40 x 40 x 512] → Conv5 [20 x 20 x 512]
Decoder: Deconv5 [20 x 20 x 512] → Deconv4 [40 x 40 x 256] → Deconv3 [80 x 80 x 128] → Deconv2 [160 x 160 x 64] → Deconv1 [320 x 320 x 64] → Alpha Prediction [320 x 320 x 1]
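Since the network runs on fixed 320 x 320 tiles, a full-resolution image has to be split into tiles, inferred tile by tile, and the per-tile alpha stitched back together. A minimal sketch of that loop under stated assumptions — the Image type and inferTile() are hypothetical placeholders, not the Adobe or OpenVINO API:

#include <algorithm>
#include <vector>

constexpr int kTile = 320;  // fixed network input resolution (per the slide)

struct Image { int w = 0, h = 0; std::vector<float> px; };  // 1-channel, hypothetical

// Placeholder for one forward pass of the matting network on a 320x320 tile.
Image inferTile(const Image& tile);

// Split the input into tiles, infer each, and stitch the alpha matte back.
Image inferPerTile(const Image& input) {
    Image alpha{input.w, input.h, std::vector<float>(size_t(input.w) * input.h, 0.f)};
    for (int y0 = 0; y0 < input.h; y0 += kTile) {
        for (int x0 = 0; x0 < input.w; x0 += kTile) {
            const int tw = std::min(kTile, input.w - x0);
            const int th = std::min(kTile, input.h - y0);
            Image tile{kTile, kTile, std::vector<float>(kTile * kTile, 0.f)};  // zero-padded at edges
            for (int y = 0; y < th; ++y)
                for (int x = 0; x < tw; ++x)
                    tile.px[y * kTile + x] = input.px[size_t(y0 + y) * input.w + (x0 + x)];
            const Image out = inferTile(tile);
            for (int y = 0; y < th; ++y)
                for (int x = 0; x < tw; ++x)
                    alpha.px[size_t(y0 + y) * input.w + (x0 + x)] = out.px[y * kTile + x];
        }
    }
    return alpha;
}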
Refinement Network
[Diagram: a refinement (“fine”) network whose output is combined (+) with the coarse matte to produce the Final Matte.]
Baseline performance
Initial implementation performance
• 2.35 seconds per frame
• Memory utilized: ~3 GB
Common issues
 Threading is not optimal
 Reordering of data structures between layers increases compute time
Standard image input used as described in the “state of the art” algorithm.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel® 8th Generation 8250U (4C:8T) CPU Processor with Intel HD Graphics. OS: Windows* 10
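To illustrate the reordering issue above: a common between-layer reorder is a tensor layout conversion such as NCHW → NHWC. A hypothetical sketch (not the Adobe code) — each such copy is pure memory traffic, and repeating it around every layer inflates compute time:

#include <cstddef>
#include <vector>

// Hypothetical example of a between-layer data reorder: NCHW -> NHWC.
std::vector<float> nchwToNhwc(const std::vector<float>& src,
                              size_t n, size_t c, size_t h, size_t w) {
    std::vector<float> dst(src.size());
    for (size_t in = 0; in < n; ++in)
        for (size_t ic = 0; ic < c; ++ic)
            for (size_t ih = 0; ih < h; ++ih)
                for (size_t iw = 0; iw < w; ++iw)
                    dst[((in * h + ih) * w + iw) * c + ic] =      // NHWC index
                        src[((in * c + ic) * h + ih) * w + iw];   // NCHW index
    return dst;
}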
Optimizing for Intel® Processor Graphics
Supporting custom layers
Custom layers are layers that are not included in the list of known layers. If your topology contains any layers that are not in the list, the Model Optimizer classifies them as custom.
Solution
• Register those layers as extensions to the Model Optimizer
• Register rules to pass extension-layer properties from the model to the IR
• New primitives are now part of OpenVINO™
Step 1: convert the model to Intermediate Representation (IR)
class pooling(Op):
    op = 'Pooling'

    def __init__(self, graph, attrs):
        super().__init__(graph, {
            'type': __class__.op,
            'op': __class__.op,
            'infer': pooling.pooling_infer},
            attrs)

    def supported_attrs(self):
        return ['stride-y', 'stride-x', 'pad-y', 'pad-x',
                'kernel-y', 'kernel-x', 'pool-method', 'rounding-type']

class CustomPoolingFrontExtractor(FrontExtractorOp):
    op = 'Pooling'

    @staticmethod
    def extract(node):
        proto_layer = node.pb
        param = proto_layer.pooling_param
        mapping_rule = {
            'stride-y': param.stride,
            'stride-x': param.stride,
            'type': 'Pooling',
            'pool': param.pool,
            'kernel-y': param.kernel_size,
            'kernel-x': param.kernel_size,
            'pool-method': "max",
            'rounding-type': "ceil",
        }
        # (The slide truncates here: the extractor would apply mapping_rule
        # to the node and return an "enabled" flag.)
Step 2: integrating code into the Inference Engine
InferenceEngine::CNNNetReader net_reader;
net_reader.ReadNetwork(model.data(), model.length()); // Read model from memory, or load a file directly
// 256 is a size in bytes!
InferenceEngine::TBlob<uint8_t> *weights = new InferenceEngine::TBlob<uint8_t>(InferenceEngine::Precision::U8, InferenceEngine::C, {256});
weights->allocate();
fill_data((float *) weights->buffer(), weights->size() / sizeof(float)); // Fill weights from model
InferenceEngine::TBlob<uint8_t>::Ptr weights_ptr = InferenceEngine::TBlob<uint8_t>::Ptr(weights);
net_reader.SetWeights(weights_ptr);
InferenceEngine::CNNNetwork network = net_reader.getNetwork();
InputsDataMap inputInfo(network.getInputsInfo()); // Stores all input blobs data
BlobMap inputBlobs;
ExecutableNetwork executable_network = plugin.LoadNetwork(network, {});
InferRequest infer_request = executable_network.CreateInferRequest();
// Fill input image to data buffer
data[image_id * image_size * num_channels + ch * image_size + pid] = imagesData.at(image_id).get()[pid*num_channels + ch];
// Start inference (this can be an async infer too)
infer_request.Infer();
const Blob::Ptr output_blob = infer_request.GetBlob(firstOutputName);
const auto output_data = output_blob->buffer().as<float*>();
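The comment above notes the call can also be asynchronous; a minimal sketch of the async variant with the same Inference Engine API (reusing infer_request from the code above):

// Asynchronous variant: start inference, then block until the result is ready
// (a completion callback could be used instead of Wait).
infer_request.StartAsync();
infer_request.Wait(InferenceEngine::IInferRequest::WaitMode::RESULT_READY);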
How performance looks with OpenVINO™ using clDNN
OpenVINO™ SDK enables the Intel GPU path without changing the API call
 Free performance boost on millions of PCs shipped with an integrated GPU
Enabled custom kernel through device code
 An FP16 implementation should give up to a 1.8x perf boost

                      CPU            GPU (FP32)
Package (SoC) Power   26.9 W         21.9 W
Performance           2.35 seconds   291 msec

22% less power and 8x faster with the Intel GPU solution compared to the CPU, with no additional hardware cost
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products. Configurations: - Intel® 8th Generation 8250U (4C:8T) CPU Processor with Intel HD Graphics. OS : Windows* 10
How about memory usage?
[Chart: memory usage, default implementation vs. using OpenVINO™ — ~700 MB used for a 320x320 image with OpenVINO™, down from ~3 GB in the default implementation.]
OpenVINO™ optimized the model for memory efficiency, which is important for client platforms
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products. Configurations: - Intel® 8th Generation 8250U (4C:8T) CPU Processor with Intel HD Graphics. OS : Windows* 10
Examples of Emerging AI Technologies – Federated Learning*
Federated learning, or collaborative learning, is where multiple devices participate in the machine learning process (training or inferencing). Federated learning decouples storage from the machine learning process.
* Concept introduced by Google (April 2017)
[Diagram: a cloud server coordinating Devices A, B, and C]
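A rough sketch of the idea — illustrative only, following Google's federated-averaging scheme; localUpdate() is a hypothetical on-device training step, and raw data never leaves a device:

#include <vector>

using Weights = std::vector<float>;

Weights localUpdate(const Weights& global);  // hypothetical: train on local data

// One federated round: each device refines the global model locally;
// the cloud averages only the returned weight updates.
Weights federatedRound(const Weights& global, int numDevices) {
    Weights avg(global.size(), 0.0f);
    for (int d = 0; d < numDevices; ++d) {   // e.g., devices A, B, C
        const Weights w = localUpdate(global);
        for (size_t i = 0; i < w.size(); ++i)
            avg[i] += w[i] / numDevices;
    }
    return avg;  // new global model
}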
End-to-end deep learning on Intel
[Diagram: end-to-end pipeline — Capture → Transmit (Wi-Fi / 5G / Ethernet) → Ingest → Store → Process → Analyze → Process → Distribute → Deliver → Consume, with on-device processing at the capture and consumption ends and edge processing & storage at ingest and delivery.]
Summary
Using optimized libraries, software stacks, and SDKs can give a 7-10x performance boost on the same hardware
Intel GPU can give a performance boost at lower power compared to the CPU, with no additional cost to the consumer/developer (depending on the topology)
 Boost in performance on millions of PCs shipped with Intel® Processor Graphics
OpenVINO™ makes it simple for developers to get started on Windows*
 It provides better performance and efficient memory use for deployment
Heterogeneous support for running multiple models on different IP
Call to developers/communities
 Re-think bringing inference offline to the Intel GPU
*Other names and brands may be claimed as the property of others.
Tuning For Deep Learning Inference with Intel® Processor Graphics | SIGGRAPH 2018 Tech Session
References
• https://software.intel.com/en-us/openvino-toolkit
• https://github.com/intel/clDNN
• https://github.com/intel/mkl-dnn
• https://software.intel.com/en-us/articles/accelerate-deep-learning-inference-with-integrated-intel-processor-graphics-rev-2-0
• https://software.intel.com/en-us/articles/background-on-ai-and-the-move-to-the-edge
• https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
Acronyms
• Compute Library for Deep Neural Networks (clDNN)
• Intel® Math Kernel Library (Intel® MKL)
• Intel® Advanced Vector Extensions 2 (Intel® AVX2)
• Intel Architecture (IA)
• Intermediate Representation (IR)
• Independent software vendor (ISV)
Legal Disclaimers and Optimization Notices
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation.
Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or
retailer or learn more at intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products. For more information go to
www.intel.com/benchmarks
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred
to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.
Intel, Openvino, Movidius, Myriad and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.
Editor's Notes
• #7: MxNet = Amazon. .XML – NN topology; .bin – weights (array of floats); .nif – topology; .data – weights in a .tar file format. Currently IR format, moving to NIF in Q2. Load, infer = load model and do inference. Common API for C++ applications. Hetro = heterogeneous plug-in – can run part on FPGA, part on GNA, … GNA plug-in was released to Amazon; GNA-s will be ready later, TBD. Generic Optimization – on the model. Target Optimization – per inference plug-in. Each plug-in – target optimization, compile (in the inference engine [e.g., layer descriptors for GNA], per target), and execute.
• #29: What it takes to win – end to end, etc. Ubiquity: Xeon footprint + network + dev community reach (optimization).