Bridging the Gap:
Streamlining the Process of
Deploying AI onto Processors
Taesu Kim
CTO
SqueezeBits Inc.
The Challenge of AI Deployment
• Supporting diverse models
• Computer vision
• Larger models (LLMs, diffusion, ...)
• Multiple hardware targets (GPUs, mobile, ...)
• Manual conversion scripts needed
• Innovation is getting slowed down
© 2025 SqueezeBits Inc.
[Figure: New Models & HWs]
Model-Agnostic Conversion Process
• PyTorch 2.0 provides several tools that support model-agnostic deployment
• TorchDynamo: Python-level just-in-time compiler
• TorchInductor: Fast codegen with loop-level IR
• AOTAutograd: Ahead-of-time graph tracer / deep learning compiler integration
• Robust and fast, but sometimes harder to use
[Figure: User Model Script → High Level Computation Graph (Torch IR) → Low Level Computation Graph (ATen / Prim IR) → Compiled Graph]
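To make the graph-capture step concrete, here is a toy sketch of how running a model function on proxy values can record a computation graph. It is an illustration only: the `Proxy` class, the `trace` function, and the node format are hypothetical and far simpler than what TorchDynamo or AOTAutograd actually do.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Graph:
    # Each node is (output_name, op_name, input_names).
    nodes: List[Tuple[str, str, Tuple[str, ...]]] = field(default_factory=list)

    def add(self, op: str, inputs: Tuple[str, ...]) -> str:
        name = f"v{len(self.nodes)}"
        self.nodes.append((name, op, inputs))
        return name


class Proxy:
    """Stands in for a tensor and records every operation applied to it."""

    def __init__(self, graph: Graph, name: str):
        self.graph = graph
        self.name = name

    def _binary(self, op: str, other: "Proxy") -> "Proxy":
        out = self.graph.add(op, (self.name, other.name))
        return Proxy(self.graph, out)

    def __add__(self, other: "Proxy") -> "Proxy":
        return self._binary("add", other)

    def __mul__(self, other: "Proxy") -> "Proxy":
        return self._binary("mul", other)


def trace(fn: Callable[..., Proxy], arg_names: List[str]) -> Graph:
    """Run fn once on proxies and return the recorded graph."""
    graph = Graph()
    fn(*[Proxy(graph, n) for n in arg_names])
    return graph


def model(x, w, b):
    # Hypothetical "user model script": y = x * w + b
    return x * w + b


graph = trace(model, ["x", "w", "b"])
for node in graph.nodes:
    print(node)
```

The recorded node list is the kind of intermediate representation that a later stage could lower and compile without knowing anything about the original model script.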
Our solution: OwLite
• Native integration with PyTorch
• Supports all PyTorch operators
• Multiple precisions, formats, and quantization algorithms
• E.g., INT8, FP8 (E4M3, E3M4)
• Layer-wise fine-grained quantization
• Applicable through a simple UI
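As a rough illustration of what INT8 quantization does to a tensor, the sketch below applies symmetric per-tensor quantization to a short list of weights. The function names and sample values are assumptions made for this example; the slides do not state which quantization algorithms OwLite applies, and real tools typically choose scales per layer or per channel.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [qi * scale for qi in q]


weights = [0.5, -1.27, 0.02, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why a smaller dynamic range (for example, a scale chosen per layer) usually costs less accuracy.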
Our solution: OwLite
• Quantization-aware training (QAT) support
• Compressed models can be trained again to recover accuracy.
• Users can reuse their own data loaders and training scripts.
• Fine-tuned models can be deployed to target devices with the same configuration.
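Quantization-aware training is commonly implemented with fake quantization in the forward pass and a straight-through estimator in the backward pass. The toy sketch below shows that idea on a single scalar weight; it is an assumption about the general technique, not a description of OwLite's internals, and the data, step size, and learning rate are made up.

```python
def fake_quant(w: float, scale: float) -> float:
    """Quantize-dequantize: snap w to the nearest multiple of scale."""
    return round(w / scale) * scale


xs = [1.0, 2.0, 3.0]
ys = [0.4 * x for x in xs]   # toy targets; the ideal weight is 0.4

w = 1.0        # latent full-precision weight kept during training
scale = 0.1    # hypothetical quantization step
lr = 0.05

for step in range(20):
    wq = fake_quant(w, scale)   # forward pass uses the quantized weight
    # Gradient of the mean squared error with respect to wq.
    grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Straight-through estimator: treat d(wq)/d(w) as 1 and update w directly.
    w -= lr * grad

print(w, fake_quant(w, scale))
```

Training stops changing `w` once its quantized value matches the target, so the weight that gets deployed is the one the loss was actually computed with.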
Our solution: OwLite
[Figure: PyTorch operators (torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.SiLU, torch.nn.Conv2d, +) mapped to TensorRT Fused Kernel 1 and TensorRT Fused Kernel 2]
Supports Diverse Hardware
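One reason a chain such as Conv2d, BatchNorm2d, and SiLU can run as a single kernel is that inference-time batch normalization is an affine transform, so it can be folded into the convolution's weight and bias. The sketch below checks this on a scalar stand-in for one channel. Function names and numbers are hypothetical, and it does not describe how TensorRT implements fusion internally.

```python
import math


def conv_bn_silu(x, w, b, mean, var, gamma, beta, eps=1e-5):
    """Reference: three separate ops on a scalar (stand-in for one channel)."""
    y = w * x + b                                          # convolution (1x1, one channel)
    y = (y - mean) / math.sqrt(var + eps) * gamma + beta   # batch norm (inference mode)
    return y / (1.0 + math.exp(-y))                        # SiLU: y * sigmoid(y)


def fold_bn(w, b, mean, var, gamma, beta, eps=1e-5):
    """Fold batch-norm statistics into the convolution weight and bias."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta


def fused(x, w_f, b_f):
    """One affine op plus the activation, as a fused kernel might compute it."""
    y = w_f * x + b_f
    return y / (1.0 + math.exp(-y))


# Hypothetical parameters for a single channel.
params = dict(w=0.8, b=0.1, mean=0.05, var=0.25, gamma=1.2, beta=-0.3)
w_f, b_f = fold_bn(**params)

reference = conv_bn_silu(2.0, **params)
fused_out = fused(2.0, w_f, b_f)
print(reference, fused_out)
```

Both paths produce the same value up to floating-point rounding, while the fused path reads and writes the activation only once.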
OwLite in Vision Applications
• Available tasks (examples):
• Image classification, object detection, image segmentation, text classification, re-identification, face landmark detection, pose estimation, and many more
• Supports models with up to 1B parameters
• Models with too many nodes to visualize are currently not supported.
• Bring your own model!
• Even supports transformer-based ones!
(Tested on an NVIDIA A6000, TensorRT)
Consider Deployment from Model Training Stage
• Models must be trained considering their performance upon deployment.
• Larger models with low precision can outperform smaller models.
• Rapid prototyping and validation are crucial.
[Figure: lifecycle diagram with the stages Data Ingestion, Model Training, Model Evaluation, Model Deployment, and Service Monitoring]
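The point about larger low-precision models can be framed as a memory budget question. The arithmetic sketch below estimates weight storage only; the model sizes are hypothetical, and it ignores activations, caches, and the small overhead of quantization scales. It says nothing about accuracy, which has to be measured.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


small_fp16 = weight_memory_gb(4e9, 16)  # hypothetical 4B-parameter model in FP16
large_int8 = weight_memory_gb(8e9, 8)   # hypothetical 8B-parameter model in INT8
print(small_fp16, large_int8)
```

Under these assumptions, a model with twice the parameters at half the bit width fits in the same weight memory, which is why precision is worth deciding at training time rather than after it.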
Ditto: Model-Agnostic Converter for LLMs
• Model-agnostic converter for LLMs
• Currently supports TensorRT-LLM for NVIDIA GPUs
• No need for a hand-coded conversion script!
• Converts models in the Transformers library to TensorRT-LLM engines
• Diverse graph optimizations to support LLM-specific features
[Figure: Hugging Face LLM model to TensorRT-LLM engine, comparing the path through a predefined TRT network and handwritten checkpoint conversion with the path through Ditto]
Fits on Chips: Revolutionizing LLM Deployment
• “Click, Benchmark, Deploy.”
• Diverse serving frameworks & hardware
• vLLM (NVIDIA GPUs, Intel Gaudi)
• TensorRT-LLM (NVIDIA GPUs with Ditto)
• More to come (SGLang for GPUs, etc.)
• Tool for non-expert users
• Helps optimize LLM serving and reduce your LLM serving cost!
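Comparing serving configurations usually comes down to cost per generated token. The sketch below ranks a few benchmark results by that metric. The `BenchmarkResult` type, the configuration names, and every number are placeholders invented for illustration; they are not measurements and do not reflect the Fits on Chips interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkResult:
    name: str                 # label for a framework/hardware/precision combination
    tokens_per_second: float  # measured serving throughput
    dollars_per_hour: float   # hourly price of the hardware


def cost_per_million_tokens(r: BenchmarkResult) -> float:
    """Convert throughput and hourly price into dollars per one million tokens."""
    tokens_per_hour = r.tokens_per_second * 3600
    return r.dollars_per_hour / tokens_per_hour * 1e6


# Placeholder numbers for illustration only; they are not measurements.
results: List[BenchmarkResult] = [
    BenchmarkResult("config-a", tokens_per_second=1000.0, dollars_per_hour=3.6),
    BenchmarkResult("config-b", tokens_per_second=2500.0, dollars_per_hour=4.5),
    BenchmarkResult("config-c", tokens_per_second=500.0, dollars_per_hour=1.35),
]

best = min(results, key=cost_per_million_tokens)
print(best.name, cost_per_million_tokens(best))
```

In this made-up data the most expensive hardware per hour is the cheapest per token, which is the kind of result that only shows up when several configurations are benchmarked side by side.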
Conclusions
• Reduce development time with model-agnostic deployment pipelines.
• Optimize performance by embedding deployment considerations into the training stage.
• Cut serving costs dramatically by exploring a wide range of configuration options.
• Leverage existing tools to streamline and accelerate your deployment workflow.
Try It Now!
• Our deployment pipelines are available as both open-source software and SaaS toolkits.
• Start deploying your own models today with OwLite and Fits on Chips.
• OwLite has a free tier for developers (come visit us at our booth #817!)
• Fits on Chips is free to use. Try it now!
Resources
OwLite (Quantization and Deployment): https://owlite.ai
Fits on Chips (LLM Deployment): https://fitsonchips.ai
Torch-TRTLLM (Ditto, Open Source): https://github.com/SqueezeBits/Torch-TRTLLM
SqueezeBits Tech Blog: https://blog.squeezebits.com
Come visit us at booth #817 for a demo!
