“Image Tokenization for Distributed Neural Cascades,” a Presentation from Google and VeriSilicon

Image Tokenization for
Distributed Neural Cascades
Derek Chow
Software Engineer, Google
Shang-Hung Lin
Vice President of NPU Technology
VeriSilicon

What is Tokenization?
Tokenization is the process of converting a sensor modality into a neural encoding.
© 2025 VeriSilicon and Google
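
As an illustration of the definition above, here is a minimal sketch of one common image tokenization scheme: split the image into fixed-size patches and project each patch to an embedding vector. The patch size, embedding width, and random projection matrix are assumptions made for the example; the slides do not specify a particular tokenizer design.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def tokenize(image: np.ndarray, projection: np.ndarray, patch: int) -> np.ndarray:
    """Map each patch to one embedding vector (one token per patch)."""
    patches = patchify(image.astype(np.float32) / 255.0, patch)
    return patches @ projection

rng = np.random.default_rng(0)
PATCH, DIM = 16, 64  # hypothetical patch size and token width
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
# Random weights stand in for a trained projection.
projection = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, DIM)).astype(np.float32)

tokens = tokenize(image, projection, PATCH)
print(tokens.shape)  # (196, 64): a 14 x 14 grid of patches, 64 values each
```

A trained tokenizer would replace the random projection with learned weights, typically followed by further encoder layers.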

Examples of Tokenizers

Tokenizer is a Feature Extractor
ResNet101
Classification Detection
Segmentation
• Serves as a feature extractor for a
neural network
• Enables features like classification,
generation, RAG

Multimodal AI

SigLIP / Gemma

Tokenization Creates a Form of Data Compression
• Tokenizer and detokenizer act as a Codec
• Saves power during transmission
• Saves capacity at rest
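
To make the codec idea concrete, here is a minimal sketch using vector quantization, one common way to build a discrete tokenizer/detokenizer pair. The slides do not say which codec is used; the codebook size, feature width, and random data below are all assumptions for illustration, and the resulting ratio applies only to this hypothetical configuration (the encoding is lossy).

```python
import numpy as np

def vq_encode(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Tokenizer side: replace each feature vector by the index of its nearest code."""
    # Squared Euclidean distance between every feature and every code.
    dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1).astype(np.uint16)

def vq_decode(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Detokenizer side: look each index back up in the shared codebook."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 32)).astype(np.float32)  # hypothetical, shared by both ends
features = rng.normal(size=(196, 32)).astype(np.float32)   # stand-in for per-patch features

indices = vq_encode(features, codebook)   # this is what gets transmitted or stored
recon = vq_decode(indices, codebook)      # approximate features on the receiving side

raw_bytes = 224 * 224 * 3                 # uncompressed 8-bit RGB image
token_bytes = indices.nbytes              # 196 indices * 2 bytes each
ratio = raw_bytes / token_bytes
print(raw_bytes, token_bytes, ratio)      # 150528 392 384.0
```

Only the indices cross the link or sit in storage, which is where the transmission and at-rest savings come from in this sketch.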

Diverse Hardware Ecosystem
Compute   Memory   Bandwidth
High      High     High
Medium    Medium   Medium
Medium    Low      Low
Low       Low      Low
Low       Low      Low

World’s Leading Smart Home Products

Can we combine the strengths of
multiple devices for GenAI experiences?
We think yes.

Anatomy of a Neural Cascade
[Diagram labels: Tokenizer (x3), Image Tokens (x2), Gating Model, outcomes Yes / Yes / No]

Building a Large Gating Model
• We can build a gating model using a VLM
• Provide a prompt describing what you
want to detect, e.g., “Is there an animal
present?”
• Feed tokenized image into VLM
• Check probability of emitting “Yes” or “No”
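[Diagram: the prompt “Is there an animal present?” goes through a text embedder and the image goes through the image tokenizer; both feed the VLM, which outputs P(“Yes”) and P(“No”). Together these form the VLM-based gating model.]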
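
The last step of the list above, checking the probability of “Yes” versus “No”, can be sketched as follows. This assumes you can read the VLM's next-token logits for the two answer tokens; how to obtain them is model specific and not shown. The function names and the 0.5 threshold are hypothetical.

```python
import math

def gate_probability(logit_yes: float, logit_no: float) -> float:
    """P("Yes") renormalized over the two answer tokens only."""
    m = max(logit_yes, logit_no)          # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def gate(logit_yes: float, logit_no: float, threshold: float = 0.5) -> bool:
    """Open the gate when the VLM leans towards "Yes"."""
    return gate_probability(logit_yes, logit_no) >= threshold

# Hypothetical logits, as if read from the VLM for one tokenized image.
print(gate(2.0, -1.0))   # True
print(gate(-1.0, 2.0))   # False
```

Raising the threshold trades missed detections for fewer false triggers; the right value depends on the application.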

Distilling a Smaller Gating Model
[Diagram: the teacher gating model (text embedder, image tokenizer, VLM) answers “Is there an animal present?” with P(“Yes”) and P(“No”); these outputs drive gradient updates to the student gating model.]
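
A minimal sketch of the distillation loop in the diagram, assuming the student is a small logistic-regression model over pooled image tokens and the teacher's P(“Yes”) is used as a soft label. Both the student architecture and the synthetic data are assumptions for illustration; the slides do not describe the actual student model or training setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_cross_entropy(p_teacher, p_student, eps=1e-7):
    """Cross-entropy of the student against the teacher's soft labels."""
    p_student = np.clip(p_student, eps, 1.0 - eps)
    return float(-np.mean(p_teacher * np.log(p_student)
                          + (1.0 - p_teacher) * np.log(1.0 - p_student)))

def distill(features, p_teacher, steps=200, lr=0.5):
    """Fit the student to the teacher's probabilities by gradient descent."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    losses = []
    for _ in range(steps):
        p = sigmoid(features @ w + b)
        losses.append(soft_cross_entropy(p_teacher, p))
        grad = p - p_teacher              # d(loss)/d(logit) for sigmoid + cross-entropy
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b, losses

rng = np.random.default_rng(0)
features = rng.normal(size=(256, 16))      # stand-in for pooled image tokens
hidden_w = rng.normal(size=16)
p_teacher = sigmoid(features @ hidden_w)   # stand-in for the teacher VLM's P("Yes")

w, b, losses = distill(features, p_teacher)
agreement = np.mean((sigmoid(features @ w + b) > 0.5) == (p_teacher > 0.5))
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, agreement {agreement:.2f}")
```

Training on the teacher's probabilities rather than hard labels means no manual annotation is needed: the prompt given to the teacher defines the task.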

Composing Models
[Diagram: on the embedded device, the image tokenizer passes image tokens to the distilled animal detector; image tokens are also passed to the VLM with the prompt “Describe what the animal is doing”, and the VLM answers “The squirrel is eating your avocado!”]
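
The control flow of such a cascade can be sketched as below. All three components are hypothetical stand-ins passed in as callables: a real system would plug in the on-device tokenizer, the distilled detector, and a remote VLM call.

```python
from typing import Callable, List, Optional

Tokens = List[float]

def run_cascade(image,
                tokenizer: Callable[[object], Tokens],
                gate: Callable[[Tokens], bool],
                cloud_vlm: Callable[[Tokens, str], str],
                prompt: str) -> Optional[str]:
    """Run the cheap stages on-device and call the large model only if the gate opens."""
    tokens = tokenizer(image)            # on the embedded device
    if not gate(tokens):                 # small distilled detector, also on-device
        return None                      # nothing is transmitted
    return cloud_vlm(tokens, prompt)     # only tokens leave the device, not pixels

# Toy stand-ins so the sketch runs end to end.
def toy_tokenizer(image) -> Tokens:
    return [float(sum(row)) for row in image]

def toy_gate(tokens: Tokens) -> bool:
    return max(tokens) > 1.0

sent = []  # records how many tokens were sent to the "cloud"

def toy_cloud_vlm(tokens: Tokens, prompt: str) -> str:
    sent.append(len(tokens))
    return f"(VLM answer to {prompt!r} from {len(tokens)} tokens)"

PROMPT = "Describe what the animal is doing"
quiet = [[0, 0], [0, 0]]
busy = [[1, 1], [0, 1]]
print(run_cascade(quiet, toy_tokenizer, toy_gate, toy_cloud_vlm, PROMPT))  # None
print(run_cascade(busy, toy_tokenizer, toy_gate, toy_cloud_vlm, PROMPT))
```

Because the tokenizer output is reused by both the gate and the VLM, the image is encoded once and the expensive model runs only on gated frames.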

Cascades Beyond Two Devices
[Diagram labels: Image Tokens, Audio Tokens, Health Tokens, RAG Queries]

Squeezing Neural Cascade Frontend into Small Devices
• Knowledge distillation
• Quantization
• Sparsity, weight sharing
• Hybrid architecture
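
As one example from the list above, quantization can be sketched with symmetric per-tensor int8 rounding. This is a generic textbook scheme chosen for illustration, not necessarily the one used in the presenters' toolchain.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w is approximated by q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:                      # all-zero tensor: any scale works
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in for a layer's weights

q, scale = quantize_int8(weights)
max_err = float(np.abs(dequantize(q, scale) - weights).max())
print(weights.nbytes // q.nbytes)   # 4: int8 storage is a quarter of float32
print(max_err <= scale / 2 + 1e-6)  # True: rounding error is at most half a step
```

Per-channel scales or lower bit widths follow the same pattern with a different choice of `scale` and clipping range.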

Image Token Compression
• Reducing the number of image tokens based on the text prompt
QueCC (ICLR 2025, arXiv:2411.03312)
Compression ratios: 16x, 36x, 144x, 576x
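
To show what these ratios mean in token counts, the sketch below assumes a starting grid of 24 x 24 = 576 image tokens (an assumption, not stated on the slide) and reduces it by plain average pooling. Average pooling is a stand-in chosen for simplicity; it is not the query-conditioned method that QueCC uses.

```python
import numpy as np

def pool_tokens(tokens: np.ndarray, grid: int, factor: int) -> np.ndarray:
    """Average-pool a (grid*grid, d) token grid by `factor` along each side."""
    n, d = tokens.shape
    assert n == grid * grid and grid % factor == 0
    g = grid // factor
    # Rows are in raster order, so this reshape groups factor x factor neighbourhoods.
    x = tokens.reshape(g, factor, g, factor, d)
    return x.mean(axis=(1, 3)).reshape(g * g, d)

GRID = 24  # assumed: 24 x 24 = 576 tokens before compression
tokens = np.random.default_rng(0).normal(size=(GRID * GRID, 8))

for factor in (4, 6, 12, 24):
    out = pool_tokens(tokens, GRID, factor)
    print(f"{len(tokens) // len(out)}x -> {len(out)} tokens")
# 16x -> 36 tokens
# 36x -> 16 tokens
# 144x -> 4 tokens
# 576x -> 1 tokens
```

At 576x the whole image is represented by a single token, which is why conditioning the compression on the text prompt matters for keeping the relevant information.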

Project Open Se Cura – Edge and Cloud Collaborative
Computing
Extremely low power consumption
• Always on
• Ambient computing
Realizing large models everywhere
• Responsiveness
• Privacy (local & cloud)
• Computational resources
Cloud computing

Kelvin: A RISC-V ML Accelerator for Edge
Kelvin is a RISC-V based ML Accelerator
• Open-source design as part of Open Se Cura
• Provides a familiar framework for programming
ML kernels for experts with SIMD/GPU
experience
• Support for RISC-V Vector and Matrix
extensions is in development, targeting 256+
MACs/cycle
• Security extensions via CHERI are on our
roadmap
[Block diagram labels: SCALAR, ML, SIMD, SIMD, TCM]

VeriSilicon AI-Computing IP Product Lineup
Inferencing
Training
Inferencing
VIP9X00
(NPU IP)
CC9X00TC-MP
(GPGPU+NPU IP)
Embedded
Devices
Data Center
Server Chips
Edge Server
Chips
VIP9X00CC
(NPU+GPGPU IP)
VIP
Nano/PICO
Sub TOPS
Inferencing
Incremental
Training

High Efficiency Inference NPU for VLMs & LLMs
Target             Model    Parameters   NPU       Compute    Bandwidth
Embedded Devices   Qwen2    1.5B         VIP9000   4 TOPS     16 GB/s
AI-PC, Mobile      LLaMA2   7B           VIP9000   40 TOPS    128 GB/s
Edge Server        LLaMA3   70B          VIP9400   160 TOPS   512 GB/s
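
One way to read the model sizes against the bandwidth figures is a back-of-envelope bound on decode speed: if every generated token requires streaming all weights from memory once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below makes that calculation under loudly stated assumptions: the GB/s figures are memory bandwidth, weights are stored at 4 bits, and KV-cache traffic and compute limits are ignored. None of these assumptions, nor the resulting numbers, come from the slides.

```python
def decode_tokens_per_second(params_billion: float,
                             bandwidth_gb_s: float,
                             bits_per_weight: int = 4) -> float:
    """Upper bound on decode rate when each token streams all weights once."""
    weight_gb = params_billion * bits_per_weight / 8   # billions of params -> GB
    return bandwidth_gb_s / weight_gb

# (model, parameters in billions, bandwidth in GB/s) as listed above
configs = [("Qwen2 1.5B", 1.5, 16), ("LLaMA2 7B", 7, 128), ("LLaMA3 70B", 70, 512)]
for name, params, bw in configs:
    print(f"{name}: at most ~{decode_tokens_per_second(params, bw):.1f} tokens/s")
```

Under these assumptions the three configurations land in a similar range (roughly 15 to 37 tokens/s), which suggests the bandwidth figures scale with model size; measured throughput would differ.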

Summary and Challenges
Summary
• Tokenizers provide a framework for
building multimodal LLMs
• Distillation-based training can
create a gating mechanism that
separates tokenizers from the LLM
• Once separated, compute can be
distributed between embedded
devices and the cloud
Challenges
• Technical
• Memory and compute scaling for
tokenizers and LLMs
• Infrastructure for training
distributed models
• Ecosystem
• Changing model landscape
• Diverse hardware landscape
• Fostering community

Resources
Gemma
https://ai.google.dev/gemma
Project Open Se Cura
https://www.opensecura.googlesource.com
VeriSilicon NPU IP
https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP
2025 Embedded Vision Summit
Visit us at booth 508!
