“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, TensorFlow and Numpy,” a Presentation from OpenMV

Running Accelerated CNNs on Low-Power Microcontrollers Using Arm Ethos-U55, TensorFlow and Numpy
Kwabena W. Agyeman
President, OpenMV, LLC

What is OpenMV?
• Maker of the OpenMV Cam
• A low-power computer vision platform
• Integrates directly into products
• Or can be licensed for remixing
• What we do:
• Electrical and PCB design, manufacturing
• High-performance firmware programming
• Camera drivers, DMA, cache coherency, etc.
• SIMD computer vision algorithms, etc.
• Over 100K sold and licensed
© 2025 OpenMV, LLC

We make it easy to build a product
• Your application
• Provided: MicroPython, vector (SIMD) accelerated vision algorithms & NPU drivers, microcontroller support, camera sensors

Outline
• Market background – what’s happening with MCUs?
• Introduce the OpenMV Cam N6 and OpenMV AE3.
• Run ML workloads on microcontrollers using Numpy and TensorFlow.
• Multi-core low-power ML processing using MicroPython.

New AI microcontrollers are here
• Before:
• 600 MHz M7 CPU
• ~1.2 INT8 GOPS ML performance
• ~1 MB RAM on chip
• ~1.2 GB/s bandwidth
• ~66 MB/s FLASH access
• No MIPI CSI, ISP, NPU
Runs 224x224 YOLOv5 Nano at 0.4 FPS @ ~0.8 W
• Now:
• 400 MHz M55 CPU
• ~204 INT8 GOPS ML performance
• ~13 MB RAM on chip
• ~3.2 GB/s bandwidth
• ~200 MB/s FLASH access
• MIPI CSI, Helium-ISP, NPU
Runs 224x224 YOLOv5 Nano at 28 FPS @ ~0.25 W
> 200x Better

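The "> 200x Better" figure appears to follow from frames per second per watt rather than from raw throughput or frame rate alone. A quick check using only the numbers on this slide (the dictionary layout and helper name are illustrative):

```python
# Figures taken from the "Before" and "Now" columns of the slide.
before = {"gops": 1.2, "fps": 0.4, "watts": 0.8}
now = {"gops": 204.0, "fps": 28.0, "watts": 0.25}


def fps_per_watt(spec):
    """Frames per second delivered per watt of power."""
    return spec["fps"] / spec["watts"]


raw_gops_gain = now["gops"] / before["gops"]      # compute throughput only
raw_fps_gain = now["fps"] / before["fps"]         # frame rate only
efficiency_gain = fps_per_watt(now) / fps_per_watt(before)

print(f"GOPS gain:  {raw_gops_gain:.0f}x")
print(f"FPS gain:   {raw_fps_gain:.0f}x")
print(f"FPS/W gain: {efficiency_gain:.0f}x")
```

The frame rate alone improves about 70x and the INT8 throughput about 170x; only once the lower power draw is factored in does the ratio pass 200x (about 224x).
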
The market wave
• Running ~2-4 MB YOLO nano models at 30 FPS for < 1 W is now possible.
• Or ~8-10 MB YOLO small models at 10 FPS for < 1 W.
• With deep sleep power < 1 mW
• For years of application battery life
• Vision AI for everything, everywhere

Introducing the OpenMV AE3
• 400 MHz SIMD CPU
• 204 GOPS NPU
• 13 MB RAM
• 32 MB FLASH
• 1 MP color global shutter
• 30 FPS, 120 FPS @ VGA
• w/ mic, ToF, accel, gyro
• USB, WiFi, BLE
• GPIO: I2C, SPI, CAN, PWM
• Full power: 60 mA @ 5 V (0.25 W)
• Deepsleep: 500 µA @ 5 V (2.5 mW)
1” x 1”

And say hello to the OpenMV N6
• STM32N6 MCU
• 800 MHz SIMD CPU
• 600 GOPS NPU
• MIPI CSI w/ ISP
• JPEG & H.264
• 64 MB RAM @ 800 MB/s
• 32 MB FLASH @ 400 MB/s
• 1MP 120 FPS global shutter color camera
• 10/100/1000 ethernet
• 2.4 GHz WiFi, BLE V5.2
• USB HS 480 Mb/s
• UHS-I µSD card socket (behind camera)
• Mic and user RGB LED
• IMU and user button
• 3.7 V LIPO charger
• JTAG & SWD
• Full power: 150 mA @ 5 V (0.75 W)
• Deepsleep: 1 mA @ 5 V (5 mW)

NPU Accelerated TensorFlow +
NumPy Onboard =
Vector Accelerated Python Processing

There are a lot of models
The Problem
• So many vision models! How can you quickly support one?
• Quantized models may need tweaking too, custom output modifications and more! How to handle this?

NPU-accelerated TensorFlow Lite for Microcontrollers
OpenMV ML Framework
1. Load a model reference to execute in place from FLASH by the NPU.
2. Create a post-processing object which will receive the tensor output from the model.
3. Run inference using the NPU on image objects and post-process them in Python with Numpy.
Accepts a list of tensors and outputs a list of tensors for multi-modal inference.
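
The three steps can be sketched on a desktop with plain NumPy. Everything named here (StubModel, ThresholdPostProcess, the model path, the callback signature) is a hypothetical stand-in chosen for illustration, not the OpenMV API; the point is only the shape of the flow: a model handle, a post-processing object, and a list of tensors in and out.

```python
import numpy as np


class StubModel:
    """Hypothetical stand-in for an NPU-backed model handle."""

    def __init__(self, path):
        # Step 1: only a reference is kept. On the device the weights would
        # stay in flash and be executed in place by the NPU.
        self.path = path

    def predict(self, inputs, callback=None):
        # Step 3: accepts a list of tensors and produces a list of tensors.
        # The "inference" is a placeholder: mean and max of each input.
        outputs = [np.array([x.mean(), x.max()]) for x in inputs]
        return callback(self, inputs, outputs) if callback else outputs


class ThresholdPostProcess:
    """Step 2: a post-processing object that receives the output tensors."""

    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, model, inputs, outputs):
        # For every output tensor, keep only the values above the threshold.
        return [out[out > self.threshold] for out in outputs]


model = StubModel("/rom/model.tflite")                  # hypothetical path
image = np.arange(12, dtype=np.float32).reshape(3, 4)   # stand-in for an image
audio = np.ones(8, dtype=np.float32)                    # a second modality

raw = model.predict([image, audio])
kept = model.predict([image, audio], callback=ThresholdPostProcess(2.0))
print(len(raw), [k.tolist() for k in kept])
```

Passing two inputs of different shapes mirrors the multi-modal case: each input yields its own output tensor, and the post-processing object sees the whole list.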

Post-process with Numpy on MicroPython (1/2)
Arm Helium Accelerated Numpy
1. All YOLOv5 bounding box score outputs are thresholded at the same time using Arm Helium accelerated Numpy code!
2. Non-zero indices are then extracted to produce a new array of just the passing bounding boxes.
Arm Helium vector acceleration applied to Numpy can be reused by all ML code.
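
A minimal sketch of these two steps in desktop NumPy. It assumes a YOLOv5-style output with one row per candidate box and columns [cx, cy, w, h, objectness, class scores]; the sample values and the 0.5 threshold are made up, and the NumPy port on the device may expose a smaller set of functions than desktop NumPy.

```python
import numpy as np

# Assumed layout: one row per candidate box,
# columns = [cx, cy, w, h, objectness, class_0, class_1].
output = np.array([
    [0.50, 0.50, 0.20, 0.20, 0.90, 0.10, 0.80],
    [0.30, 0.30, 0.10, 0.10, 0.05, 0.60, 0.20],
    [0.70, 0.20, 0.30, 0.40, 0.75, 0.70, 0.10],
    [0.10, 0.80, 0.05, 0.05, 0.20, 0.30, 0.30],
], dtype=np.float32)

SCORE_THRESHOLD = 0.5  # illustrative value

# 1. Threshold every box score in a single vectorized comparison.
passing = output[:, 4] > SCORE_THRESHOLD

# 2. Extract the non-zero (True) indices, then gather only those rows.
indices = np.nonzero(passing)[0]
kept = output[indices]

print(indices.tolist())  # [0, 2]
print(kept.shape)        # (2, 7)
```

No Python-level loop touches individual boxes, which is what lets a vectorized backend do the work.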

Post-process with Numpy on MicroPython (2/2)
Finishing Up
• Numpy makes it easy to find the maximum class score index of every bounding box row in one line of code!
• Operations to extract the xmin, ymin, xmax, ymax of all bounding boxes are vectorized across all bounding box rows! As fast as C!
• Non-max suppression to filter overlapping bounding boxes is implemented in Python using Numpy too!
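
The three bullets can be sketched in desktop NumPy as follows. This is a generic greedy, class-agnostic non-max suppression written for illustration, not the code shown in the talk; the row layout [cx, cy, w, h, objectness, class scores], the sample boxes, and the 0.45 IoU threshold are assumptions.

```python
import numpy as np


def to_corners(boxes):
    """Convert [cx, cy, w, h] rows to [xmin, ymin, xmax, ymax] rows."""
    half = boxes[:, 2:4] / 2
    return np.concatenate((boxes[:, 0:2] - half, boxes[:, 0:2] + half), axis=1)


def nms(corners, scores, iou_threshold=0.45):
    """Greedy non-max suppression; returns indices of the boxes to keep."""
    order = np.argsort(scores)[::-1]  # best score first
    areas = (corners[:, 2] - corners[:, 0]) * (corners[:, 3] - corners[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box at once.
        xmin = np.maximum(corners[best, 0], corners[rest, 0])
        ymin = np.maximum(corners[best, 1], corners[rest, 1])
        xmax = np.minimum(corners[best, 2], corners[rest, 2])
        ymax = np.minimum(corners[best, 3], corners[rest, 3])
        inter = np.clip(xmax - xmin, 0, None) * np.clip(ymax - ymin, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop every remaining box that overlaps the best box too much.
        order = rest[iou <= iou_threshold]
    return keep


# Rows that already passed the score threshold.
rows = np.array([
    [0.50, 0.50, 0.20, 0.20, 0.90, 0.10, 0.80],
    [0.52, 0.50, 0.20, 0.20, 0.80, 0.20, 0.70],  # overlaps the first box
    [0.20, 0.20, 0.10, 0.10, 0.70, 0.90, 0.05],
])

class_ids = np.argmax(rows[:, 5:], axis=1)  # best class per row, one line
corners = to_corners(rows[:, 0:4])          # all rows converted at once
keep = nms(corners, rows[:, 4])

print(class_ids.tolist())  # [1, 1, 0]
print(keep)                # [0, 2]
```

The only Python loop runs once per kept box; the overlap test against all remaining boxes is a handful of array operations.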

Multi-core processing in MicroPython
using OpenAMP on the OpenMV AE3

Easy-to-use multi-core programming using OpenAMP
The dream
1. High-efficiency core runs an AI model on mic/IMU samples
2. Wake up the high-performance core on detection to process images
3. Transmit any detections to the cloud and go back to sleep

One Python script, two processors, two MicroPython VMs
What we’ve done
1. A Python function decorator is used to specify asyncio coroutines to run on the low-power core.
2. The callback running on the main core will receive messages from the asyncio coroutine.
• The low-power core runs multiple asyncio coroutines connected to multiple callbacks.
3. The main core starts the low-power core and enters its own main loop.

A processor and NPU for audio detection
46 GOPS available for a Wake Word Detector
1. The low-power core has its own MicroPython VM, stack, heap, 46 GOPS NPU, and mic.
2. The low-power core runs the Google MicroSpeech model to detect a keyword like “OK Google”.
3. The low-power core sends any detected label strings to the main core via the OpenAMP endpoint “ept”.

Which triggers NPU image processing
204 GOPS available for an Object Detector
1. The main core loads the YOLOv5 224 nano model reference from ROM to execute in place.
2. The main core wakes up when the low-power core sends the wake word.
3. If the wake word is “OK Google”, the main core takes a picture, runs YOLOv5 on it to detect objects, and transmits the results.
4. The main core then goes back to sleep.
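
Steps 2 to 4 amount to a small message handler on the main core. A minimal sketch, assuming the label arrives as a plain string; the function name and the three hooks (take_picture, detect, transmit) are hypothetical placeholders for the camera, the NPU object detector, and the network uplink.

```python
WAKE_WORD = "OK Google"  # label string used on the slides


def on_wake_message(label, take_picture, detect, transmit):
    """Run the image pipeline only when the wake word arrives."""
    if label != WAKE_WORD:
        return False           # ignore other labels and stay asleep
    image = take_picture()
    transmit(detect(image))
    return True                # the caller can now go back to sleep


sent = []
handled = [
    on_wake_message(
        label,
        take_picture=lambda: "frame",
        detect=lambda image: [("person", 0.9)],
        transmit=sent.append,
    )
    for label in ("silence", "OK Google")
]
print(handled, sent)
```

Keeping the handler free of hardware calls makes the trigger logic easy to exercise off the device.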

What will you create?
• The OpenMV AE3
• 1x 400 MHz Cortex-M55 w/ 204 GOPS NPU
• 1x 160 MHz Cortex-M55 w/ 46 GOPS NPU
• Five sensors:
• 1MP color global shutter camera
• 8x8 400 cm ToF distance sensor
• Accelerometer/gyroscope
• Microphone
• The accelerometer, gyroscope, and microphone remain accessible to the low-power core while the main core is in lightsleep().

Resources
• OpenMV Website: https://openmv.io
• OpenMV N6 Product Page: https://openmv.io/collections/cameras/products/openmv-n6
• OpenMV AE3 Product Page: https://openmv.io/collections/cameras/products/openmv-ae3
Visit us at Booth #909
