Real-time Inference

Discover how real-time inference with Ultralytics YOLO enables instant predictions for AI applications like autonomous driving and security systems.

Real-time inference is the process of using a trained machine learning (ML) model to make predictions on new, live data with minimal delay. In the context of AI and computer vision (CV), this means the system can process information—like a video stream—and generate an output almost instantaneously. The goal is to make the inference latency low enough that the results are immediately useful for decision-making. This capability is crucial for applications where timing is critical, transforming how industries from automotive to healthcare leverage AI.
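
Whether latency is "low enough" can be made concrete as a per-frame time budget. The following is a minimal sketch, assuming a video source with a fixed frame rate; the helper names (`frame_budget_ms`, `measure_latency_ms`, `meets_budget`) and the stand-in `dummy_predict` function are hypothetical and not part of any library.

```python
import time


def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def measure_latency_ms(predict, inputs, clock=time.perf_counter):
    """Call predict once per input and return each call's latency in ms."""
    latencies = []
    for x in inputs:
        start = clock()
        predict(x)
        latencies.append((clock() - start) * 1000.0)
    return latencies


def meets_budget(latencies_ms, fps: float) -> bool:
    """True if the slowest observed call still fits in the frame budget."""
    return max(latencies_ms) <= frame_budget_ms(fps)


def dummy_predict(frame):
    # Hypothetical stand-in for a real model call.
    return sum(frame)


frames = [[0.0] * 1000 for _ in range(20)]
latencies = measure_latency_ms(dummy_predict, frames)
print(f"budget at 30 FPS: {frame_budget_ms(30):.1f} ms")
print(f"worst latency:    {max(latencies):.3f} ms")
print(f"within budget:    {meets_budget(latencies, 30)}")
```

Checking the worst case rather than the average is a deliberate choice here: a system that is fast on average but occasionally misses its budget still drops or delays frames.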

Real-time Inference vs. Batch Inference

It is important to distinguish real-time inference from batch inference. The key difference lies in how data is processed.

Real-time Inference: Processes data as it is generated or received, typically one input or a small stream at a time. The priority is minimizing the delay (latency) between input and output. This is essential for interactive and time-sensitive systems.
Batch Inference: Involves collecting data over a period and processing it all at once in a large batch. This approach prioritizes maximizing throughput (the amount of data processed over time) rather than minimizing latency. Batch processing is suitable for non-urgent tasks like daily report generation or periodic analysis of large datasets.

While both use a trained model to make predictions, their use cases are fundamentally different based on the urgency of the results.
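
The latency versus throughput trade-off can be illustrated with simple arithmetic. This sketch assumes a hypothetical cost model in which each model call has a fixed overhead plus a per-item cost, and inputs arrive at a steady interval; the function name and all numbers are illustrative assumptions, not measurements.

```python
def batch_stats(batch_size, overhead_ms, per_item_ms, arrival_interval_ms):
    """Return (worst-case latency in ms, compute throughput in items/s).

    Assumed cost model: one call costs overhead_ms + batch_size * per_item_ms.
    """
    # The first item of a batch must wait for the remaining items to arrive.
    wait_ms = (batch_size - 1) * arrival_interval_ms
    compute_ms = overhead_ms + batch_size * per_item_ms
    worst_latency_ms = wait_ms + compute_ms
    # Items the model can process per second while it is busy.
    throughput_per_s = batch_size * 1000.0 / compute_ms
    return worst_latency_ms, throughput_per_s


# Illustrative numbers: 8 ms overhead, 2 ms per item, one input every 33 ms.
for size in (1, 32):
    latency, throughput = batch_stats(size, 8.0, 2.0, 33.0)
    print(f"batch={size:>2}  latency={latency:7.1f} ms  "
          f"throughput={throughput:6.1f} items/s")
```

Under these assumed numbers, a batch size of 1 gives 10 ms latency at 100 items/s, while a batch size of 32 raises throughput to roughly 444 items/s but pushes the worst-case latency to 1095 ms, since the earliest input waits for the batch to fill.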

Applications in the Real World

The ability to make instant decisions enables a wide range of powerful applications across various sectors.

Autonomous Systems: In self-driving cars, real-time inference is a matter of safety. Models must perform object detection to identify pedestrians, other vehicles, and road signs in milliseconds to navigate safely and avoid collisions. Similarly, drones and robots rely on it for navigation and interaction with their environment.
Smart Manufacturing: On a production line, cameras equipped with AI can perform real-time quality control. A model like Ultralytics YOLO11 can detect defects in products moving on a conveyor belt, allowing for their immediate removal. This is a core component of modern AI in manufacturing.
Interactive Healthcare: During a surgical procedure, a model could analyze live video from a camera to provide real-time guidance to the surgeon. In diagnostic settings, real-time medical image analysis can help doctors identify anomalies faster during live scans.
Smart Surveillance: Modern security systems use real-time inference to analyze video feeds and identify potential threats, such as unauthorized entry or abandoned packages, triggering immediate alerts. This moves beyond simple recording to active, intelligent monitoring.

Achieving Real-time Performance

Making models run fast enough for real-time computing applications often requires significant optimization:

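Model Optimization: Techniques like model quantization (reducing the numerical precision of model weights and activations, for example from 32-bit floating point to 8-bit integers) and model pruning (removing redundant weights or structures from the model) reduce computational load and memory usage, usually at the cost of a small drop in accuracy.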
Hardware Acceleration: Utilizing specialized hardware such as GPUs, TPUs (Tensor Processing Units), or dedicated AI accelerators on edge devices (e.g., NVIDIA Jetson, Google Coral Edge TPU) can dramatically speed up computations. Edge computing itself is crucial for processing data locally with minimal delay.
Efficient Inference Engines: An inference engine is software designed specifically to run trained models efficiently for prediction. Runtimes such as TensorRT, OpenVINO, and ONNX Runtime provide optimized execution paths for models that are typically trained in frameworks like PyTorch or TensorFlow and then exported.
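
To make the quantization idea above concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration of the general technique only, not how any particular runtime implements it; the function names and the sample weights are assumptions made for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale)
print("max reconstruction error:", max_error)
```

Each value is stored in 8 bits instead of 32, and the rounding error per weight is bounded by about half the scale. Production toolchains generally add calibration data and per-channel scales on top of this basic scheme.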

Models like Ultralytics YOLO are designed with efficiency and accuracy in mind, making them well-suited for real-time object detection tasks. Platforms like Ultralytics HUB provide tools to train, optimize (e.g., export to ONNX or TensorRT formats), and deploy models, facilitating the implementation of real-time inference solutions across various deployment options.
