
Making it easier to scale Kafka workloads with Cloud Run worker pools

June 26, 2025
Aniruddh Chaturvedi

Engineering Manager

Adam Kane

Senior Engineering Manager


Apache Kafka is vital to many event-driven architectures and streaming data pipelines. However, effectively scaling Kafka consumers — the applications processing data from Kafka topics — can be challenging.

Today, we’re excited to discuss two capabilities that make it more efficient and cost-effective to autoscale your Kafka consumer workloads on Cloud Run: Cloud Run worker pools (in public preview), and the open-source Cloud Run Kafka Autoscaler. We announced both of these capabilities at Google Cloud Next ’25.

The challenge: Scaling pull-based workloads

Kafka consumers operate on a “pull” model, where they actively fetch data from Kafka brokers. This architecture fundamentally differs from “push” systems, where data is sent to consumers. Consequently, metrics such as CPU utilization or incoming HTTP request throughput are not sufficient to determine processing demand. The true indicator of workload for a Kafka consumer is “offset lag”: the delta between the latest message offset available in a topic partition and the last offset committed by the consumer group for that partition.

Incorporating queue-aware metrics like offset lag (which reside in the Kafka broker) as an autoscaling input can minimize message backlogs and optimize resource utilization.
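To make the metric concrete, here is a minimal sketch of how a consumer group’s total offset lag can be computed with the kafka-python client. The broker address and consumer group name are placeholders, and this is purely an illustration, not the autoscaler’s implementation:

```python
# Illustrative only: compute a consumer group's total offset lag with kafka-python.
# BOOTSTRAP_SERVERS and CONSUMER_GROUP are placeholder values.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP_SERVERS = "my-broker:9092"
CONSUMER_GROUP = "my-consumer-group"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP_SERVERS)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP_SERVERS)

# Offsets the consumer group has committed, keyed by topic partition.
committed = admin.list_consumer_group_offsets(CONSUMER_GROUP)

# Latest offsets currently available in those partitions.
latest = consumer.end_offsets(list(committed.keys()))

# Offset lag per partition = latest available offset - last committed offset.
total_lag = sum(latest[tp] - meta.offset for tp, meta in committed.items())
print(f"Total offset lag for {CONSUMER_GROUP}: {total_lag}")

consumer.close()
admin.close()
```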

Cloud Run worker pools for pull-based workloads 

To solve the scaling challenge, you’ll first need an environment designed to run these pull-based workloads efficiently. This is where Cloud Run worker pools come in. They provide a purpose-built foundation for running Kafka consumers and other background processors, which was previously a challenging task on Cloud Run.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_The_three_main_Cloud_Run_resource_type.max-2000x2000.png

The three main Cloud Run resource types

While Cloud Run services are tailored for request-driven HTTP workloads and Cloud Run jobs for batch tasks that run to completion, worker pools are a distinct resource type well-suited for continuous, non-HTTP, pull-based background processing. They offer specific features that make them ideal for Kafka consumers: 

  • Designed for background processing: Unlike services, worker pools don't require public HTTP endpoints. This reduces the network attack surface and simplifies application code, as you no longer need to manage ports for health checks.
  • Gradual deployments with instance splitting: Worker pools use deployment strategies tailored for pull-based workloads. Since these workloads don't handle HTTP traffic, rollouts are managed by splitting instances between revisions, rather than splitting traffic. For example, for a worker pool with four instances, you can allocate 25% (one instance) to a new canary revision and 75% (three instances) to the current, stable revision.

  • Significant cost savings: With worker pools, we charge up to 40% less for CPU and memory, compared to instance-billed Cloud Run services.

Worker pools are available in the Google Cloud CLI (gcloud beta run worker-pools), as an official Terraform resource, and in the reorganized Google Cloud console interface:

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_iGowg61.max-1000x1000.png

The Cloud Run user interface with the new worker pool resource

Queue-aware autoscaling with Kafka Autoscaler 

While worker pools provide the right environment, you still need a mechanism to scale based on offset lag. The open-source Cloud Run Kafka Autoscaler is a tool you deploy that works with worker pools (or instance-billed services) to dynamically adjust consumer instances based on real-time demand.

It’s important to note that this is not a managed Google Cloud platform feature – it is an open-source tool that you control and deploy in your own project.

Key benefits: 

  • Scaling based on actual Kafka metrics: The autoscaler connects directly to your Kafka cluster to monitor the total offset lag across partitions in your consumer group, and can also factor in consumer CPU utilization.
  • Automatically scales consumers down to zero: This eliminates costs during idle periods.

  • Cost-effective: Deployed as a request-billed Cloud Run service, the autoscaler itself is very cheap to run (less than $1 per month), since it is only active for brief periods during scaling checks.
  • Fine-grained and configurable scaling behavior: The autoscaler offers precise control over scaling policies, similar to the Kubernetes Horizontal Pod Autoscaler (HPA), allowing you to tailor the scaling behavior to meet your specific cost and performance goals. It provides several configurable levers, including:
    • Target lag and CPU utilization thresholds
    • A stabilization window to prevent rapid fluctuations in instance counts
    • Scaling increment/decrement limits to control the rate at which instances are added or removed in a single scaling action

For a complete list of configuration options, please refer to the project documentation.
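As a rough illustration of how these levers fit together, the sketch below models them as a simple Python structure. The field names and default values are hypothetical, not the autoscaler’s actual configuration keys:

```python
from dataclasses import dataclass

# Hypothetical field names for illustration only; see the project
# documentation for the autoscaler's real configuration options.
@dataclass
class ScalingPolicy:
    target_lag_per_instance: int = 1000      # aim for <= 1000 unprocessed messages per instance
    target_cpu_utilization: float = 0.6      # optional CPU target (60%)
    stabilization_window_seconds: int = 300  # ignore short-lived dips before scaling down
    max_scale_up_step: int = 5               # add at most 5 instances per scaling action
    max_scale_down_step: int = 2             # remove at most 2 instances per scaling action
    min_instances: int = 0                   # allow scale-to-zero when idle
    max_instances: int = 20
```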

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_-_Cloud_Run_Kafka_Autoscaler_architectur.max-1600x1600.png

Cloud Run Kafka Autoscaler architecture diagram

Here’s how it works:

  1. Perform autoscaling check: Cloud Scheduler periodically triggers the autoscaler to initiate a scaling evaluation.
  2. Read Kafka offset lag: Once triggered, the autoscaler connects to the Kafka cluster to read offset lag, and (optionally) to Cloud Monitoring for the consumer’s CPU utilization.
  3. Make scaling decision and actuate: Based on the collected metrics and user-defined scaling policies, the autoscaler computes the optimal number of consumer instances and uses Cloud Run’s manual scaling API to dynamically adjust the instance count without a new deployment.
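To illustrate step 3, here is a simplified, HPA-style calculation that turns offset lag into a desired instance count, reusing the hypothetical ScalingPolicy sketch from above. It is not the autoscaler’s actual code, and the call that actuates the new count via the Cloud Run Admin API is omitted:

```python
import math

def desired_instances(current: int, total_lag: int, policy: ScalingPolicy) -> int:
    """Compute a new instance count from offset lag, HPA-style (illustrative only)."""
    if total_lag == 0:
        return policy.min_instances  # nothing to process; scale to zero if allowed

    # Proportional target: enough instances to keep per-instance lag at the target.
    target = math.ceil(total_lag / policy.target_lag_per_instance)

    # Limit how far a single scaling action can move the instance count.
    if target > current:
        target = min(target, current + policy.max_scale_up_step)
    else:
        target = max(target, current - policy.max_scale_down_step)

    # Clamp to the configured bounds.
    return max(policy.min_instances, min(policy.max_instances, target))

# Example: 12,500 lagged messages with a target of 1,000 per instance suggests
# 13 instances, limited here to current + max_scale_up_step.
print(desired_instances(current=4, total_lag=12_500, policy=ScalingPolicy()))  # -> 9
```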

Generalizing the pattern

The core architectural pattern of the Kafka autoscaler is simple: a Cloud Run service is periodically triggered to read custom metrics and adjust instance counts. This flexible model can be adapted for any pull-based workload, allowing you to scale your Cloud Run worker pools based on the metrics that matter most to your application.

If your application consumes from a different message queue or requires scaling based on your business metrics, you can build a similar dedicated autoscaler. Here are a few examples:

  1. Autoscaling self-hosted GitHub runners: Dynamically scale your pool of self-hosted runners based on the number of pending jobs in your CI/CD queue. This ensures your builds run without delay while minimizing costs by scaling down — even to zero — when runners are idle.

  2. Scaling on custom Prometheus metrics: Scale your worker pools based on any custom business metric you already expose in Prometheus, such as the number of items in a processing queue or active user sessions. This allows you to tie your infrastructure costs directly to real-time application demand.

  3. Processing a Pub/Sub backlog: Adjust the number of worker instances based on the number of undelivered messages in a Pub/Sub subscription. This ensures timely message processing during traffic spikes and saves money during quiet periods (sketched below).
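To make the Pub/Sub example concrete, the sketch below reads a subscription’s backlog from the built-in Cloud Monitoring metric and derives a desired instance count. The project ID, subscription name, and per-instance target are placeholders, and the actuation step is left out:

```python
import math
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"             # placeholder
SUBSCRIPTION_ID = "my-subscription"   # placeholder
MESSAGES_PER_INSTANCE = 500           # illustrative scaling target

def undelivered_message_count() -> int:
    """Read the subscription's backlog from the built-in Cloud Monitoring metric."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
    )
    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": (
                'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
                f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        return int(series.points[0].value.int64_value)  # most recent data point
    return 0

backlog = undelivered_message_count()
desired = math.ceil(backlog / MESSAGES_PER_INSTANCE)
print(f"Backlog: {backlog}, desired instances: {desired}")
# Actuation (setting the worker pool's instance count) is omitted from this sketch.
```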

Cloud Run worker pools and the Kafka Autoscaler bring a new level of flexibility and ease of use to running Kafka consumers on Cloud Run, and we’re excited to see what you do with them. To learn more and get started, check out the Cloud Run worker pools documentation and the open-source Cloud Run Kafka Autoscaler project.

If you are looking for a managed service for Apache Kafka, Google Cloud also offers Managed Service for Apache Kafka, which provides automated cluster management, Kafka Connect and schema registry (in Preview), and built-in Google Cloud monitoring, logging, and IAM for simplified operations.


We would like to thank the Google Cloud team members who helped with this blog post: Andrew Manalo (Software Engineer, Serverless Scaling), Sagar Randive (Product Manager, Serverless), and Matt Larkin (Product Manager, Serverless).
