opensource.google.com


Posts from June 2026

Introducing OpenRL: A self-hosted post-training API for fine-tuning LLMs

Thursday, June 11, 2026

We are pleased to share a research preview of OpenRL, a new open-source project coming out of GKE Labs. OpenRL is a self-hosted training API for fine-tuning LLMs on your own Kubernetes cluster.

Why we built it

If you look at agentic RL on LLMs, it is incredibly easy to get bogged down in system complexity. To run a single RL loop, you have to coordinate a dozen different things: selecting and cleaning datasets, choosing RL environments, debugging training loops, managing reward signals, handling inference mismatches, allocating hardware, and managing infrastructure. The picture looks something like this:

Figure shows an AI researcher and an infrastructure engineer staring at the hurdles in post training along the way to the summit.

Each of these is a hard problem. But what makes it more complex is how tightly AI research and infrastructure concerns are mixed together in today's tooling and frameworks.

We believe decoupling the infrastructure from AI research can make these problems more tractable, so that infrastructure engineers and AI researchers can tackle them independently. We have seen this pattern with Kubernetes, which abstracted away the infrastructure and made life easier for application developers and SREs.

So, can you abstract out post-training infrastructure? We believe so, and we drew huge inspiration and validation from Tinker (from Thinking Machines). The Tinker APIs for post-training hit that Goldilocks zone, hiding all the post-training infrastructure behind four key APIs:

Figure shows high-level components and their interaction in an OpenRL-based RL workflow.

So the end result of this abstraction is that AI researchers get full flexibility over their RL loop and infrastructure engineers can focus on scaling, orchestration, and reliability. OpenRL lets you run the same training APIs on your own infrastructure. This decoupling has other interesting benefits.

Sharing GPUs

Traditional RL loops are strictly sequential. The trainer waits for the sampler to finish rollouts, the sampler waits for the environment to score rewards (which is often bound by slow CPU/network tasks), and the whole loop sits blocked. Your expensive GPUs spend a lot of time doing nothing. The abstraction allows running multiple RL jobs and allows infrastructure engineers to pack the training/sampling steps to utilize more of their GPUs. The graph below shows the GPU consumption in OpenRL for running one, two, and three RL jobs concurrently.

The figure shows the trainer/sampler duty cycle in OpenRL for scenarios with 1 RL job, 2 RL jobs, and 3 RL jobs respectively.
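As a back-of-the-envelope illustration of why packing helps, the sketch below models one strictly sequential RL step as train, then sample, then environment scoring, and assumes several identical jobs can be interleaved perfectly on a shared trainer pool and a shared sampler pool. The function name `duty_cycles` and the phase durations are invented for illustration; they are not OpenRL measurements or OpenRL APIs.

```python
def duty_cycles(train_s, sample_s, env_s, jobs):
    """Idealized GPU busy fractions when `jobs` identical RL loops are
    interleaved on one shared trainer pool and one shared sampler pool."""
    # One strictly sequential RL step: train -> sample -> score rewards.
    step_s = train_s + sample_s + env_s
    return {
        # Each job keeps the trainer busy for train_s out of every step_s.
        "trainer": min(1.0, jobs * train_s / step_s),
        # Each job keeps the sampler busy for sample_s out of every step_s.
        "sampler": min(1.0, jobs * sample_s / step_s),
    }


# Hypothetical phase durations in seconds (assumptions, not measurements).
for jobs in (1, 2, 3):
    cycles = duty_cycles(train_s=10.0, sample_s=20.0, env_s=70.0, jobs=jobs)
    print(jobs, {name: f"{value:.0%}" for name, value in cycles.items()})
```

With these made-up numbers, the trainer and sampler go from 10% and 20% busy with one job to 30% and 60% busy with three. Real gains depend on how well the scheduler can overlap the phases.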

Better UX

Once you separate out the infrastructure behind the APIs, you start to see gains in the user experience of developing the RL loop, because AI researchers no longer have to wrangle complex Python dependencies such as CUDA. When you are doing R&D, you do not have to run the RL loop directly on the machines with GPUs; you can simply run your RL loop on your Mac, pointing to the training APIs running on a Kubernetes cluster or on VMs.

Autoresearch

We believe that frontier AI research will get more and more automated in the future, and abstracting out infrastructure as a building block is key to that. To demonstrate this, we added an autoresearch recipe inspired heavily by Karpathy's work. The recipe demonstrates how to run parallel experiments that perform a parameter sweep and improve the reward signal for our text-to-SQL recipe for Gemma models.

Figure showing autoresearch UI with multiple AI researchers conducting experiments in parallel in OpenRL

What OpenRL is not

  • A managed service. OpenRL is self-hosted and not a managed service. We aim to make it easy for users to deploy and operate it on their Kubernetes clusters.
  • An RL framework. OpenRL gives AI researchers full control over their RL loop.

Get started

We have made it easy to run OpenRL on your Mac, on NVIDIA GPUs, or on GKE. This allows you to test your RL loop on a Mac and, when you are ready to scale, point the RL loop to the OpenRL endpoint running in the GKE cluster.

Try out our text-to-SQL example for teaching the latest Gemma model SQL here: guides.

One of the benefits of a Tinker-compatible endpoint is that you can use the Tinker Cookbook with OpenRL. The Tinker Cookbook is one of the best resources for post-training infrastructure for RL.

Future steps

We have started with a simple architecture focussing on LoRA fine-tuning and plan to evolve the project in the coming months, so please give it a try and share your feedback. A few things we are very excited to work on:

  • Full parameter fine-tuning
  • Multitenancy (simultaneous RL on different types of base models)

Acknowledgement

We have been inspired by the work done by various open source projects in AI communities, so a huge thank you to Thinking Machines, vLLM, PyTorch, prime-rl, verl, SkyRL, and llm-d.

Google joins the Eclipse Foundation as a strategic member to accelerate AI-integrated developer tools

Wednesday, June 10, 2026

A simple image with the Google logo, a plus sign, and the Eclipse Foundation logo

Collaboration with the Eclipse Foundation will support open infrastructure for AI-integrated developer platforms like Google Antigravity, while advancing broader open source security and regulatory compliance initiatives

As of April 2026, Google has joined the Eclipse Foundation as a Strategic Member, reflecting the company's continued investment in open source technologies and modern developer infrastructure.

As part of this collaboration, Google will additionally sponsor Open VSX and is among the first adopters of the recently announced Open VSX Managed Registry service. Open VSX is the open source, vendor-neutral extension registry for tools built on the VS Code™ extension API. It powers a rapidly growing ecosystem of AI-integrated IDEs, cloud development environments, and developer platforms, including Google Antigravity, AWS's Kiro, Cursor, and Windsurf, among many others.

As a Strategic Member, Google will participate in the Eclipse Foundation's Board of Directors and Technical Advisory Council, helping guide the technical and strategic direction of one of the world's leading open source software foundations.

"The industry is feeling the massive turning point as AI continues to change how developers write, deploy, and maintain software," said amanda casari of Google's Open Source Programs Office and new Eclipse Board member. "Joining The Eclipse Foundation as a Strategic Member ensures that the next generation of AI-integrated developer experiences—including platforms like Google Antigravity—are built in partnership with transparent, vendor-neutral foundations. Open registries, like Open VSX, are critical infrastructure which keep the global developer ecosystem open to everyone."

Google and the Eclipse Foundation share a deep history, having collaborated across numerous initiatives since 2006. This Strategic Membership elevates the relationship and provides support critical to modern initiatives like Open VSX, Open Regulatory Compliance (ORC), and Adoptium.

"Google has played a pivotal role in open source innovation for two decades," said Mike Milinkovich, Executive Director of the Eclipse Foundation. "Their decision to join as a Strategic Member reflects the growing importance of open collaboration in supporting global regulatory compliance efforts, strengthening open source infrastructure, securing supply chains, and advancing the next generation of AI-integrated developer platforms."

The Eclipse Foundation continues to see explosive growth as adoption accelerates across AI-integrated developer tooling and cloud development environments. The Open VSX registry now scales to meet massive global demand:

  • 300 million+ downloads per month
  • 200 million requests during peak daily traffic
  • 12,000+ hosted extensions from over 8,000 publishers

Unlocking TPU performance: Deep kernel profiling with XProf

Monday, June 8, 2026


As machine learning workloads scale to unprecedented heights, developers are increasingly writing highly specialized Tensor Processing Unit (TPU) kernels using frameworks like Pallas, Mosaic, and Triton to maximize hardware performance.

However, customizing high-performance kernels has historically introduced a major engineering challenge: optimization blind spots. To legacy performance profilers, custom compilation paths appear as opaque execution paths. Developers are left with single, massive execution blocks in their trace captures, lacking granular visibility into what is actually occurring inside the chip's internal components. Did a vector processing instruction stall? Was matrix math idle due to data loading bottlenecks?

Traditional profiling relies heavily on compile-time static cost models to estimate kernel efficiency. While helpful for standard operations, these models cannot capture dynamic runtime realities like instruction execution stalls, memory subsystem congestion, or hardware scheduling conflicts.

To open this opaque execution path, we are excited to introduce the Kernel Profiling suite in XProf—a low-level hardware debugging suite engineered specifically for Pallas kernel authoring and optimization on Google TPUs. By combining static compilation tracking with dynamic, sub-microsecond hardware telemetry, XProf Kernel provides the deep transparency required to optimize high-scale ML workloads.

Deep visibility: HLO Graphs & MLIR Inspection

The first step in debugging any custom kernel is understanding how your high-level code is translated by the compiler. When compiling a JAX or PyTorch model, the compiler generates a High-Level Optimizer (HLO) graph. Previously, custom calls inside these graphs remained completely obscured.

XProf's updated Graph Viewer resolves this by exposing the internal compilation logic of these custom regions directly. To unlock this deep visibility, developers must pass the appropriate debug flags to the XLA compilation environment:
--xla_enable_custom_call_region_trace=true
--xla_xprof_register_llo_debug_info=true
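
One common way to hand flags to XLA from Python is the `XLA_FLAGS` environment variable, set before the framework is imported. The sketch below assumes that route works for these two flags; depending on your runtime, TPU-specific flags may instead need to go through a different variable such as `LIBTPU_INIT_ARGS`. The helper name `append_xla_flags` is made up for this example.

```python
import os

# The two debug flags quoted above.
KERNEL_DEBUG_FLAGS = [
    "--xla_enable_custom_call_region_trace=true",
    "--xla_xprof_register_llo_debug_info=true",
]


def append_xla_flags(flags, env=os.environ, var="XLA_FLAGS"):
    """Append flags to an environment variable without duplicating entries."""
    existing = env.get(var, "").split()
    merged = existing + [flag for flag in flags if flag not in existing]
    env[var] = " ".join(merged)
    return env[var]


# Usage (assumption: call this before `import jax` so the compiler sees it):
# append_xla_flags(KERNEL_DEBUG_FLAGS)
```

Keeping the helper idempotent means it can be called from several entry points without growing the variable on every call.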

Once these flags are active, any trace captured via XProf includes comprehensive compiler metadata. In the XProf Graph Viewer, clicking on a custom-call block reveals an interactive panel titled "Custom Call Text." This displays the raw, lowered MLIR (Multi-Level Intermediate Representation) code generated by the compiler.

Figure 1: XProf interface displaying an HLO graph, with a "Custom Call Text" panel to reveal raw MLIR code

By displaying the MLIR text side-by-side with high-level source-code representations, developers can immediately verify whether the compiler is correctly fusing operations and structuring memory tiles as intended.

Tracing Instrumented Low-Level Operations (LLO) Analysis

To provide cycle-level execution visibility, XProf exposes Low-Level Operations (LLO) bundle data directly inside the Trace Viewer. An LLO bundle represents the actual machine instructions issued to the TPU core's functional units during every clock cycle.

Through dynamic instrumentation, XProf inserts hardware markers exactly when an LLO bundle region executes. Within the Trace Viewer, this manifests as dedicated, time-aligned execution tracks representing the TPU bundle's slot utilization metrics from static analysis:

  • MXU (Matrix Multiply Unit): Tracks active, busy cycles of high-throughput matrix-multiplication pipelines.
  • Scalar and Vector ALUs: Displays the execution profile of mathematical operations, letting you spot pipeline imbalances.
  • Vector Fills, Loads, Spills, and Stores: Exposes HBM-to-register data movement, critical for identifying bandwidth-throttling bottlenecks.
  • XLU (Cross-Lane Unit): Monitors collective communications and data shuffling across physical TPU cores.
Figure 2: XProf Capture Profile trace viewer interface showing dynamic hardware execution tracks

Runtime Performance Counter Sampling

While static analysis effectively verifies instruction counts or vector store logic, it remains detached from the dynamic realities of runtime execution. To bridge this gap, XProf introduces fine-grained, periodic performance counter sampling—available starting with TPU v7 (Ironwood). This capability empowers developers to move beyond static estimation and measure precisely how hardware blocks are utilized in real-time, providing the empirical ground truth needed to identify whether compute units are truly active or stalled by memory subsystems.

Consider the optimization of a tiled matrix multiplication (Matmul) kernel. While a static trace might indicate a logically perfect sequence of operations, real-world performance often falters if the Matrix Multiply Unit (MXU) sits idle while awaiting data from High-Bandwidth Memory (HBM). To diagnose and resolve such bottlenecks, developers can utilize a structured three-step profiling workflow:

  1. Set up the Profiling Environment: Configure the TPU v7 (Ironwood) runtime by defining specific hardware counters—such as scalar issues or synchronization waits.
  2. Capture a Kernel Profile: Use the XProf request interface to capture fine-grained performance counters, which can then be visualized as a time-series within the Trace Viewer.
  3. Interpret the Data: Analyze the resulting counters to distinguish between a Memory-Bound Scenario (characterized by massive spikes in sync_wait) and an Optimized Scenario. For instance, implementing triple buffering to overlap memory loads with MXU compute can reduce runtime from 125.5µs to 88µs—a ~30% performance gain validated by a drastic reduction in synchronization events.
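
The improvement quoted in step 3 can be checked with a couple of lines of arithmetic. The helper names below are illustrative only; the two runtimes are the ones given above.

```python
def runtime_reduction(before_us, after_us):
    """Fraction of the original runtime that was eliminated."""
    return (before_us - after_us) / before_us


def speedup(before_us, after_us):
    """How many times faster the optimized kernel runs."""
    return before_us / after_us


before_us, after_us = 125.5, 88.0  # runtimes quoted in the text
print(f"runtime reduction: {runtime_reduction(before_us, after_us):.1%}")
print(f"speedup: {speedup(before_us, after_us):.2f}x")
```

This gives a runtime reduction of about 29.9%, which matches the "~30%" figure, or equivalently a speedup of roughly 1.43x.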

By shifting from static code inspection to empirical runtime telemetry, hardware behavior explicitly validates optimization strategies, ensuring every cycle on the silicon is spent productively. For a hands-on example to check out these techniques, please explore our Pallas Matmul w/ Perf Counters demo.

Figure 3: XProf timeline highlighting a comparison between a detailed "Runtime Perf Counter" section sampling at a 1-microsecond frequency and a "Static LLO Region" track below it

Visualizing the "Utilization Gap"

This dynamic tracking exposes the significant gap left by traditional static analysis tools. A static tool analyzes instructions linearly, completely ignoring time. It might flag an MXU instruction block as "100% Utilized."

In contrast, XProf plots actual hardware execution over time. You might discover that a long-running Scalar ALU operation is stalling the entire execution pipeline, leaving the powerful MXU completely idle. By visualizing these temporal idle gaps, developers can adjust data shapes, memory alignments, and instruction sequencing to maximize compute density.

STATIC ESTIMATION:
[========== Block Execution: MXU Flagged 100% Utilized ==========]

XPROF REAL-WORLD TIMELINE:
├─ [Scalar ALU (Active)] ─┼─ [MXU (Active)] ─┼── [MXU (Idle / Memory Stall)] ──┤
│ Stalling pipeline...     │ Compute phase     │ Starved; waiting for HBM Load    │
Figure 4: The UI shows the active TPU Core functional unit tracks (MXU, Scalar ALU, Vector ALU, and memory data pipelines) aligned side-by-side with the active framework Ops, exposing exact execution times and real-time idle cycles.

Overall Utilization from Performance Counters

Navigating profiling metrics can be daunting. Relying on metrics calculated via compile-time cost models often misrepresents performance when applied to custom compilation paths. To solve this, XProf establishes a clear Hierarchy of Trust:

                  ┌───────────────────────────────┐
                  │     Absolute Ground Truth     │
                  │  (HBM, Hardware Registers,    │ (100% Trustworthy)
                  │       TPO Metrics, CSRs)      │
                  └───────────────┬───────────────┘
                                  ▼
                  ┌───────────────────────────────┐
                  │       Estimated Metrics       │
                  │   (Program Optimal FLOPs,     │ (Requires caution with
                  │      Goodput Efficiency)      │  custom compiling paths)
                  └───────────────────────────────┘
Figure 5: Hierarchy of Metrics
  1. The Absolute Ground Truth (100% Trustworthy): Metrics derived directly from physical hardware registers (HBM utilization, TPO metrics, unprivileged hardware stats). When profiling custom kernels, these represent physical reality and should be your primary optimization anchors.
  2. Estimated Metrics (Use with Caution): Metrics like "Compared to program optimal FLOPS" or "Goodput efficiency" rely on XLA cost models. Because custom compilation paths bypass standard passes, these metrics can be highly skewed or outright non-functional.

For the unvarnished truth, XProf exposes the Perf Counters View, providing direct, tabular access to over 16,000 raw hardware counters read straight from the TPU silicon.

A screenshot of the XProf Perf Counters tabular view, displaying a list of unprivileged hardware counters alongside their corresponding raw decimal and hexadecimal values
Figure 6: XProf Perf Counters Tabular View

Understanding Trace Tracks: The height of a trace track does not represent a normalized 0-100% scale. It represents the maximum raw counter value observed in that interval. For example, if a counter increments by 100 cycles over a 500-nanosecond trace window (1,000 clock cycles on a 2.0 GHz core), it indicates 10% physical utilization of that unit.
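
The conversion from a raw counter value to a utilization figure is simple enough to script. This is a minimal sketch that assumes the counter counts busy cycles and that you know the core clock; the function name `unit_utilization` is hypothetical.

```python
def unit_utilization(counter_delta_cycles, window_ns, clock_ghz):
    """Fraction of the sampling window during which the unit was busy."""
    # GHz is cycles per nanosecond, so this is the window length in cycles.
    window_cycles = window_ns * clock_ghz
    return counter_delta_cycles / window_cycles


# The example from the text: 100 busy cycles in a 500 ns window at 2.0 GHz.
print(f"{unit_utilization(100, 500, 2.0):.0%}")
```

Normalizing this way makes tracks from windows of different lengths comparable, which the raw track height alone is not.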

To configure runtime performance counter sampling and capture a profile with it, please follow the OpenXLA Kernel Profiling Instructions.

Advanced Sampling: Event-Triggered Profiling

Previously, dynamic capturing was limited to Periodic Sampling Mode—polling counters based on a host-level timer, which hit a physical resolution floor of 1 microsecond.

           CORE 0           CORE 1           CORE 2           CORE 3
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │  28 Counters │ │  28 Counters │ │  28 Counters │ │  28 Counters │
      └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
      └─────────────────────────────────────────────────────────────────┘
                            4 x 28 Sparse Matrix
Figure 7: Sparse Matrix Configuration

To capture lightning-fast hardware cycles, XProf now supports External Event-Triggered Mode. The dynamic sampler intercepts physical TPU trace instructions and boundary triggers (such as entering/exiting custom call scopes), allowing for sub-microsecond capture latency and precise attribution.

Developers can configure up to 28 hardware counters per core, distributed across up to four active SparseCores, creating a 4 x 28 profiling matrix that maximizes data variety while protecting workload performance.

Activating this is straightforward through the JAX profiler options:

import jax

options = jax.profiler.ProfileOptions()

# Example request for externally triggered collection
options.advanced_configuration = {
    "tpu_enable_periodic_counter_sampling": True,
    "tpu_tc_perf_counter_sampling_options": (
        "is_external_trigger:true scaling:0 counter_size_bits:1 "
        "indices:10 indices:11 indices:56 indices:57 indices:58"
    ),
}

# For periodic sampling, please use interval_us instead of is_external_trigger.
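
Because the sampling options are passed as a single space-separated string, it can help to build that string programmatically and enforce the 28-counters-per-core limit up front. The helper below is a hypothetical convenience, not part of JAX or XProf. It reproduces the string format shown above, and it assumes that periodic mode is written as `interval_us:<value>` by analogy with the other fields.

```python
MAX_COUNTERS_PER_CORE = 28  # per-core limit quoted in the text


def build_sampling_options(indices, interval_us=None, scaling=0,
                           counter_size_bits=1):
    """Build the space-separated counter sampling options string.

    With interval_us=None the request uses external-trigger mode; otherwise
    it uses periodic mode (the `interval_us:<n>` syntax is an assumption).
    """
    indices = list(indices)
    if len(indices) > MAX_COUNTERS_PER_CORE:
        raise ValueError(f"at most {MAX_COUNTERS_PER_CORE} counters per core")
    if len(set(indices)) != len(indices):
        raise ValueError("counter indices must be unique")
    if interval_us is None:
        parts = ["is_external_trigger:true"]
    else:
        parts = [f"interval_us:{interval_us}"]
    parts += [f"scaling:{scaling}", f"counter_size_bits:{counter_size_bits}"]
    parts += [f"indices:{index}" for index in indices]
    return " ".join(parts)


print(build_sampling_options([10, 11, 56, 57, 58]))
```

The returned string can then be assigned to the `tpu_tc_perf_counter_sampling_options` key of `advanced_configuration`.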

Getting Started

Ready to transition from guessing performance to measuring and optimizing the physical limits of your ML silicon? Explore these open-source resources to get started with XProf Kernel today:

Journey to JPEG XL: How open source experiments shaped the future of image coding

Wednesday, June 3, 2026

Building the Next Generation Image Standard

The internet runs on images. Since the early days of the web, there has been a relentless tension between visual fidelity and bandwidth. For decades, the industry relied on the venerable JPEG standard to make images load fast. It served us remarkably well, but as displays moved to High Dynamic Range (HDR) and Wide Color Gamut (WCG), the format began to show its limits.

The road to JPEG XL (JXL) wasn't a straight line. It was a decade-long exploration, creating a series of milestone projects testing radical ideas in psychovisual modeling, entropy coding, and optimization. Today, as JPEG XL sees rapid adoption across operating systems and professional standards, we’re looking back at the experiments that made it possible.


The Early Foundation: 2011–2017

Our study began with a focus on understanding the limits of existing technology. We didn't start by trying to write a new standard; we started by trying to make the current ones better, and learning their limitations. This allowed us to make the new formalism more flexible and efficient in the right places.

  • WebP Lossless and Brotli: While lossy WebP drew its lineage from video technology, WebP Lossless (2011) represented an architectural and scoping departure. We debuted the entropy image concept, an innovative method that uses a secondary image to orchestrate the selection of static entropy codes for the primary visual data. We later reapplied this approach with data-driven context modeling in the Brotli compression format, enabling rich context modeling without slowing decoding.
  • Butteraugli: Around 2014, we realized that raw mathematical metrics (PSNR) weren't enough, and that simple psychovisual approximations (SSIM and similar) failed in color-rich environments. We built Butteraugli and the XYB color space to mimic the human visual system's edge detection and opponent-color processes at varying scales, allowing us to compress images more effectively.
  • Guetzli and Brunsli: We pushed the legacy JPEG 1 standard (ISO/IEC 10918, introduced in 1992) to its absolute limits through two key projects, Guetzli and Brunsli. These initiatives provided invaluable insights into the strengths and limitations of traditional JPEG compression methods. Guetzli (2016) is a slow, high-density perceptual encoder that uses Butteraugli to find the optimal quantization tables, making legacy JPEGs 20-30% smaller. Brunsli (2015), meanwhile, focuses on lossless recompression, allowing users to repack existing JPEGs into a smaller footprint without losing a single bit of original data. After finishing JPEG XL standardization, we returned to Guetzli's scope in 2024 and made the encoding much faster and HDR-compatible in Jpegli.
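
To give a feel for the entropy image idea mentioned in the first bullet, here is a deliberately tiny sketch: the image is split into blocks, each block is costed against a handful of static probability tables, and a low-resolution grid records which table is cheapest for each block. This is a toy illustration with invented names and data, not the WebP Lossless bitstream or encoder logic.

```python
import math


def block_cost_bits(symbols, probs):
    """Ideal code length of the symbols under one static probability table."""
    return sum(-math.log2(probs[symbol]) for symbol in symbols)


def build_entropy_image(pixels, block, codes):
    """Return a low-resolution grid that picks one static code per block."""
    rows, cols = len(pixels), len(pixels[0])
    entropy_image = []
    for by in range(0, rows, block):
        grid_row = []
        for bx in range(0, cols, block):
            symbols = [pixels[y][x]
                       for y in range(by, min(by + block, rows))
                       for x in range(bx, min(bx + block, cols))]
            costs = [block_cost_bits(symbols, probs) for probs in codes]
            grid_row.append(costs.index(min(costs)))  # cheapest code wins
        entropy_image.append(grid_row)
    return entropy_image


# Two hypothetical static codes over a binary alphabet.
codes = [{0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}]
pixels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
print(build_entropy_image(pixels, block=2, codes=codes))
```

Here the left block is all zeros and the right block is all ones, so the grid selects the first code for the left and the second for the right. A decoder only needs a table lookup per block, which is why this style of context selection stays fast to decode.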

The feedback from these launches, ranging from the technical details of WebP Lossless to the psychovisual audits of Guetzli, proved indispensable. While we already targeted the highest visual fidelity, feedback from detail-critical e-commerce helped us to refine the requirements.


The Convergence: 2017–2019 PIK Era and the 2019 FUIF Integration

By 2017 we had powerful separate tools and it was time to fuse them. In open sourcing PIK we combined the efficiency of Brunsli with the psychovisual optimizations of Guetzli. Further, PIK introduced a real adaptive quantization field and other optimizations. PIK formed our proposal to the ISO standardization body. The committee's final call for proposals pushed toward extreme density, requiring bit rates as low as 0.06 BPP, equivalent to 35 times the compression of internet-quality images and 80 times that of camera output. This expansion of scope necessitated a significant complexification of the format and the encoder, leading to the Variable-block-size Discrete Cosine Transform (VarDCT) architecture that remains central to JPEG XL today.
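
To put the 0.06 BPP target in perspective, the ratios above imply roughly 2.1 BPP for internet-quality images and 4.8 BPP for camera output. The sketch below turns those rates into file sizes for a hypothetical 4000 x 3000 (12-megapixel) image; the image dimensions and helper name are assumptions chosen for illustration.

```python
def compressed_size_bytes(width, height, bits_per_pixel):
    """File size implied by a bit rate, ignoring container overhead."""
    return width * height * bits_per_pixel / 8


TARGET_BPP = 0.06               # lowest bit rate in the call for proposals
internet_bpp = TARGET_BPP * 35  # implied internet-quality rate
camera_bpp = TARGET_BPP * 80    # implied camera-output rate

for label, bpp in [("target", TARGET_BPP),
                   ("internet quality", internet_bpp),
                   ("camera output", camera_bpp)]:
    size_kb = compressed_size_bytes(4000, 3000, bpp) / 1000
    print(f"{label}: {bpp:.2f} BPP, about {size_kb:,.0f} kB")
```

For this example image, that is about 90 kB at the target rate, versus roughly 3.2 MB at internet quality and 7.2 MB straight from a camera.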

We proposed to merge our PIK proposal with the FUIF (Free Universal Image Format) proposal from Cloudinary. PIK used Brotli-style distribution selection at encoding time, while FUIF refined codes incrementally during decoding. The final JPEG XL standard became a best-of-both-worlds compromise: we used PIK's faster-to-decode distribution selection with FUIF's sophisticated context trees. The merger represented a departure from conventional single-platform-driven standardization and prioritized technical synergy and collaboration.

A flowchart titled 'Building Blocks of the JPEG XL Standard' showing a left-to-right progression across three periods. The first period, 'Early Building Blocks (2011-2017)', contains four boxes: WebP Lossless & Brotli, Butteraugli & XYB, Guetzli, and Brunsli. Arrows point from these early technologies into the second period, 'The Convergence (2017-2019)', which consists of two main boxes: PIK and FUIF. Finally, multiple lines flow from both PIK and FUIF, converging into the third period, 'Final Standard'. This final section features a large orange box labeled 'JXL: JPEG XL Standard', which is described as merging PIK's distribution selection with FUIF's context trees.

JPEG XL Today: An Ecosystem Takes Root

JPEG XL's efficiency, psychovisually optimized quality, file size, and coding speed are being noticed. We are seeing bottom-up adoption in various industries, with the most demanding fields leading the way. Because of its ability to handle high-bit-depth, high-quality, and even lossless data efficiently and robustly, JPEG XL has become foundational in several fields:

  • Photography: Used in Digital Negative (DNG 1.7), Apple's ProRAW, and others.
  • Medical: Adopted by DICOM, the international standard for medical images.
  • Publishing: Integration into future versions of the PDF and EPUB standards.

The ecosystem has been maturing rapidly. Adobe's photography software, Apple's iOS, macOS, and visionOS have native support, as do Linux distributions like Ubuntu and Microsoft's JPEG XL Image Extension for Windows. Our libjxl-tiny inspired Shikino High-Tech, Inc. and CAST to release the first commercial JPEG XL encoder IP core for ASIC and FPGA designs, aimed at real-time, low-power image capture. Safari (2023) led among major browsers, while Firefox and Chrome currently maintain experimental support.

Two men in a bright office collaborating at a whiteboard. The board contains a hand-drawn flowchart titled 'VARDCT BLOCK JOINING STRATEGY'. The diagram illustrates small square blocks combining into larger patterned rectangles, connected by arrows. Text labels in the flowchart include 'Decision Logic: Rate-Distortion Cost', 'Merging Criteria', 'Entropy Coding Efficiency', 'Neighboring Blocks', and 'Variable Block Sizes'. The man on the left is pointing to the bottom left of the diagram, while the man on the right, who has long hair and a beard, is writing a mathematical equation on the board with a marker.
JPEG XL design was not only countless hours of optimization, experimentation and eye-balling the results, but also creative discussions at a whiteboard. In this Gemini-reconstructed scene, Luca Versari and Jyrki Alakuijala (left-to-right) debate VarDCT block selection heuristics.

Looking Forward

The story of JPEG XL stands as a testament to the efficacy of long-horizon planning validated by intermediate functional milestones—with minimum-viable prototypes like Guetzli and practical tools like Brunsli and Brotli—that invite feedback from the open-source community. A small research team can innovate by crystallizing solutions through quick iterations, with thousands, if not tens of thousands, of experiments in psychovisual modeling, entropy, coding speed and complexity, and the entire industry can eventually navigate toward a more efficient, beautiful future.

We started by trying to squeeze a few more bytes out of the 1992 JPEG 1 standard; with JPEG XL we hope to have established a foundation for digital imaging that can last for the next three decades.
