HUD gives you three things: a unified API for every model, a way to turn your code into agent-callable tools, and infrastructure to run evaluations at scale.

Install

# Install CLI
uv tool install hud-python --python 3.12

# Set your API key
hud set HUD_API_KEY=your-key-here
Get your API key at hud.ai/settings/api-keys.

1. Gateway: Any Model, One API

Stop juggling API keys. Point any OpenAI-compatible client at inference.hud.ai and use Claude, GPT, Gemini, or Grok:
from openai import AsyncOpenAI
import os

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"]
)

response = await client.chat.completions.create(
    model="claude-sonnet-4-5",  # or gpt-4o, gemini-2.5-pro, grok-4-1-fast...
    messages=[{"role": "user", "content": "Hello!"}]
)
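Because every model sits behind the same endpoint, switching providers is just a different model string. A minimal sketch reusing the client above (the model list and prompt are illustrative):
models = ["claude-sonnet-4-5", "gpt-4o", "gemini-2.5-pro", "grok-4-1-fast"]

for model in models:
    # Same client, same request shape; only the model string changes
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize HUD in one sentence."}]
    )
    print(model, response.choices[0].message.content)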
Every call is traced. View them at hud.ai/home. More on Gateway

2. Environments: Your Code, Agent-Ready

A production API is a single live instance with shared state: you can't run 1,000 parallel tests against it without them stepping on each other. Environments spin up fresh for every evaluation, so each run is isolated, deterministic, and reproducible, and each run generates training data. Turn your code into tools agents can call, then define scenarios that evaluate what agents do:
from hud import Environment

env = Environment("my-env")

@env.tool()
def search(query: str) -> str:
    """Search the knowledge base."""
    return db.search(query)

@env.scenario("find-answer")
async def find_answer(question: str):
    answer = yield f"Find the answer to: {question}"
    yield 1.0 if "correct" in answer.lower() else 0.0
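Scoring doesn't have to be a keyword check on the word "correct". Here is a sketch of a second scenario that scores against an expected answer; the extra expected parameter is an assumption, passed as a keyword argument the same way question is in the eval example below:
@env.scenario("exact-answer")
async def exact_answer(question: str, expected: str):
    # First yield: the prompt the agent receives
    answer = yield f"Answer concisely: {question}"
    # Second yield: the score (1.0 if the expected string appears in the answer)
    yield 1.0 if expected.lower() in answer.lower() else 0.0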
Scenarios define the prompt (first yield) and the scoring logic (second yield). The agent runs in between. More on Environments

3. Evals: Test and Improve

Run your scenario with different models. Compare results:
import hud

task = env("find-answer", question="What is 2+2?")

async with hud.eval(task, variants={"model": ["gpt-4o", "claude-sonnet-4-5"]}, group=5) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        messages=[{"role": "user", "content": ctx.prompt}]
    )
    await ctx.submit(response.choices[0].message.content)
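You can also vary more than the model. The sketch below assumes variants accepts multiple keys and runs every combination, which is an assumption about the API rather than something shown above; temperature is a standard chat-completions parameter:
# Assumption: multi-key variants are crossed (model x temperature)
async with hud.eval(
    task,
    variants={"model": ["gpt-4o", "claude-sonnet-4-5"], "temperature": [0.0, 0.7]},
    group=5,
) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        temperature=ctx.variants["temperature"],
        messages=[{"role": "user", "content": ctx.prompt}]
    )
    await ctx.submit(response.choices[0].message.content)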
Variants test different configurations; groups repeat each variant so you can see the distribution of scores. Results show up on hud.ai with scores, traces, and side-by-side comparisons. More on A/B Evals

4. Deploy and Scale

Push your environment to GitHub, connect it on hud.ai, and run thousands of evals in parallel. Every run generates training data.
hud init                    # Scaffold environment
git push                    # Push to GitHub
# Connect on hud.ai → New → Environment
hud eval my-eval --model gpt-4o --group-size 100
More on Deploy

Enterprise

Building agents at scale? We work with teams on custom environments, benchmarks, and training pipelines. 📅 Book a call · 📧 [email protected]