End-to-end behavioral evaluation suite for the Deep Agents SDK. Each eval runs an agent against a real LLM, captures the full trajectory (tool calls, file mutations, final response), and scores it on correctness and efficiency.
See EVAL_CATALOG.md for the full list of evals and categories, and MODEL_GROUPS.md for the model catalog used by the eval workflow.
The suite also includes Harbor integration for running sandboxed benchmarks like Terminal Bench 2.0.
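To make the run, capture, and score flow concrete, here is a minimal sketch of scoring a captured trajectory on correctness and efficiency. It is an illustration only, not the suite's actual API: `Trajectory`, `Score`, `score_trajectory`, the tool names, and the `max_tool_calls` budget are all hypothetical. The trajectory is hard-coded rather than produced by a real LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one agent run (not the SDK's real type)."""
    tool_calls: list[str] = field(default_factory=list)
    file_mutations: dict[str, str] = field(default_factory=dict)  # path -> final content
    final_response: str = ""


@dataclass
class Score:
    correct: bool    # did the run produce the expected files?
    efficient: bool  # did it stay within the tool-call budget?


def score_trajectory(
    traj: Trajectory,
    expected_files: dict[str, str],
    max_tool_calls: int,
) -> Score:
    """Score a trajectory under two assumed criteria.

    Correctness: every expected file exists with exactly the expected content.
    Efficiency: the number of tool calls does not exceed the budget.
    """
    correct = all(
        traj.file_mutations.get(path) == content
        for path, content in expected_files.items()
    )
    efficient = len(traj.tool_calls) <= max_tool_calls
    return Score(correct=correct, efficient=efficient)


# A hand-written trajectory standing in for a real agent run.
example = Trajectory(
    tool_calls=["read_file", "write_file"],  # illustrative tool names
    file_mutations={"notes.txt": "done"},
    final_response="I wrote notes.txt.",
)
result = score_trajectory(example, {"notes.txt": "done"}, max_tool_calls=3)
```

Keeping correctness and efficiency as separate fields, rather than one combined number, lets a run that reaches the right answer with too many tool calls show up as a distinct failure mode.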
| Suite | CI | LangSmith |
|---|---|---|
| Evals | evals.yml | deepagents-evals |
| Harbor | harbor.yml | deepagents-harbor |
CONTRIBUTING.md documents the architecture, how to write new evals, the category system, Harbor setup, and LangSmith integration.
- LangChain Academy: comprehensive, free courses on LangChain libraries and products, made by the LangChain team.
- Code of Conduct: community guidelines and standards.