Deep Agents Evals

End-to-end behavioral evaluation suite for the Deep Agents SDK. Each eval runs an agent against a real LLM, captures the full trajectory (tool calls, file mutations, final response), and scores it on correctness and efficiency.
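As a rough illustration of the capture-then-score idea described above, here is a minimal sketch in Python. It is not the suite's actual API: `Trajectory`, `Score`, `score_trajectory`, and every parameter name are hypothetical, and the tool names in the usage example are placeholders. The sketch assumes correctness means "expected files and an expected phrase are present" and efficiency means "tool calls stay within a budget"; the real scoring criteria may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one agent run (not the SDK's real type)."""
    tool_calls: list[str] = field(default_factory=list)   # tool names, in call order
    files: dict[str, str] = field(default_factory=dict)   # path -> final file content
    final_response: str = ""


@dataclass
class Score:
    """Hypothetical two-axis result: did it work, and was it cheap enough."""
    correct: bool
    efficient: bool


def score_trajectory(
    traj: Trajectory,
    *,
    expected_files: dict[str, str],
    expected_phrase: str,
    max_tool_calls: int,
) -> Score:
    """Score a captured trajectory under the assumptions stated above."""
    # Correctness: every expected file exists with exactly the expected content...
    files_ok = all(traj.files.get(path) == content
                   for path, content in expected_files.items())
    # ...and the final response mentions the expected phrase (case-insensitive).
    response_ok = expected_phrase.lower() in traj.final_response.lower()
    # Efficiency: the agent stayed within the tool-call budget.
    efficient = len(traj.tool_calls) <= max_tool_calls
    return Score(correct=files_ok and response_ok, efficient=efficient)


# Usage with a hand-built trajectory (placeholder tool names and paths).
traj = Trajectory(
    tool_calls=["write_file", "read_file"],
    files={"/notes.txt": "hello"},
    final_response="Wrote hello to /notes.txt.",
)
result = score_trajectory(
    traj,
    expected_files={"/notes.txt": "hello"},
    expected_phrase="notes.txt",
    max_tool_calls=3,
)
print(result)  # Score(correct=True, efficient=True)
```

Keeping correctness and efficiency as separate fields, rather than one combined number, lets a run that reaches the right answer with too many tool calls show up as a distinct outcome from a run that simply fails.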

See EVAL_CATALOG.md for the full list of evals and categories, and MODEL_GROUPS.md for the model catalog used by the eval workflow.

The suite also includes Harbor integration for running sandboxed benchmarks like Terminal Bench 2.0.

Results

| Suite | CI | LangSmith |
| --- | --- | --- |
| Evals | evals.yml | deepagents-evals |
| Harbor | harbor.yml | deepagents-harbor |

Contributing

Architecture, writing new evals, category system, Harbor setup, and LangSmith integration are all documented in CONTRIBUTING.md.

Resources

  • LangChain Academy: comprehensive, free courses on LangChain libraries and products, made by the LangChain team.
  • Code of Conduct: community guidelines and standards.