fix(gpqa): write evaluator output to results.json #31

Open
skyzer wants to merge 1 commit into hexo-ai:main from skyzer:fix/gpqa-results-json-contract

@skyzer skyzer commented Jun 7, 2026

Summary

  • make the bundled GPQA evaluator write results.json by default
  • keep explicit --output overrides unchanged
  • add a regression test for the SIA evaluator artifact contract

Why

run_evaluation() treats gen_*/results.json as the canonical evaluator artifact. GPQA previously wrote evaluation_results.json by default, so evaluation could complete but the orchestrator would still report that results.json was missing.
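The contract described above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: `resolve_output_path` and `orchestrator_sees_results` are hypothetical names, and the fallback used when neither `--output` nor `--gen-dir` is given is an assumption, since the PR does not show that branch.

```python
from pathlib import Path
from typing import Optional

# File name that run_evaluation() looks for inside gen_*/, per the PR description.
CANONICAL_ARTIFACT = "results.json"


def resolve_output_path(output: Optional[Path], gen_dir: Optional[Path]) -> Path:
    """Hypothetical helper mirroring the default-output logic changed in this PR."""
    if output is not None:
        # An explicit --output override is kept unchanged.
        return output
    if gen_dir is not None:
        # New default: write the canonical artifact into the generation directory.
        return gen_dir / CANONICAL_ARTIFACT
    # Assumption: fall back to the current directory when nothing is specified.
    return Path(CANONICAL_ARTIFACT)


def orchestrator_sees_results(gen_dir: Path) -> bool:
    """Hypothetical stand-in for the orchestrator's check for the artifact."""
    return (gen_dir / CANONICAL_ARTIFACT).is_file()
```

With the previous default of `evaluation_results.json`, a check like `orchestrator_sees_results` would return `False` even after a successful evaluation, which matches the symptom described here.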

Test plan

  • python3 -m pytest tests/test_gpqa_evaluator.py -q
  • python3 -m pytest tests/test_gpqa_evaluator.py tests/test_run_evaluation.py tests/test_run_evaluation_outcomes.py -q
  • python3 -m ruff check .
  • python3 -m pytest -q

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 364f9400af

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

         output_path = args.output
     elif args.gen_dir:
-        output_path = args.gen_dir / "evaluation_results.json"
+        output_path = args.gen_dir / "results.json"

[P2] Avoid reusing evaluator output as submission

When --gen-dir is used, find_submission_file() runs before this write and treats any root-level results*.json file as a submission candidate, returning it before it checks submission*.json. Once the first successful evaluation has created gen_dir/results.json, any rerun on the same generation directory will load the previous evaluator output as the model submission. The answers will then be missing or misread, and the new results can overwrite a correct score with incorrect or missing ones. Please exclude the evaluator artifact from submission discovery, or otherwise disambiguate it, before making results.json the default output name.
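One way to exclude the evaluator artifact from discovery might look like the sketch below. It is an illustration under stated assumptions, not the repository's `find_submission_file()`: the function name `find_submission_candidate`, the glob patterns, and the pattern order are inferred from the description above, and the real implementation may differ.

```python
from pathlib import Path
from typing import Optional

# Name the evaluator now writes by default; it must never be read back as a submission.
EVALUATOR_ARTIFACT = "results.json"


def find_submission_candidate(gen_dir: Path) -> Optional[Path]:
    """Hypothetical discovery routine that skips the evaluator's own output."""
    # Assumed pattern order, following the review comment: results*.json first,
    # then submission*.json.
    for pattern in ("results*.json", "submission*.json"):
        for candidate in sorted(gen_dir.glob(pattern)):
            if candidate.name == EVALUATOR_ARTIFACT:
                # Skip output left behind by a previous evaluation run.
                continue
            return candidate
    return None
```

With this exclusion, a rerun on a directory that already contains `results.json` would fall through to `submission*.json` instead of reloading the earlier evaluator output. An alternative design would be to check `submission*.json` first, though that changes precedence for directories that contain both kinds of file.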

@selvamHexo selvamHexo requested a review from yogendrahexo June 8, 2026 15:37