fix(gpqa): write evaluator output to results.json by skyzer · Pull Request #31 · hexo-ai/sia

skyzer · 2026-06-07T01:39:29Z

Summary

make the bundled GPQA evaluator write results.json by default
keep explicit --output overrides unchanged
add a regression test for the SIA evaluator artifact contract
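The first two points can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: `build_parser` and `resolve_output_path` are hypothetical names, the `if args.output:` branch is inferred from the diff hunk in the review, and returning `None` when neither flag is given is an assumption.

```python
import argparse
from pathlib import Path
from typing import Optional

# File name the PR makes the default evaluator output.
DEFAULT_RESULTS_NAME = "results.json"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two flags named in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gen-dir", type=Path, default=None)
    parser.add_argument("--output", type=Path, default=None)
    return parser


def resolve_output_path(args: argparse.Namespace) -> Optional[Path]:
    """An explicit --output wins; otherwise default to <gen_dir>/results.json."""
    if args.output:
        return args.output
    if args.gen_dir:
        return args.gen_dir / DEFAULT_RESULTS_NAME
    # Assumption: the real evaluator's behavior without either flag is not shown.
    return None
```

With this shape, `--gen-dir gen_1` resolves to `gen_1/results.json`, while `--gen-dir gen_1 --output custom.json` still resolves to `custom.json`.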

Why

run_evaluation() treats gen_*/results.json as the canonical evaluator artifact. GPQA previously wrote evaluation_results.json by default, so evaluation could complete but the orchestrator would still report that results.json was missing.
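The mismatch can be reproduced in isolation. The sketch below assumes the orchestrator's check amounts to looking for a file named results.json in the generation directory; `has_canonical_artifact`, `write_results`, and `demo_contract` are hypothetical helpers, not functions from the repository.

```python
import json
import tempfile
from pathlib import Path

# Name the orchestrator is described as treating as canonical.
CANONICAL_NAME = "results.json"


def has_canonical_artifact(gen_dir: Path) -> bool:
    """Hypothetical stand-in for the orchestrator's artifact check."""
    return (gen_dir / CANONICAL_NAME).is_file()


def write_results(gen_dir: Path, filename: str, payload: dict) -> Path:
    """Write an evaluator payload under the given file name."""
    out = gen_dir / filename
    out.write_text(json.dumps(payload))
    return out


def demo_contract() -> tuple:
    """Return (found with old default name, found with new default name)."""
    with tempfile.TemporaryDirectory() as tmp:
        gen_dir = Path(tmp)
        # Old default: evaluation succeeds, but the canonical file is absent.
        write_results(gen_dir, "evaluation_results.json", {"accuracy": 0.5})
        before = has_canonical_artifact(gen_dir)
        # New default: the canonical file exists.
        write_results(gen_dir, CANONICAL_NAME, {"accuracy": 0.5})
        after = has_canonical_artifact(gen_dir)
    return before, after
```

Under these assumptions `demo_contract()` returns `(False, True)`: the old file name leaves the check unsatisfied even though evaluation finished.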

Test plan

python3 -m pytest tests/test_gpqa_evaluator.py -q
python3 -m pytest tests/test_gpqa_evaluator.py tests/test_run_evaluation.py tests/test_run_evaluation_outcomes.py -q
python3 -m ruff check .
python3 -m pytest -q

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 364f9400af


chatgpt-codex-connector · 2026-06-07T01:41:42Z

        output_path = args.output
    elif args.gen_dir:
-        output_path = args.gen_dir / "evaluation_results.json"
+        output_path = args.gen_dir / "results.json"


Avoid reusing evaluator output as submission

When --gen-dir is used, find_submission_file() runs before this write. It treats any root-level results*.json as a submission candidate and returns it before checking submission*.json. Once the first successful evaluation creates gen_dir/results.json, any rerun on the same generation directory loads the previous evaluator output as the model submission. The answers are then absent or misread, and the new results can overwrite a correct score with incorrect or missing ones. Please exclude the evaluator artifact from submission discovery, or otherwise disambiguate it, before making it the default output name.
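One possible shape for the exclusion is sketched below. The real find_submission_file() is not shown in this thread, so `find_submission_file_sketch` is a hypothetical reimplementation that only mirrors the lookup order described above (results*.json before submission*.json) and skips the exact evaluator artifact name.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Evaluator output name after this PR; it must never be read back as a submission.
EVALUATOR_ARTIFACT = "results.json"


def find_submission_file_sketch(gen_dir: Path) -> Optional[Path]:
    """Hypothetical discovery that ignores the evaluator's own artifact."""
    # Same pattern order as described in the review comment.
    for pattern in ("results*.json", "submission*.json"):
        for candidate in sorted(gen_dir.glob(pattern)):
            if candidate.is_file() and candidate.name != EVALUATOR_ARTIFACT:
                return candidate
    return None


def demo_discovery() -> Optional[str]:
    """Simulate a rerun on a directory that already holds evaluator output."""
    with tempfile.TemporaryDirectory() as tmp:
        gen_dir = Path(tmp)
        (gen_dir / "results.json").write_text("{}")     # left by a previous evaluation
        (gen_dir / "submission.json").write_text("{}")  # the actual model submission
        found = find_submission_file_sketch(gen_dir)
        return found.name if found else None
```

This only filters the exact name results.json, so other files matching results*.json remain candidates. Whether that is enough depends on which names real submissions use, which the thread does not settle.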


fix(gpqa): write evaluator output to results.json

364f940

chatgpt-codex-connector Bot reviewed Jun 7, 2026


selvamHexo requested a review from yogendrahexo June 8, 2026 15:37

skyzer wants to merge 1 commit into hexo-ai:main from skyzer:fix/gpqa-results-json-contract
