fix(gpqa): write evaluator output to results.json #31
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 364f9400af
     output_path = args.output
 elif args.gen_dir:
-    output_path = args.gen_dir / "evaluation_results.json"
+    output_path = args.gen_dir / "results.json"
Avoid reusing evaluator output as submission
When --gen-dir is used, find_submission_file() runs before this write and treats any root-level results*.json as a submission candidate, returning it before it checks submission*.json. Once the first successful evaluation creates gen_dir/results.json, any rerun on the same generation directory loads the previous evaluator output as the model submission. The answers are then missing or misread, and the new results can overwrite a correct score with an incorrect or empty one. Please exclude the evaluator artifact from submission discovery, or otherwise disambiguate it, before making it the default output name.
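One way to exclude the artifact is sketched below. This is an illustration only, not the repository's find_submission_file(): the function name find_submission_candidate, the pattern order, and the decision to skip a file named exactly results.json are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Assumption: the evaluator's default artifact is named exactly "results.json".
EVALUATOR_ARTIFACT_NAME = "results.json"


def find_submission_candidate(gen_dir: Path) -> Optional[Path]:
    """Hypothetical discovery helper that never returns the evaluator artifact.

    It checks submission*.json first, then results*.json, and skips the
    evaluator's own output so a rerun cannot pick it up as a submission.
    """
    evaluator_artifact = gen_dir / EVALUATOR_ARTIFACT_NAME
    for pattern in ("submission*.json", "results*.json"):
        # Sort so the choice is deterministic across filesystems.
        for candidate in sorted(gen_dir.glob(pattern)):
            if candidate == evaluator_artifact:
                continue  # evaluator output, not a model submission
            return candidate
    return None
```

A name-based exclusion like this would also hide a genuine submission that happens to be called results.json, so a real fix might instead check the file's contents or write the evaluator output under a distinct name.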
Summary
- Write evaluator output to results.json by default
- Keep --output overrides unchanged

Why
run_evaluation() treats gen_*/results.json as the canonical evaluator artifact. GPQA previously wrote evaluation_results.json by default, so evaluation could complete but the orchestrator would still report that results.json was missing.

Test plan
- python3 -m pytest tests/test_gpqa_evaluator.py -q
- python3 -m pytest tests/test_gpqa_evaluator.py tests/test_run_evaluation.py tests/test_run_evaluation_outcomes.py -q
- python3 -m ruff check .
- python3 -m pytest -q
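
The default described in the Summary can be sketched as a small helper. The names resolve_output_path and DEFAULT_OUTPUT_NAME are hypothetical, and returning None when neither option is set is an assumption; the diff only shows the --output and --gen-dir branches.

```python
import argparse
from pathlib import Path
from typing import Optional

# Name the orchestrator is described as expecting under gen_*/.
DEFAULT_OUTPUT_NAME = "results.json"


def resolve_output_path(args: argparse.Namespace) -> Optional[Path]:
    """Hypothetical helper mirroring the branch shown in the diff."""
    if args.output:
        # An explicit --output always wins and is left unchanged.
        return Path(args.output)
    if args.gen_dir:
        # Default: write next to the generation, under the canonical name.
        return Path(args.gen_dir) / DEFAULT_OUTPUT_NAME
    # Assumption: with neither flag there is no file to write.
    return None
```

For example, a namespace with only gen_dir set to gen_1 resolves to gen_1/results.json, while one that also sets output keeps the explicit path.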