Skip to content

fix(harness): exit 0 when expected safe-outputs already produced despite numerous permission-denied#41675

Merged
pelikhan merged 2 commits into
mainfrom
copilot/aw-failures-fix-copilot-cli
Jun 26, 2026
Merged

fix(harness): exit 0 when expected safe-outputs already produced despite numerous permission-denied#41675
pelikhan merged 2 commits into
mainfrom
copilot/aw-failures-fix-copilot-cli

Conversation

Copilot AI commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Agent runs that successfully emitted required safe-outputs (e.g. add_comment, submit_pull_request_review) were being marked failure (exit 1) when the permissionDeniedCount >= 5 threshold fired on optional/exploratory bash commands attempted after the core work completed. The classifier never consulted the safe-outputs file, so fully successful tasks produced false-red runs.

Root cause

hasNumerousPermissionDenied was a terminal condition regardless of whether the agent had already produced its expected output. Permission denials from plumbing attempts (python3 -c, stdin redirection, etc.) inflated the counter even when real work was done.

Fix

Before applying the permission_denied terminal verdict in all three harnesses, check whether the safe-outputs file already contains at least one non-diagnostic entry. If it does, suppress the verdict and exit 0.

if (hasNumerousPermissionDenied) {
  if (safeOutputsPath && hasExpectedSafeOutputs(safeOutputsPath, { logger: log })) {
    log(`attempt ${attempt + 1}: detected numerous permission-denied issues but safe-outputs already contain expected output — suppressing terminal verdict (false-red: core work succeeded)`);
    lastExitCode = 0;
    break;
  }
  // no expected outputs → original terminal path
  const deniedCommands = extractDeniedCommands(result.output);
  emitMissingToolPermissionIssue({ deniedCommands, logger: log });
  ...
}

Changes

  • safeoutputs_cli.cjs — adds hasExpectedSafeOutputs(): reads the safe-outputs JSONL and returns true if at least one entry has a type outside DIAGNOSTIC_SAFE_OUTPUT_TYPES (noop, missing_tool, report_incomplete)
  • copilot_harness.cjs, codex_harness.cjs, claude_harness.cjs — guard the hasNumerousPermissionDenied terminal block with hasExpectedSafeOutputs; suppression path exits 0 without emitting missing_tool
  • safeoutputs_cli.test.cjs — 14 unit tests covering all diagnostic/non-diagnostic type combinations, missing file, empty file, and injected-reader paths
  • copilot_harness.test.cjs — 2 integration tests: suppression fires when expected output exists; normal terminal path fires when it does not

…utputs already produced

When a Copilot/Codex/Claude agent successfully emits required safe-outputs (e.g.
add_comment, submit_pull_request_review) but also triggers numerous permission-denied
signals from optional/exploratory bash commands, the copilot-harness was classifying
the run as failure (exit 1) via hasNumerousPermissionDenied — a false-red.

Fix: before applying the terminal permission-denied verdict, check whether the
safe-outputs file already contains at least one non-diagnostic entry (i.e. any type
other than noop/missing_tool/report_incomplete). If expected output was already
produced, suppress the terminal verdict and exit 0 instead.

Changes:
- safeoutputs_cli.cjs: add hasExpectedSafeOutputs() helper + DIAGNOSTIC_SAFE_OUTPUT_TYPES
- copilot_harness.cjs: check hasExpectedSafeOutputs before emitting missing_tool
- codex_harness.cjs: same suppression logic
- claude_harness.cjs: same suppression logic
- safeoutputs_cli.test.cjs: 14 unit tests for hasExpectedSafeOutputs
- copilot_harness.test.cjs: 2 integration tests for suppression behavior

Closes #41636

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix Copilot CLI false-red on successful safe-outputs fix(harness): exit 0 when expected safe-outputs already produced despite numerous permission-denied Jun 26, 2026
Copilot AI requested a review from pelikhan June 26, 2026 12:24
@pelikhan pelikhan marked this pull request as ready for review June 26, 2026 12:25
Copilot AI review requested due to automatic review settings June 26, 2026 12:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the agent harnesses’ terminal failure classification so runs that already produced meaningful safe-outputs (e.g. add_comment, submit_pull_request_review) are not incorrectly marked as failures when a later “numerous permission-denied” threshold is reached by optional/exploratory commands.

Changes:

  • Add hasExpectedSafeOutputs() to detect whether the safe-outputs JSONL contains at least one non-diagnostic entry.
  • Update Copilot/Codex/Claude harnesses to suppress the terminal permission_denied verdict (and exit 0) when expected outputs already exist.
  • Add unit/integration tests covering the new safe-outputs detection and the Copilot harness suppression behavior.
Show a summary per file
File Description
actions/setup/js/safeoutputs_cli.cjs Adds helper to detect non-diagnostic safe-output entries and exports it for harness use.
actions/setup/js/safeoutputs_cli.test.cjs Adds unit tests for hasExpectedSafeOutputs() across valid/invalid/empty/missing JSONL scenarios.
actions/setup/js/copilot_harness.cjs Guards the numerous-permission-denied terminal path by checking safe-outputs first and exiting 0 when work already succeeded.
actions/setup/js/codex_harness.cjs Applies the same permission-denied suppression guard as Copilot harness.
actions/setup/js/claude_harness.cjs Applies the same permission-denied suppression guard as Copilot harness.
actions/setup/js/copilot_harness.test.cjs Adds integration tests for suppression vs. normal terminal behavior under numerous permission-denied conditions.

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 6/6 changed files
  • Comments generated: 2
  • Review effort level: Low

Comment on lines +161 to +166
/**
* Read the safe-outputs JSONL file and check whether it contains at least one
* non-diagnostic entry (i.e. an output that represents real, task-level work).
* Returns true when such an entry exists; false otherwise.
* Used by harnesses to suppress the terminal "numerous permission-denied" verdict
* when the agent already produced the expected output — preventing false-red runs.
Comment on lines +1964 to +1972
const result = spawnSync(process.execPath, ["copilot_harness.cjs", process.execPath, stubPath, "--prompt-file", promptPath], {
cwd: path.dirname(require.resolve("./copilot_harness.cjs")),
env: { ...process.env, COPILOT_HARNESS_STUB_CALLS: callsPath, GH_AW_SAFE_OUTPUTS: safeOutputsPath, GH_AW_SAFEOUTPUTS_CLI: "true" },
encoding: "utf8",
timeout: 15000,
});
// Harness exits 1 because no expected output was produced
expect(result.status).toBe(1);
expect(result.stderr).toContain("detected numerous permission-denied issues — not retrying");
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

All safe-output actions (add_comment + submit_pull_request_review APPROVE) were successfully completed in the first agent run of this workflow run (28237898971). This is a duplicate invocation — no further actions needed.

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #41675 does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100).

@github-actions github-actions Bot mentioned this pull request Jun 26, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose, /tdd, and /zoom-out to a focused bug-fix that eliminates false-red runs caused by permission-denied noise after core work completes.

The root cause analysis is solid, the fix is correctly applied across all three harnesses, and the unit test coverage for hasExpectedSafeOutputs is thorough (14 cases including malformed input, injected readers, and all diagnostic type combinations). Overall this is a well-scoped, well-tested fix.

📋 Key Themes & Findings

Observations

  • Missing integration tests for claude_harness and codex_harness: Both test files exist but the new suppression path is untested there. Since all three harnesses use identical guard code, a future divergence would only be caught by the copilot tests. (/tdd — see inline comment on copilot_harness.test.cjs)
  • DIAGNOSTIC_SAFE_OUTPUT_TYPES not exported: This constant is the semantic contract for the "expected vs diagnostic" distinction; exporting it would allow tests to stay in sync automatically rather than hardcoding the same strings. (/zoom-out — see inline comment on safeoutputs_cli.cjs:159)
  • Partial-output suppression: The "first non-diagnostic entry = core work succeeded" heuristic is intentional and correct for the primary scenario, but worth documenting — a task that emitted only partial outputs would also suppress the verdict. (/diagnose — see inline comment on safeoutputs_cli.cjs:196)
  • null/undefined path not explicitly tested: Minor gap alongside the """ test. (**/tdd`**)

Positive Highlights

  • ✅ Root cause clearly identified and precisely addressed — touches only the terminal-verdict branch
  • ✅ Identical guard logic applied consistently across all three harnesses
  • ✅ Excellent unit test coverage for hasExpectedSafeOutputs including injected readFileSync for testability
  • ✅ Integration tests cover both the suppression path and the negative (no-output) path
  • ✅ Log message at suppression site is descriptive and will help future debugging

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 65.3 AIC · ⌖ 7.37 AIC · ⊞ 6.5K

});
});

describe("permission-denied suppression when expected safe-outputs already produced", () => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Integration coverage exists only for copilot_harness; claude_harness.test.cjs and codex_harness.test.cjs already exist but have no tests for the new suppression path.

💡 Suggestion

Add parallel integration tests in each of those files (same stub pattern as lines 1912–1945 here):

describe("permission-denied suppression when expected safe-outputs already produced", () => {
  it("exits 0 and suppresses verdict when safe-output already written", () => { ... });
  it("exits 1 when no expected safe-outputs present", () => { ... });
});

All three harnesses use identical guard logic, so divergence from a future refactor or merge conflict would only be caught in the copilot harness today.

* Diagnostic safe-output types that represent infrastructure signals, not task-level work.
* Entries with these types are excluded when checking whether expected outputs were produced.
*/
const DIAGNOSTIC_SAFE_OUTPUT_TYPES = new Set(["noop", "missing_tool", "report_incomplete"]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/zoom-out] DIAGNOSTIC_SAFE_OUTPUT_TYPES is not exported, making it harder for callers and tests to stay in sync if types are added or removed.

💡 Suggestion

Export the constant so external code can enumerate diagnostic types without hardcoding the same strings:

// safeoutputs_cli.cjs
module.exports = {
  // ...
  DIAGNOSTIC_SAFE_OUTPUT_TYPES,
  hasExpectedSafeOutputs,
};

Tests can then import it to drive parametric cases (e.g. for (const t of DIAGNOSTIC_SAFE_OUTPUT_TYPES)) rather than repeating "noop", "missing_tool", "report_incomplete" in separate test files.

if (!trimmed) continue;
try {
const parsed = JSON.parse(trimmed);
if (parsed && parsed.type && !DIAGNOSTIC_SAFE_OUTPUT_TYPES.has(parsed.type)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] The heuristic "any single non-diagnostic entry = core work succeeded" is correct for the primary false-red scenario, but it also fires when an agent produced only a partial set of required outputs before hitting permission denials.

💡 Context

For example, an agent that wrote create_pull_request_review_comment but never reached submit_pull_request_review would still suppress the verdict, causing the harness to exit 0 while only half the task's outputs are present. The downstream safe-output runner would process an incomplete set silently.

This is a known trade-off (the PR description acknowledges "at least one"), but it's worth documenting explicitly in the JSDoc:

/**
 * ...
 * Note: returns true on the *first* non-diagnostic entry found — it does not
 * verify that all required outputs for the task were written. A partial output
 * set will still suppress the permission-denied verdict.
 */

No code change needed; just a documentation nudge so future readers understand the intentional looseness of this check.

expect(hasExpectedSafeOutputs("/tmp/nonexistent-safe-outputs-expected.jsonl")).toBe(false);
});

it("returns false when safeOutputsPath is empty", () => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The test covers the empty-string path but not null or undefined explicitly. While the harnesses guard with safeOutputsPath && before calling the function, the function is exported and could be called without that guard.

💡 Suggestion
it("returns false when safeOutputsPath is null", () => {
  expect(hasExpectedSafeOutputs(null)).toBe(false);
});

it("returns false when safeOutputsPath is undefined", () => {
  expect(hasExpectedSafeOutputs(undefined)).toBe(false);
});

The internal if (!safeOutputsPath) return false guard already handles these, so these tests would just confirm the contract for external callers.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core fix is sound: the hasExpectedSafeOutputs helper is well-tested in isolation and the harness guard correctly suppresses the false-red verdict only when real task outputs exist.

Findings summary

Blocking

None.

Non-blocking (2 comments)

  1. DIAGNOSTIC_SAFE_OUTPUT_TYPES negative allowlist (safeoutputs_cli.cjs line 159) — any new diagnostic type added without updating this set will be treated as real task success, and non-string type values bypass the check entirely. A typeof parsed.type === "string" guard is the minimal fix.
  2. No integration tests for codex/claude harness suppression paths (codex_harness.cjs) — the identical guard is copy-pasted into all three harnesses but integration-tested only in copilot_harness.test.cjs.

Two existing review comments (function name semantics, GH_AW_SAFEOUTPUTS_CLI: "true" portability) are already filed and not duplicated here.

🔎 Code quality review by PR Code Quality Reviewer · 78.1 AIC · ⌖ 7.43 AIC · ⊞ 5.2K

* Diagnostic safe-output types that represent infrastructure signals, not task-level work.
* Entries with these types are excluded when checking whether expected outputs were produced.
*/
const DIAGNOSTIC_SAFE_OUTPUT_TYPES = new Set(["noop", "missing_tool", "report_incomplete"]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative allowlist will silently misclassify any new diagnostic safe-output type added in the future, causing false-green suppression when no real task work was done.

💡 Details

DIAGNOSTIC_SAFE_OUTPUT_TYPES is a closed negative allowlist. If a new diagnostic-only type is ever added to the safe-outputs system (e.g. heartbeat, progress_update, trace) without also updating this set, hasExpectedSafeOutputs will treat it as a real task output and suppress the permission-denied verdict — turning a genuine failure into a silent exit 0.

A secondary robustness gap: parsed.type is not validated as a string, so a malformed entry like {"type": 42} has a truthy type that is not in the Set, causing the function to return true incorrectly.

Option A — minimal fix (typeof guard):

if (parsed && typeof parsed.type === "string" && !DIAGNOSTIC_SAFE_OUTPUT_TYPES.has(parsed.type)) {

Option B — flip to a positive allowlist (explicit, but must be updated as new tools are added):

const REAL_SAFE_OUTPUT_TYPES = new Set([
  "add_comment", "create_pull_request", "submit_pull_request_review",
  "create_pull_request_review_comment", "create_check_run",
  // extend here when new task-output tools are added
]);
if (parsed && typeof parsed.type === "string" && REAL_SAFE_OUTPUT_TYPES.has(parsed.type)) {

Option A is the minimal safe fix regardless of which approach is preferred long-term.

}

if (hasNumerousPermissionDenied) {
// If the agent already produced expected safe-outputs, the permission-denied

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suppression path added here (and in claude_harness.cjs) has no integration tests, so a regression in either harness would ship silently.

💡 Details

copilot_harness.test.cjs grows two integration tests that validate the suppression and the non-suppression paths end-to-end. The identical guard block was copy-pasted into codex_harness.cjs and claude_harness.cjs, but neither harness has a corresponding test. The unit tests in safeoutputs_cli.test.cjs cover hasExpectedSafeOutputs in isolation but do not exercise the harness integration glue (the lastExitCode = 0; break; path, the safeOutputsPath && guard, the log message). A typo, wrong variable, or subtle difference in how the other harnesses handle lastExitCode after the loop would go undetected.

The quickest fix is to add the same two describe-block tests to the test suites for codex_harness and claude_harness, or factor them into a shared helper both suites call.

@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

⚠️ Test Quality Score: 78/100 — Acceptable

Analyzed 17 test(s): 17 design, 0 implementation, 0 guideline violation(s). New hasExpectedSafeOutputs function is thoroughly covered with unit tests; integration tests validate the full harness suppression flow.

📊 Metrics & Test Classification (17 tests analyzed)
Metric Value
New/modified tests analyzed 17
✅ Design tests (behavioral contracts) 17 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 13 (76%)
Duplicate test clusters 1
Test inflation detected Yes — copilot_harness.test.cjs 6.6:1, safeoutputs_cli.test.cjs 2.5:1
🚨 Coding-guideline violations 0
Test File Classification Issues Detected
exits 0 — suppresses verdict when expected safe-output produced copilot_harness.test.cjs ✅ Design
exits 1 — emits missing_tool when no expected safe-output copilot_harness.test.cjs ✅ Design
returns true when non-diagnostic entry present safeoutputs_cli.test.cjs ✅ Design
returns true for submit_pull_request_review entry safeoutputs_cli.test.cjs ✅ Design
returns false when all entries are diagnostic types safeoutputs_cli.test.cjs ✅ Design
returns false when only noop entry safeoutputs_cli.test.cjs ✅ Design ⚠️ Dup cluster
returns false when only missing_tool entry safeoutputs_cli.test.cjs ✅ Design ⚠️ Dup cluster
returns false when only report_incomplete entry safeoutputs_cli.test.cjs ✅ Design ⚠️ Dup cluster
returns true when expected entry mixed with diagnostics safeoutputs_cli.test.cjs ✅ Design
returns false when file does not exist safeoutputs_cli.test.cjs ✅ Design
returns false when safeOutputsPath is empty safeoutputs_cli.test.cjs ✅ Design
returns false for empty file safeoutputs_cli.test.cjs ✅ Design
skips malformed lines — returns false (no valid entry) safeoutputs_cli.test.cjs ✅ Design
skips malformed lines — returns true (valid entry follows) safeoutputs_cli.test.cjs ✅ Design
uses injected readFileSync for testability safeoutputs_cli.test.cjs ✅ Design
returns false when injected reader returns only diagnostics safeoutputs_cli.test.cjs ✅ Design
returns false when injected readFileSync throws safeoutputs_cli.test.cjs ✅ Design

Go: 0 (*_test.go); JavaScript: 17 (*.test.cjs). No other languages detected.

Score breakdown: behavioral 40/40 + edge-case 22.9/30 + low-duplication 15/20 + proportional-growth 0/10 = 78

⚠️ Flagged Tests — Minor Issues (1)

Duplicate cluster (safeoutputs_cli.test.cjs) — returns false when only noop, returns false when only missing_tool, and returns false when only report_incomplete share the same assertion structure (makeTempFile(single-type-JSONL)toBe(false)) with trivially different type constants. The preceding test "returns false when all entries are diagnostic types" already covers all three together. Each individual guard is defensible — a regression removing one type from DIAGNOSTIC_SAFE_OUTPUT_TYPES would be caught by exactly one test — but consider a parameterized/table-driven approach if the diagnostic types list grows.

Test inflation — Both test files exceed the 2:1 lines-added ratio: copilot_harness.test.cjs (66 test lines / 10 prod lines = 6.6×) and safeoutputs_cli.test.cjs (133/53 = 2.5×). This is contextually expected: the two integration tests require full process scaffolding (temp dirs, stub scripts, spawnSync invocations) that naturally inflates line count relative to the compact production guard. Not actionable.

Verdict

Check passed. 0% implementation tests (threshold: 30%). All 17 new tests verify observable behavior. The two integration tests exercise the full harness flow end-to-end using real process spawning and filesystem I/O — no mocks. Unit tests cover all boundary conditions for hasExpectedSafeOutputs including missing files, empty paths, malformed JSON, and dependency injection.

🧪 Test quality analysis by Test Quality Sentinel · 109.4 AIC · ⌖ 23.8 AIC · ⊞ 8.4K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 78/100. Test quality is acceptable — 0% of new tests are implementation tests (threshold: 30%).

@pelikhan pelikhan merged commit a865589 into main Jun 26, 2026
53 of 61 checks passed
@pelikhan pelikhan deleted the copilot/aw-failures-fix-copilot-cli branch June 26, 2026 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants