fix(harness): exit 0 when expected safe-outputs already produced despite numerous permission-denied by Copilot · Pull Request #41675 · github/gh-aw

Copilot · 2026-06-26T11:59:55Z

Agent runs that successfully emitted required safe-outputs (e.g. add_comment, submit_pull_request_review) were being marked failure (exit 1) when the permissionDeniedCount >= 5 threshold fired on optional/exploratory bash commands attempted after the core work completed. The classifier never consulted the safe-outputs file, so fully successful tasks produced false-red runs.

Root cause

hasNumerousPermissionDenied was a terminal condition regardless of whether the agent had already produced its expected output. Permission denials from plumbing attempts (python3 -c, stdin redirection, etc.) inflated the counter even when real work was done.

Fix

Before applying the permission_denied terminal verdict in all three harnesses, check whether the safe-outputs file already contains at least one non-diagnostic entry. If it does, suppress the verdict and exit 0.

if (hasNumerousPermissionDenied) {
  if (safeOutputsPath && hasExpectedSafeOutputs(safeOutputsPath, { logger: log })) {
    log(`attempt ${attempt + 1}: detected numerous permission-denied issues but safe-outputs already contain expected output — suppressing terminal verdict (false-red: core work succeeded)`);
    lastExitCode = 0;
    break;
  }
  // no expected outputs → original terminal path
  const deniedCommands = extractDeniedCommands(result.output);
  emitMissingToolPermissionIssue({ deniedCommands, logger: log });
  ...
}

Changes

safeoutputs_cli.cjs — adds hasExpectedSafeOutputs(): reads the safe-outputs JSONL and returns true if at least one entry has a type outside DIAGNOSTIC_SAFE_OUTPUT_TYPES (noop, missing_tool, report_incomplete)
copilot_harness.cjs, codex_harness.cjs, claude_harness.cjs — guard the hasNumerousPermissionDenied terminal block with hasExpectedSafeOutputs; suppression path exits 0 without emitting missing_tool
safeoutputs_cli.test.cjs — 14 unit tests covering all diagnostic/non-diagnostic type combinations, missing file, empty file, and injected-reader paths
copilot_harness.test.cjs — 2 integration tests: suppression fires when expected output exists; normal terminal path fires when it does not

…utputs already produced When a Copilot/Codex/Claude agent successfully emits required safe-outputs (e.g. add_comment, submit_pull_request_review) but also triggers numerous permission-denied signals from optional/exploratory bash commands, the copilot-harness was classifying the run as failure (exit 1) via hasNumerousPermissionDenied — a false-red. Fix: before applying the terminal permission-denied verdict, check whether the safe-outputs file already contains at least one non-diagnostic entry (i.e. any type other than noop/missing_tool/report_incomplete). If expected output was already produced, suppress the terminal verdict and exit 0 instead. Changes: - safeoutputs_cli.cjs: add hasExpectedSafeOutputs() helper + DIAGNOSTIC_SAFE_OUTPUT_TYPES - copilot_harness.cjs: check hasExpectedSafeOutputs before emitting missing_tool - codex_harness.cjs: same suppression logic - claude_harness.cjs: same suppression logic - safeoutputs_cli.test.cjs: 14 unit tests for hasExpectedSafeOutputs - copilot_harness.test.cjs: 2 integration tests for suppression behavior Closes #41636 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot

Pull request overview

This PR adjusts the agent harnesses’ terminal failure classification so runs that already produced meaningful safe-outputs (e.g. add_comment, submit_pull_request_review) are not incorrectly marked as failures when a later “numerous permission-denied” threshold is reached by optional/exploratory commands.

Changes:

Add hasExpectedSafeOutputs() to detect whether the safe-outputs JSONL contains at least one non-diagnostic entry.
Update Copilot/Codex/Claude harnesses to suppress the terminal permission_denied verdict (and exit 0) when expected outputs already exist.
Add unit/integration tests covering the new safe-outputs detection and the Copilot harness suppression behavior.

Show a summary per file

File	Description
actions/setup/js/safeoutputs_cli.cjs	Adds helper to detect non-diagnostic safe-output entries and exports it for harness use.
actions/setup/js/safeoutputs_cli.test.cjs	Adds unit tests for `hasExpectedSafeOutputs()` across valid/invalid/empty/missing JSONL scenarios.
actions/setup/js/copilot_harness.cjs	Guards the numerous-permission-denied terminal path by checking safe-outputs first and exiting 0 when work already succeeded.
actions/setup/js/codex_harness.cjs	Applies the same permission-denied suppression guard as Copilot harness.
actions/setup/js/claude_harness.cjs	Applies the same permission-denied suppression guard as Copilot harness.
actions/setup/js/copilot_harness.test.cjs	Adds integration tests for suppression vs. normal terminal behavior under numerous permission-denied conditions.

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 6/6 changed files
Comments generated: 2
Review effort level: Low

+/**
+ * Read the safe-outputs JSONL file and check whether it contains at least one
+ * non-diagnostic entry (i.e. an output that represents real, task-level work).
+ * Returns true when such an entry exists; false otherwise.
+ * Used by harnesses to suppress the terminal "numerous permission-denied" verdict
+ * when the agent already produced the expected output — preventing false-red runs.


+      const result = spawnSync(process.execPath, ["copilot_harness.cjs", process.execPath, stubPath, "--prompt-file", promptPath], {
+        cwd: path.dirname(require.resolve("./copilot_harness.cjs")),
+        env: { ...process.env, COPILOT_HARNESS_STUB_CALLS: callsPath, GH_AW_SAFE_OUTPUTS: safeOutputsPath, GH_AW_SAFEOUTPUTS_CLI: "true" },
+        encoding: "utf8",
+        timeout: 15000,
+      });
+      // Harness exits 1 because no expected output was produced
+      expect(result.status).toBe(1);
+      expect(result.stderr).toContain("detected numerous permission-denied issues — not retrying");


github-actions · 2026-06-26T12:40:39Z

✅ PR Code Quality Reviewer completed the code quality review.

github-actions · 2026-06-26T12:40:42Z

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

github-actions · 2026-06-26T12:40:49Z

✅ Test Quality Sentinel completed test quality analysis.

All safe-output actions (add_comment + submit_pull_request_review APPROVE) were successfully completed in the first agent run of this workflow run (28237898971). This is a duplicate invocation — no further actions needed.

github-actions · 2026-06-26T12:41:37Z

✅ Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #41675 does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100).

github-actions

Skills-Based Review 🧠

Applied /diagnose, /tdd, and /zoom-out to a focused bug-fix that eliminates false-red runs caused by permission-denied noise after core work completes.

The root cause analysis is solid, the fix is correctly applied across all three harnesses, and the unit test coverage for hasExpectedSafeOutputs is thorough (14 cases including malformed input, injected readers, and all diagnostic type combinations). Overall this is a well-scoped, well-tested fix.

📋 Key Themes & Findings

Observations

Missing integration tests for claude_harness and codex_harness: Both test files exist but the new suppression path is untested there. Since all three harnesses use identical guard code, a future divergence would only be caught by the copilot tests. (/tdd — see inline comment on copilot_harness.test.cjs)
DIAGNOSTIC_SAFE_OUTPUT_TYPES not exported: This constant is the semantic contract for the "expected vs diagnostic" distinction; exporting it would allow tests to stay in sync automatically rather than hardcoding the same strings. (/zoom-out — see inline comment on safeoutputs_cli.cjs:159)
Partial-output suppression: The "first non-diagnostic entry = core work succeeded" heuristic is intentional and correct for the primary scenario, but worth documenting — a task that emitted only partial outputs would also suppress the verdict. (/diagnose — see inline comment on safeoutputs_cli.cjs:196)
null/undefined path not explicitly tested: Minor gap alongside the """ test. (**/tdd`**)

Positive Highlights

✅ Root cause clearly identified and precisely addressed — touches only the terminal-verdict branch
✅ Identical guard logic applied consistently across all three harnesses
✅ Excellent unit test coverage for hasExpectedSafeOutputs including injected readFileSync for testability
✅ Integration tests cover both the suppression path and the negative (no-output) path
✅ Log message at suppression site is descriptive and will help future debugging

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 65.3 AIC · ⌖ 7.37 AIC · ⊞ 6.5K

github-actions · 2026-06-26T12:48:39Z

    });
  });
+
+  describe("permission-denied suppression when expected safe-outputs already produced", () => {


[/tdd] Integration coverage exists only for copilot_harness; claude_harness.test.cjs and codex_harness.test.cjs already exist but have no tests for the new suppression path.

💡 Suggestion

Add parallel integration tests in each of those files (same stub pattern as lines 1912–1945 here):

describe("permission-denied suppression when expected safe-outputs already produced", () => { it("exits 0 and suppresses verdict when safe-output already written", () => { ... }); it("exits 1 when no expected safe-outputs present", () => { ... }); });

All three harnesses use identical guard logic, so divergence from a future refactor or merge conflict would only be caught in the copilot harness today.

github-actions · 2026-06-26T12:48:39Z

+ * Diagnostic safe-output types that represent infrastructure signals, not task-level work.
+ * Entries with these types are excluded when checking whether expected outputs were produced.
+ */
+const DIAGNOSTIC_SAFE_OUTPUT_TYPES = new Set(["noop", "missing_tool", "report_incomplete"]);


[/zoom-out] DIAGNOSTIC_SAFE_OUTPUT_TYPES is not exported, making it harder for callers and tests to stay in sync if types are added or removed.

💡 Suggestion

Export the constant so external code can enumerate diagnostic types without hardcoding the same strings:

// safeoutputs_cli.cjs module.exports = { // ... DIAGNOSTIC_SAFE_OUTPUT_TYPES, hasExpectedSafeOutputs, };

Tests can then import it to drive parametric cases (e.g. for (const t of DIAGNOSTIC_SAFE_OUTPUT_TYPES)) rather than repeating "noop", "missing_tool", "report_incomplete" in separate test files.

github-actions · 2026-06-26T12:48:40Z

+    if (!trimmed) continue;
+    try {
+      const parsed = JSON.parse(trimmed);
+      if (parsed && parsed.type && !DIAGNOSTIC_SAFE_OUTPUT_TYPES.has(parsed.type)) {


[/diagnose] The heuristic "any single non-diagnostic entry = core work succeeded" is correct for the primary false-red scenario, but it also fires when an agent produced only a partial set of required outputs before hitting permission denials.

💡 Context

For example, an agent that wrote create_pull_request_review_comment but never reached submit_pull_request_review would still suppress the verdict, causing the harness to exit 0 while only half the task's outputs are present. The downstream safe-output runner would process an incomplete set silently.

This is a known trade-off (the PR description acknowledges "at least one"), but it's worth documenting explicitly in the JSDoc:

/** * ... * Note: returns true on the *first* non-diagnostic entry found — it does not * verify that all required outputs for the task were written. A partial output * set will still suppress the permission-denied verdict. */

No code change needed; just a documentation nudge so future readers understand the intentional looseness of this check.

github-actions · 2026-06-26T12:48:40Z

+      expect(hasExpectedSafeOutputs("/tmp/nonexistent-safe-outputs-expected.jsonl")).toBe(false);
+    });
+
+    it("returns false when safeOutputsPath is empty", () => {


[/tdd] The test covers the empty-string path but not null or undefined explicitly. While the harnesses guard with safeOutputsPath && before calling the function, the function is exported and could be called without that guard.

💡 Suggestion

it("returns false when safeOutputsPath is null", () => { expect(hasExpectedSafeOutputs(null)).toBe(false); }); it("returns false when safeOutputsPath is undefined", () => { expect(hasExpectedSafeOutputs(undefined)).toBe(false); });

The internal if (!safeOutputsPath) return false guard already handles these, so these tests would just confirm the contract for external callers.

github-actions

The core fix is sound: the hasExpectedSafeOutputs helper is well-tested in isolation and the harness guard correctly suppresses the false-red verdict only when real task outputs exist.

Findings summary

Blocking

None.

Non-blocking (2 comments)

DIAGNOSTIC_SAFE_OUTPUT_TYPES negative allowlist (safeoutputs_cli.cjs line 159) — any new diagnostic type added without updating this set will be treated as real task success, and non-string type values bypass the check entirely. A typeof parsed.type === "string" guard is the minimal fix.
No integration tests for codex/claude harness suppression paths (codex_harness.cjs) — the identical guard is copy-pasted into all three harnesses but integration-tested only in copilot_harness.test.cjs.

Two existing review comments (function name semantics, GH_AW_SAFEOUTPUTS_CLI: "true" portability) are already filed and not duplicated here.

🔎 Code quality review by PR Code Quality Reviewer · 78.1 AIC · ⌖ 7.43 AIC · ⊞ 5.2K

github-actions · 2026-06-26T12:49:07Z

+ * Diagnostic safe-output types that represent infrastructure signals, not task-level work.
+ * Entries with these types are excluded when checking whether expected outputs were produced.
+ */
+const DIAGNOSTIC_SAFE_OUTPUT_TYPES = new Set(["noop", "missing_tool", "report_incomplete"]);


Negative allowlist will silently misclassify any new diagnostic safe-output type added in the future, causing false-green suppression when no real task work was done.

💡 Details

DIAGNOSTIC_SAFE_OUTPUT_TYPES is a closed negative allowlist. If a new diagnostic-only type is ever added to the safe-outputs system (e.g. heartbeat, progress_update, trace) without also updating this set, hasExpectedSafeOutputs will treat it as a real task output and suppress the permission-denied verdict — turning a genuine failure into a silent exit 0.

A secondary robustness gap: parsed.type is not validated as a string, so a malformed entry like {"type": 42} has a truthy type that is not in the Set, causing the function to return true incorrectly.

Option A — minimal fix (typeof guard):

if (parsed && typeof parsed.type === "string" && !DIAGNOSTIC_SAFE_OUTPUT_TYPES.has(parsed.type)) {

Option B — flip to a positive allowlist (explicit, but must be updated as new tools are added):

const REAL_SAFE_OUTPUT_TYPES = new Set([ "add_comment", "create_pull_request", "submit_pull_request_review", "create_pull_request_review_comment", "create_check_run", // extend here when new task-output tools are added ]); if (parsed && typeof parsed.type === "string" && REAL_SAFE_OUTPUT_TYPES.has(parsed.type)) {

Option A is the minimal safe fix regardless of which approach is preferred long-term.

github-actions · 2026-06-26T12:49:07Z

    }

    if (hasNumerousPermissionDenied) {
+      // If the agent already produced expected safe-outputs, the permission-denied


The suppression path added here (and in claude_harness.cjs) has no integration tests, so a regression in either harness would ship silently.

💡 Details

copilot_harness.test.cjs grows two integration tests that validate the suppression and the non-suppression paths end-to-end. The identical guard block was copy-pasted into codex_harness.cjs and claude_harness.cjs, but neither harness has a corresponding test. The unit tests in safeoutputs_cli.test.cjs cover hasExpectedSafeOutputs in isolation but do not exercise the harness integration glue (the lastExitCode = 0; break; path, the safeOutputsPath && guard, the log message). A typo, wrong variable, or subtle difference in how the other harnesses handle lastExitCode after the loop would go undetected.

The quickest fix is to add the same two describe-block tests to the test suites for codex_harness and claude_harness, or factor them into a shared helper both suites call.

github-actions · 2026-06-26T12:51:34Z

🧪 Test Quality Sentinel Report

⚠️ Test Quality Score: 78/100 — Acceptable

Analyzed 17 test(s): 17 design, 0 implementation, 0 guideline violation(s). New hasExpectedSafeOutputs function is thoroughly covered with unit tests; integration tests validate the full harness suppression flow.

📊 Metrics & Test Classification (17 tests analyzed)

Metric	Value
New/modified tests analyzed	17
✅ Design tests (behavioral contracts)	17 (100%)
⚠️ Implementation tests (low value)	0 (0%)
Tests with error/edge cases	13 (76%)
Duplicate test clusters	1
Test inflation detected	Yes — `copilot_harness.test.cjs` 6.6:1, `safeoutputs_cli.test.cjs` 2.5:1
🚨 Coding-guideline violations	0

Test	File	Classification	Issues Detected
`exits 0 — suppresses verdict when expected safe-output produced`	`copilot_harness.test.cjs`	✅ Design	—
`exits 1 — emits missing_tool when no expected safe-output`	`copilot_harness.test.cjs`	✅ Design	—
`returns true when non-diagnostic entry present`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns true for submit_pull_request_review entry`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false when all entries are diagnostic types`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false when only noop entry`	`safeoutputs_cli.test.cjs`	✅ Design	⚠️ Dup cluster
`returns false when only missing_tool entry`	`safeoutputs_cli.test.cjs`	✅ Design	⚠️ Dup cluster
`returns false when only report_incomplete entry`	`safeoutputs_cli.test.cjs`	✅ Design	⚠️ Dup cluster
`returns true when expected entry mixed with diagnostics`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false when file does not exist`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false when safeOutputsPath is empty`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false for empty file`	`safeoutputs_cli.test.cjs`	✅ Design	—
`skips malformed lines — returns false (no valid entry)`	`safeoutputs_cli.test.cjs`	✅ Design	—
`skips malformed lines — returns true (valid entry follows)`	`safeoutputs_cli.test.cjs`	✅ Design	—
`uses injected readFileSync for testability`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false when injected reader returns only diagnostics`	`safeoutputs_cli.test.cjs`	✅ Design	—
`returns false when injected readFileSync throws`	`safeoutputs_cli.test.cjs`	✅ Design	—

Go: 0 (*_test.go); JavaScript: 17 (*.test.cjs). No other languages detected.

Score breakdown: behavioral 40/40 + edge-case 22.9/30 + low-duplication 15/20 + proportional-growth 0/10 = 78

⚠️ Flagged Tests — Minor Issues (1)

Duplicate cluster (safeoutputs_cli.test.cjs) — returns false when only noop, returns false when only missing_tool, and returns false when only report_incomplete share the same assertion structure (makeTempFile(single-type-JSONL) → toBe(false)) with trivially different type constants. The preceding test "returns false when all entries are diagnostic types" already covers all three together. Each individual guard is defensible — a regression removing one type from DIAGNOSTIC_SAFE_OUTPUT_TYPES would be caught by exactly one test — but consider a parameterized/table-driven approach if the diagnostic types list grows.

Test inflation — Both test files exceed the 2:1 lines-added ratio: copilot_harness.test.cjs (66 test lines / 10 prod lines = 6.6×) and safeoutputs_cli.test.cjs (133/53 = 2.5×). This is contextually expected: the two integration tests require full process scaffolding (temp dirs, stub scripts, spawnSync invocations) that naturally inflates line count relative to the compact production guard. Not actionable.

Verdict

✅ Check passed. 0% implementation tests (threshold: 30%). All 17 new tests verify observable behavior. The two integration tests exercise the full harness flow end-to-end using real process spawning and filesystem I/O — no mocks. Unit tests cover all boundary conditions for hasExpectedSafeOutputs including missing files, empty paths, malformed JSON, and dependency injection.

🧪 Test quality analysis by Test Quality Sentinel · 109.4 AIC · ⌖ 23.8 AIC · ⊞ 8.4K · ◷

github-actions

✅ Test Quality Sentinel: 78/100. Test quality is acceptable — 0% of new tests are implementation tests (threshold: 30%).

Initial plan

08f7c51

Copilot AI assigned Copilot and pelikhan Jun 26, 2026

Copilot started work on behalf of pelikhan June 26, 2026 12:06 View session

Copilot AI changed the title ~~[WIP] Fix Copilot CLI false-red on successful safe-outputs~~ fix(harness): exit 0 when expected safe-outputs already produced despite numerous permission-denied Jun 26, 2026

Copilot AI requested a review from pelikhan June 26, 2026 12:24

Copilot finished work on behalf of pelikhan June 26, 2026 12:24

pelikhan marked this pull request as ready for review June 26, 2026 12:25

Copilot AI review requested due to automatic review settings June 26, 2026 12:25

Copilot started reviewing on behalf of pelikhan June 26, 2026 12:26 View session

Copilot AI reviewed Jun 26, 2026

View reviewed changes

github-actions Bot mentioned this pull request Jun 26, 2026

[aw] No-Op Runs #39849

Open

github-actions Bot reviewed Jun 26, 2026

View reviewed changes

github-actions Bot approved these changes Jun 26, 2026

View reviewed changes

pelikhan merged commit a865589 into main Jun 26, 2026
53 of 61 checks passed

pelikhan deleted the copilot/aw-failures-fix-copilot-cli branch June 26, 2026 13:01

github-actions Bot mentioned this pull request Jun 26, 2026

[PR Triage Report] PR Triage Report — 2026-06-26T18:51Z [§28258543430] #41779

Closed

Uh oh!

Conversation

Copilot AI commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Fix

Changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Review details

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Skills-Based Review 🧠

Observations

Positive Highlights

Uh oh!

github-actions Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Blocking

Non-blocking (2 comments)

Uh oh!

github-actions Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 26, 2026

🧪 Test Quality Sentinel Report

Verdict

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Jun 26, 2026 •

edited

Loading

github-actions Bot commented Jun 26, 2026 •

edited

Loading

github-actions Bot commented Jun 26, 2026 •

edited

Loading

github-actions Bot commented Jun 26, 2026 •

edited

Loading

github-actions Bot commented Jun 26, 2026 •

edited

Loading