fix(host-sweep): grace period for freshly-woken containers with stale processing claims#2736

Merged: gavrielc merged 1 commit into main from fix/host-sweep-wake-grace on Jun 11, 2026

@gavrielc

Collaborator

Type of Change

  • Feature skill - adds a channel or integration (source code changes + SKILL.md)
  • Utility skill - adds a standalone tool (code files in .claude/skills/<name>/, no source changes)
  • Operational/container skill - adds a workflow or agent skill (SKILL.md only, no source changes)
  • Fix - bug fix or security fix to source code
  • Simplification - reduces or simplifies source code
  • Documentation - docs, README, or CONTRIBUTING changes only

Description

What: The host-sweep tick that wakes a container for due messages no longer runs the running-container SLA check in that same iteration. A justWoke flag carries from the wake step (sweep step 2) into the SLA gate (step 3).

Why: A container that crashes mid-turn leaves stale processing_ack rows in outbound.db. The agent-runner clears those on startup (clearStaleProcessingAcks), but the sweep's wake and SLA check ran back-to-back in one tick, before that cleanup had a chance to run. The sequence: wake spawns a fresh container, isContainerRunning now reports it alive, the per-claim stuck rule sees the hours-old claim with no heartbeat movement, and the host SIGKILLs the container it spawned milliseconds earlier. The result is a spawn-kill loop that pins the session. This is adjacent to, but distinct from, the orphan-claim cleanup in resetStuckProcessingRows: that path only fires after the host itself kills a container, so claims left by a container that died on its own (OOM, crash) still hit the wake-tick race.
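The race can be reproduced in a toy model of the pre-fix tick. This is only a sketch: ToySession, preFixTick, runPreFixTicks and the STUCK_AFTER_MS threshold are illustrative names and values, not taken from src/host-sweep.ts, and the real stuck rule also considers heartbeat movement.

```typescript
// Toy model of the pre-fix sweep tick. Every name and the threshold below are
// illustrative assumptions, not the project's real code or configuration.
interface ToySession {
  running: boolean;
  dueMessages: number;
  claimAgeMs: number | null; // age of the oldest processing claim, null if none
}

const STUCK_AFTER_MS = 60 * 60 * 1000; // assumed threshold for "stale"

function preFixTick(s: ToySession, events: string[]): void {
  // Step 2: wake a container when messages are due and none is running.
  if (s.dueMessages > 0 && !s.running) {
    s.running = true;
    events.push("spawn");
  }
  // Step 3: the SLA check runs in the same tick, so it sees the claim left
  // behind by the previous crash before startup cleanup can clear it.
  if (s.running && s.claimAgeMs !== null && s.claimAgeMs > STUCK_AFTER_MS) {
    s.running = false;
    events.push("kill");
  }
}

function runPreFixTicks(ticks: number): string[] {
  // One due message plus a 2h-old claim inherited from a crashed container.
  const s: ToySession = {
    running: false,
    dueMessages: 1,
    claimAgeMs: 2 * 60 * 60 * 1000,
  };
  const events: string[] = [];
  for (let i = 0; i < ticks; i++) preFixTick(s, events);
  return events;
}
```

In this model every tick emits a spawn immediately followed by a kill, and the due message is never consumed, which is the loop described above.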

How: Three small edits inside sweepSession (src/host-sweep.ts): declare let justWoke = false after the due-message count, set it to true after await wakeContainer(session), and gate the SLA check with && !justWoke. The grace lasts exactly one tick. The next sweep (60s later) enforces the ceiling and per-claim rules normally, so a genuinely stuck container is still killed.
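A minimal sketch of the gated tick, under stated assumptions: only justWoke, wakeContainer and clearStaleProcessingAcks are names from this PR. SweepState, sweepTick, simulate and the boolean hasStaleClaim are hypothetical stand-ins, and the real sweepSession is async and awaits wakeContainer(session). The sketch is kept synchronous so it runs on its own.

```typescript
// Hypothetical state shape; the real sweepSession reads from session DBs.
interface SweepState {
  running: boolean;
  dueMessages: number;
  hasStaleClaim: boolean; // stands in for an hours-old processing_ack row
}

type SweepEvent = "wake" | "kill";

function sweepTick(state: SweepState): SweepEvent[] {
  const events: SweepEvent[] = [];
  let justWoke = false; // declared after the due-message count in the real code

  // Step 2: wake for due messages.
  if (state.dueMessages > 0 && !state.running) {
    state.running = true; // stands in for `await wakeContainer(session)`
    justWoke = true;
    events.push("wake");
  }

  // Step 3: SLA gate, skipped for the tick that performed the wake.
  if (state.running && !justWoke && state.hasStaleClaim) {
    state.running = false; // stands in for the host killing the container
    events.push("kill");
  }
  return events;
}

// Runs two ticks. `startupClearsClaim` models whether the fresh container
// managed to run its startup cleanup (clearStaleProcessingAcks) in between.
function simulate(startupClearsClaim: boolean): SweepEvent[][] {
  const state: SweepState = { running: false, dueMessages: 1, hasStaleClaim: true };
  const first = sweepTick(state);
  if (startupClearsClaim) state.hasStaleClaim = false;
  const second = sweepTick(state);
  return [first, second];
}
```

With simulate(true) the wake tick does not kill and the second tick finds nothing stale. With simulate(false) the claim is still stale one tick later and the container is killed, matching the one-tick grace described above.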

Tested: New guard test src/host-sweep-grace.test.ts drives the real sweep loop against a real central DB and on-disk session DBs (only the container runner is mocked), seeding a due message plus a 2h-old processing claim: the wake tick must not kill, and a later tick with the claim still stale must kill claim-stuck. Removing the !justWoke gate turns the test red. Full validation: 350 host tests pass (pnpm test), build/typecheck/lint at baseline, 101 container tests pass (bun test).

🤖 Generated with Claude Code

… processing claims

The sweep tick that wakes a container for due messages also ran the
running-container SLA check in the same iteration. A fresh container that
inherits stale processing_ack rows from a previous crash hasn't had a chance
to run its startup cleanup (clearStaleProcessingAcks) yet, so the per-claim
stuck rule saw an hours-old claim, concluded the just-spawned container was
stuck, and SIGKILL'd it — an immediate spawn-kill loop.

Carry a justWoke flag from the wake step into the SLA gate and skip the
check for that one tick. The next tick (60s later) enforces the SLA
normally, so a genuinely stuck container is still killed.

Guarded by src/host-sweep-grace.test.ts, which drives two real sweep ticks
against on-disk session DBs: the wake tick must not kill, a later tick with
the claim still stale must kill claim-stuck.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@gavrielc gavrielc requested a review from gabi-simons as a code owner June 11, 2026 10:56
@github-actions github-actions Bot added the follows-guidelines (PR was created using the current contributing template) and PR: Fix (Bug fix) labels Jun 11, 2026
@gavrielc gavrielc merged commit 83951d7 into main Jun 11, 2026
2 checks passed
@gavrielc gavrielc deleted the fix/host-sweep-wake-grace branch June 11, 2026 16:41