Probe GPU device count out-of-process on Windows (#988 follow-up)#994

Merged: aittalam merged 2 commits into main from fix-missing-cpu-fallback on Jun 5, 2026

@aittalam aittalam commented Jun 4, 2026

Probe GPU device count out-of-process on Windows (#988 follow-up)

Background

#988 was reported as an uncaught SIGSEGV when GPU init fails, killing CPU fallback. The fix in #989 (0.10.3) wrapped the get_device_count() probe in an in-process signal guard (gpu_run_guarded): catch the fault with a handler, siglongjmp out, and treat it as "no device" so AUTO mode falls back to CPU.

That stopped the crash but introduced a new failure on Windows: instead of crashing, the process silently exited later, during model load (reported here). --gpu disable worked, --gpu auto did not.

Root cause

On Windows the Vulkan probe fault isn't a plain segfault — it's a C++/SEH exception thrown by the GPU driver during instance init, surfaced across the cosmo_dlopen/ms_abi boundary as SIGSEGV (the strace shows the 0xE06D7363 "msc" exception code). siglongjmp-ing out of an in-flight foreign exception skips the unwind and leaves the C++/SEH runtime and the partially-initialized driver in a corrupted state. The damage is latent and only bites later, under the memory-heavy model load, as a silent termination.
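As a side note on the exception code: the low three bytes of 0xE06D7363 are the ASCII characters 'm', 's', 'c', which is where the "msc" label comes from. A small helper (hypothetical name `exception_tag`, written only to show the arithmetic) makes that explicit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies the low three bytes of an exception code
 * into out as a NUL-terminated ASCII string. */
void exception_tag(uint32_t code, char out[4]) {
    assert(out != NULL);
    out[0] = (char)((code >> 16) & 0xFF); /* 0x6D = 'm' */
    out[1] = (char)((code >> 8) & 0xFF);  /* 0x73 = 's' */
    out[2] = (char)(code & 0xFF);         /* 0x63 = 'c' */
    out[3] = '\0';
}
```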

So catching the fault in-process and continuing is fundamentally unsafe here: once the foreign exception has been thrown, the process is already compromised.

Why not just "print a message and exit"?

We considered detecting the fault and exiting with a "re-run with --gpu disable" message. Live testing on a Windows VM showed that's too aggressive: a probe fault is the common, survivable no-GPU case (any machine with vulkan-1.dll present but no valid ICD — headless servers, fresh installs, VMs). Both a 0.8B and a 4.5B model loaded fine after a caught "shallow" fault. We can't distinguish a survivable catch from a corrupting one at runtime, so exiting would break working CPU fallback for that whole class of users.

This change

Run the device-count probe in a short-lived child process (a re-exec of the binary), on Windows only:

  • A __attribute__((constructor)) in gpu_backend.c (linked into every binary via gpu.a) checks two env vars; if set, it cosmo_dlopens the DSO, calls the count symbol, and _Exit()s with the device count. Crash signals are converted to a clean _Exit(255): no longjmp, no unwind, nothing to corrupt.
  • gpu_backend_probe_oop() posix_spawns the child with a private envp, then maps its exit code: 1..253 → available, 0 → no devices, 254/255/signaled → crashed. Same user-facing log messages as before.
  • cuda.c / vulkan.c dispatch IsWindows() ? gpu_backend_probe_oop(b) : gpu_backend_probe(b).
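
The exit-code protocol in the list above can be sketched as follows. This is an illustration under assumptions, not the code in gpu_backend.c: it uses `fork()` instead of a re-exec so it stays self-contained, the set of trapped signals is a guess, clamping counts to 253 is an assumption, and all names (`classify_status`, `child_main`, `probe_in_child`, the `fake_*` probes) are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum probe_result { PROBE_NO_DEVICES, PROBE_AVAILABLE, PROBE_CRASHED };

/* Parent side: map a waitpid() status using the mapping from the PR text:
 * 1..253 -> available, 0 -> no devices, 254/255/signaled -> crashed. */
enum probe_result classify_status(int status) {
    if (!WIFEXITED(status)) return PROBE_CRASHED; /* killed by a signal */
    int code = WEXITSTATUS(status);
    if (code == 0) return PROBE_NO_DEVICES;
    if (code <= 253) return PROBE_AVAILABLE;
    return PROBE_CRASHED; /* 254 or 255 */
}

/* Child side: turn a crash signal into a clean exit. No longjmp, no unwind. */
static void crash_to_exit(int sig) {
    (void)sig;
    _Exit(255);
}

/* Child side: run the probe and report the count as the exit code. */
static void child_main(int (*probe)(void)) {
    /* Which signals the real code traps is not stated; this set is a guess. */
    const int sigs[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGABRT };
    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++)
        signal(sigs[i], crash_to_exit);
    int n = probe();
    if (n < 0) n = 0;     /* assumption: treat errors as "no devices" */
    if (n > 253) n = 253; /* assumption: keep 254/255 reserved for failures */
    _Exit(n);
}

/* Demo transport: fork() stands in for the re-exec'd child process. */
enum probe_result probe_in_child(int (*probe)(void)) {
    assert(probe != NULL);
    pid_t pid = fork();
    if (pid < 0) return PROBE_CRASHED;
    if (pid == 0) child_main(probe); /* never returns */
    int status;
    if (waitpid(pid, &status, 0) < 0) return PROBE_CRASHED;
    return classify_status(status);
}

/* Stand-in probes (assumptions for illustration). */
int fake_two_devices(void) { return 2; }
int fake_no_devices(void)  { return 0; }
int fake_crash(void)       { raise(SIGSEGV); return 0; }
```

Here `probe_in_child(fake_crash)` yields `PROBE_CRASHED` while the calling process continues untouched, which is the property the bullets above rely on.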

A crash now dies in the child; the parent never executes the faulting driver-init code for a device-less backend, so CPU fallback stays clean regardless of how deep the fault is. Linux/macOS keep the existing in-process guard unchanged.
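
A rough sketch of the parent-side transport, assuming a POSIX host: spawn a child with a private `envp`, wait for it, and read back the exit code. The PR re-execs the binary itself and does not name its two env vars; here `/bin/sh` stands in for the child and `PROBE_FAKE_COUNT` is an invented variable, so treat both as placeholders. `spawn_and_wait` and `demo_exit_code` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Spawns argv[0] with exactly the environment in envp. Returns the raw
 * waitpid() status, or -1 if the spawn or the wait failed. */
int spawn_and_wait(char *const argv[], char *const envp[]) {
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid;
    if (posix_spawn(&pid, argv[0], NULL, NULL, argv, envp) != 0)
        return -1;
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1; /* retry only on interruption */
    }
    return status;
}

/* Demo: the shell exits with the value of the invented PROBE_FAKE_COUNT
 * variable, or 254 if the variable did not reach the child. */
int demo_exit_code(void) {
    char *argv[] = { "/bin/sh", "-c", "exit ${PROBE_FAKE_COUNT:-254}", NULL };
    char *envp[] = { "PROBE_FAKE_COUNT=3", NULL }; /* private environment */
    int status = spawn_and_wait(argv, envp);
    return (status >= 0 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Passing an explicit `envp` means the child sees only the variables the parent chose, so the probe trigger cannot leak in from, or out to, the user's environment.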

Verification

  • gpu_backend_test — all 23 checks pass (in-process path untouched).
  • Child transport validated on host (missing DSO/symbol → 254, count → exit N, crash → 255).
  • Windows VM with a deliberately broken Vulkan ICD (real driver-probe crash) + CUDA hidden, Bielik-4.5B-Q8 on CPU:
    • vulkan: Vulkan crashed during device probe; trying next backend (caught in the child)
    • model loaded → server is listening → /health: ok
    • POST /completion "The capital of Poland is" → \boxed{Warsaw}.
    • No silent exit (NOTE that the silent exit was not replicable on our setup anyway).

Notes / trade-offs

  • Startup cost (Windows): up to 2 extra process spawns (CUDA + Vulkan probes) re-exec'ing the APE at startup with --gpu auto. Per-backend isolation is required so one backend's crash can't take down another's result.
  • On the success path (≥1 device) the parent still runs driver init when it registers/uses the backend — safe, because the child already proved init succeeds.
  • Scope is Windows-only by design; extending the OOP probe to all platforms (retiring gpu_run_guarded) is a straightforward follow-up if desired.
  • Metal is unaffected (compiled at runtime, no device-count gate).

Fixes the silent-exit follow-up to #988.

…it crashes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aittalam commented Jun 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aittalam commented Jun 5, 2026

Tested both in a replicated scenario and on the failing setup - both work as expected.

@aittalam aittalam merged commit b1b7814 into main Jun 5, 2026
2 checks passed
@aittalam aittalam deleted the fix-missing-cpu-fallback branch June 5, 2026 16:09