Probe GPU device count out-of-process on Windows (#988 follow-up) #994
Merged
…it crashes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tested both in a replicated scenario and on the failing setup; both work as expected.
Probe GPU device count out-of-process on Windows (#988 follow-up)
Background
#988 was reported as an uncaught `SIGSEGV` when GPU init fails, killing CPU fallback. The fix in #989 (0.10.3) wrapped the `get_device_count()` probe in an in-process signal guard (`gpu_run_guarded`): catch the fault with a handler, `siglongjmp` out, and treat it as "no device" so AUTO mode falls back to CPU.

That stopped the crash but introduced a new failure on Windows: the process would no longer crash; instead it would silently exit later, during model load (reported here). `--gpu disable` worked, `--gpu auto` did not.

Root cause
On Windows the Vulkan probe fault isn't a plain segfault: it's a C++/SEH exception thrown by the GPU driver during instance init, surfaced across the `cosmo_dlopen`/`ms_abi` boundary as `SIGSEGV` (the strace shows the `0xE06D7363` "msc" exception code). `siglongjmp`-ing out of an in-flight foreign exception skips the unwind and leaves the C++/SEH runtime and the partially-initialized driver in a corrupted state. The damage is latent and only bites later, under the memory-heavy model load, as a silent termination.

So catching the fault in-process and continuing is fundamentally unsafe here: once the foreign exception has been thrown, the process is already compromised.
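
For reference, the catch-and-`siglongjmp` guard pattern described above can be sketched in plain POSIX C. This is only an illustration of the general technique, not the project's `gpu_run_guarded`; the names `run_guarded`, `on_fault`, `probe_ok` and `probe_faults` are hypothetical. It behaves well for an ordinary signal raised on POSIX, which is exactly the case that does not apply when the "signal" is really a foreign exception in mid-flight.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-process guard; not the project's code. */
static sigjmp_buf guard_env;

static void on_fault(int sig) {
    (void)sig;
    siglongjmp(guard_env, 1);            /* jump back into run_guarded() */
}

/* Calls probe(); returns its result, or -1 if it raised SIGSEGV. */
static int run_guarded(int (*probe)(void)) {
    struct sigaction sa, old;
    volatile int result = -1;            /* stays -1 on the fault path */

    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old);

    /* Second argument 1: save the signal mask so siglongjmp restores it. */
    if (sigsetjmp(guard_env, 1) == 0)
        result = probe();

    sigaction(SIGSEGV, &old, NULL);      /* restore previous handler on both paths */
    return result;
}

/* Two toy probes: one reports two devices, one faults. */
static int probe_ok(void)     { return 2; }
static int probe_faults(void) { raise(SIGSEGV); return 0; }
```

With these toy probes, `run_guarded(probe_ok)` yields the count and `run_guarded(probe_faults)` yields -1, which a caller could treat as "no device". Nothing in this sketch unwinds C++ or SEH frames, which is the gap the root-cause analysis points at.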
Why not just "print a message and exit"?
We considered detecting the fault and exiting with a "re-run with `--gpu disable`" message. Live testing on a Windows VM showed that's too aggressive: a probe fault is the common, survivable no-GPU case (any machine with `vulkan-1.dll` present but no valid ICD: headless servers, fresh installs, VMs). Both a 0.8B and a 4.5B model loaded fine after a caught "shallow" fault. We can't distinguish a survivable catch from a corrupting one at runtime, so exiting would break working CPU fallback for that whole class of users.

This change
Run the device-count probe in a short-lived child process (a re-exec of the binary), on Windows only:
- `__attribute__((constructor))` in `gpu_backend.c` (linked into every binary via `gpu.a`) checks two env vars; if set it `cosmo_dlopen`s the DSO, calls the count symbol, and `_Exit()`s with the device count. Crash signals are converted to a clean `_Exit(255)`: no `longjmp`, no unwind, nothing to corrupt.
- `gpu_backend_probe_oop()` `posix_spawn`s the child with a private envp, then maps its exit code: `1..253` → available, `0` → no devices, `254`/`255`/signaled → crashed. Same user-facing log messages as before.
- `cuda.c` / `vulkan.c` dispatch `IsWindows() ? gpu_backend_probe_oop(b) : gpu_backend_probe(b)`.

A crash now dies in the child; the parent never executes the faulting driver-init code for a device-less backend, so CPU fallback stays clean regardless of how deep the fault is. Linux/macOS keep the existing in-process guard unchanged.
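
The exit-code protocol above can be sketched in portable POSIX C. This is a minimal illustration under stated assumptions, not the code in `gpu_backend.c`: every name here (`probe_status`, `count_to_exit_code`, `install_crash_handlers`, `map_wait_status`, `probe_out_of_process`) is hypothetical, clamping large counts to 253 and treating a failed spawn as a crash are assumptions, and the real change re-execs the binary and loads the DSO via `cosmo_dlopen`, which is not reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <spawn.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

enum probe_status { PROBE_AVAILABLE, PROBE_NO_DEVICES, PROBE_CRASHED };

/* Child side: encode a device count as an exit code.
 * Assumption: counts above 253 are clamped so 254/255 stay reserved. */
static int count_to_exit_code(int count) {
    if (count <= 0) return 0;
    return count > 253 ? 253 : count;
}

/* Child side: turn a crash signal into a clean exit. _Exit() is
 * async-signal-safe; there is no longjmp and no unwinding. */
static void crash_to_exit(int sig) {
    (void)sig;
    _Exit(255);
}

static void install_crash_handlers(void) {
    signal(SIGSEGV, crash_to_exit);
    signal(SIGILL, crash_to_exit);
    signal(SIGFPE, crash_to_exit);
    signal(SIGABRT, crash_to_exit);
}

/* Parent side: map a waitpid() status onto the three outcomes.
 * *count is written only when devices are available. */
static enum probe_status map_wait_status(int status, int *count) {
    if (WIFEXITED(status)) {
        int code = WEXITSTATUS(status);
        if (code == 0) return PROBE_NO_DEVICES;
        if (code <= 253) { *count = code; return PROBE_AVAILABLE; }
    }
    return PROBE_CRASHED;                /* 254, 255, or killed by a signal */
}

/* Parent side: spawn the probe child with a private envp and wait for it.
 * Assumption: a spawn or wait failure is reported as a crash. */
static enum probe_status probe_out_of_process(char *const argv[],
                                              char *const envp[],
                                              int *count) {
    pid_t pid;
    int status;
    if (posix_spawn(&pid, argv[0], NULL, NULL, argv, envp) != 0)
        return PROBE_CRASHED;
    if (waitpid(pid, &status, 0) != pid)
        return PROBE_CRASHED;
    return map_wait_status(status, count);
}
```

In this sketch the child would call `install_crash_handlers()` first, run the probe, and finish with `_Exit(count_to_exit_code(n))`. The parent only ever inspects an exit status, so whatever state the driver left behind is discarded together with the child.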
Verification
- `gpu_backend_test`: all 23 checks pass (in-process path untouched).
- `vulkan: Vulkan crashed during device probe; trying next backend` (caught in the child)
- `model loaded` → `server is listening` → `/health: ok`
- `POST /completion "The capital of Poland is"` → `\boxed{Warsaw}`.

Notes / trade-offs
- […] `--gpu auto`. Per-backend isolation is required so one backend's crash can't take down another's result.
- […] `gpu_run_guarded`) is a straightforward follow-up if desired.

Fixes the silent-exit follow-up to #988.