From: "kjtsanaktsidis (KJ Tsanaktsidis) via ruby-core" <ruby-core@...>
Date: 2023-02-13T01:01:36+00:00
Subject: [ruby-core:112390] [Ruby master Feature#19322] Support spawning "private" child processes

Issue #19322 has been updated by kjtsanaktsidis (KJ Tsanaktsidis).


> Also this new API wouldn't be adopted before a very long time by the many usages of Kernel#spawn/etc (far more than usages of Process.waitall).

This seems like a fairly compelling argument actually. You're right, I want to write some code that I can use safely in Unicorn, but there's probably a large body of existing code calling `Process#spawn` et al that people are also running inside forking servers and is "mostly working". So maybe I should have a look at solving this on the Unicorn side rather than the Ruby side.

I wonder, then, if we should aim to _deprecate_ `Process.waitall` and friends, given that it can intefere with standard library modules like open3.

> * There is Process.detach(pid). That creates one thread per pid, if that's too much overhead one could make their own with WNOHANG and sleep, then it's just 1 extra thread.
> * Alternatively, just do the Process.wait fork {} in a thread, and it's even simpler and easier to handle a child process terminating.

I think it'd be nice to avoid having to make threads just to manage subprocesses through blocking API calls if possible

> * The pipe trick, I suppose this could be used with one pipe per pid, then it's also easy to detect which process is ready to be waited on.

Could definitely work, and seems like the best solution for Unicorn specifically, all things considered.

> * Isn't one of the main points of process groups to deal with such a case? I guess we'd need to place the forks in a new process group and then we could wait on that whole process group (Process.wait -group). Maybe with an extra child process in between to setup the process group or so.

This was actually quite interesting, I had a look at this. The extra child process would need to be a sibling to the processes to be waited on (because the parent can only wait on children, not grandchildren). Then, when forking processes, each fork could call `setpgid(2)` to move itself into that sibling's process group; then, as you say, `Process.wait -group` could be used to wait for only those subprocesses that joined the group.

Unfortunately there's a race condition - a fork could crash after forking and before calling `setpgid(2)`, in which case it would never get reaped. 

> * Maybe io_uring or similar API can wait for any of multiple processes to terminate? 

Linux and FreeBSD have pidfds which can do this, and I think `WaitForMultipleObjects` can do this on Windows, but I couldn't find any equivalent for MacOS or OpenBSD unfortunately.


----------------------------------------
Feature #19322: Support spawning "private" child processes
https://bugs.ruby-lang.org/issues/19322#change-101832

* Author: kjtsanaktsidis (KJ Tsanaktsidis)
* Status: Open
* Priority: Normal
----------------------------------------
## Background

The traditional Unix process APIs (`fork` etc) are poorly isolated. If a library spawns a child process, this is not transparent to the program using the library. Any signal handler for `SIGCHLD` in the program will be called when the spawned process exits, and even worse, if the parent calls `Process.waitpid2(-1)`, it will consume the returned status code, stealing it from the library!

Unfortunately, the practice of responding to `SIGCHLD` by calling `waitpid2(-1)` in a loop is a pretty common unixism. For example, Unicorn does it [here](https://yhbt.net/unicorn.git/tree/lib/unicorn/http_server.rb#n401). In short, there is no reliable way for a gem to spawn a child process in a way that can���t (unintentionally) be interfered with by other parts of the program.

## Problem statement

Consider the following program.

```ruby
# Imagine this part of the program is in some top-level application event loop
# or something - similar to how Unicorn works. It detects child processes exiting
# and takes some action (possibly restarting a crashed worker, for example).
Signal.trap(:CHLD) do
  loop do
    begin
      pid, status = Process.waitpid2 -1
      puts "Signal handler reaped #{pid} #{status.inspect}"
    rescue Errno::ECHILD
      puts "Signal handler reaped nothing"
      break
    end
  end
end

# Imagine that _this_ part of the program is buried deep in some gem. It knows
# nothing about the application SIGCHLD handling, and quite possibly the application
# author might not even know this gem spawns a child process to do its work!
require 'open3'
loop do
  o, status = Open3.capture2("/bin/sh", "-c", "echo 'hello'")
  puts "ran command, got #{o.chomp} #{status.inspect}"
end
```

In current versions of Ruby, _some_ loop iterations will function correctly, and print something like this. The gem gets the `Process::Status` object from its command and can know if e.g. it exited abnormally.

```
ran command, got ohaithar #<Process::Status: pid 1153687 exit 0>
Signal handler reaped nothing
```

However, other iterations of the loop print this. The signal handler runs and calls `Process.waitpid2(-1)` before the code in open3 can do so. Then, the gem code does not get a `Process::Status` object! This is also potentially bad for the application; it reaped a child process it didn't even know existed, and it might cause some surprising bugs if the application author didn't know this was a possibility.

```
Signal handler reaped 1153596 #<Process::Status: pid 1153596 exit 0>
Signal handler reaped nothing
ran command, got ohaithar nil
```

We would like a family of APIs which allow a gem to spawn a child process and guarantees that the gem can wait on it. Some concurrent call to `Process.waitpid2(-1)` (or even `Process.waitpid2($some_lucky_guess_for_the_pid)`) should not steal the status out from underneath the code which created the process. Ideally, we should even suppress the `SIGCHLD` signal to avoid the application signal handler needlessly waking up.


## Proposed Ruby-level APIs.

I propose we create the following new methods in Ruby.

* `Process.spawn_private`
* `Process.fork_private`

These methods behave identically to their non-_private versions in all respect, except instead of returning a pid, they return an object of type `Process::PrivateHandle`.

`Process::PrivateHandle` would have the following methods:

* `pid()` - returns the pid for the created process
* `wait()` - blocks the caller until the created process has exited, and returns a `Process::Status` object. If the handle has _already_ had `#wait` called on it, it returns the same `Process::Status` object as was returned then immediately. This is unlike `Process.waitpid` and friends, which would raise an ECHILD in this case (or, in the face of pid wraparound, potentially wait on some other totally unrelated child process with the same pid).
* `wait_nonblock()` - if the created process has exited, behaves like `#wait`; otherwise, it returns a `Process::Status` object for which `#exited?` returns false.
* `kill(...)` - if the created process has not been reaped via a call to `#wait`, performs identically to `Process.kill ..., pid`. Otherwise, if the process _has_ been reaped, raises `Errno::ESRCH` immediately without issuing a system call. This ensures that, if pids wrap around, that the wrong process is not signaled by mistake.

A call to `Process.wait`, `Process.waitpid`, or `Process.waitpid2` will _never_ return a `Process::Status` for a process started with a `_private` method, even if that call is made with the pid of the child process. The _only_ way to reap a private child process is through `Process::PrivateHandle`.

The implementation of `IO.popen`, `Kernel#system`, `Kernel#popen`, backticks, and the `Open3` module would be changed to use this private process mechanism internally, although they do not return pids so they do not need to have their interfaces changed. (note though - I don't believe `Kernel#system` suffers from the same problem as the `open3` example above, because it does not yield the GVL nor check interrupts in between spawning the child and waiting on it)

## Implementation strategy

I believe this can be implemented, in broad strokes, with an approach like this:

* Keep a global table mapping pids -> handles for processes created with `fork_private` or `spawn_private`.
* When a child process is waited on, consult the handle table. If there is a handle registered, and the wait call was made without the handle, do NOT return the reaped status. Instead, save the status against the handle, and repeat the call to `waitpid`.
* If the wait call _was_ made with the handle, we can return the 
* Once a handle has had the child status saved against it, it is removed from the table.
* A subsequent call to wait on that pi the handle will look up the saved information and return it without making a system call.

In fact, most of the infrastructure to do this correctly is already in place - it was added by @k0kubun and @normalperson four years ago - https://bugs.ruby-lang.org/issues/14867. MJIT had a similar problem to the one described in this issue; it needs to fork a C compiler, but if the application performs a `Process.waitpid2(-1)`, it could wind up reaping the gcc process out from underneath mjit. This code has changed considerably over the course of last year, but my understanding is that mjit still uses this infrastructure to protect its Ruby child-process from becoming visible to Ruby code.

In any case, the way waitpid works _currently_, is that...

* Ruby actually does all calls to `waitpid` as `WNOHANG` (i.e. nonblocking) internally.
* If a call to `waitpid` finds no children, it blocks the thread, representing the state in a structure of type `struct waitpid_state`.
* Ruby also keeps a list of all `waitpid_state`'s that are currently being waited for, `vm->waiting_pids` and `vm->waiting_grps`.
* These structures are protected with a specific mutex, `vm->waitpid_lock`.
* Ruby internally uses the SIGCHLD signal to reap the dead children, and then find a waiting call to `waitpid` (via the two lists) to actually dispatch the reaped status to. 
* If some caller is waiting for a specific pid, that _always_ takes priority over some other caller that's waiting for a pid-group (e.g. `-1`).

mjit's child process is protected, because:

* When mjit forks, it uses a method `rb_mjit_fork` to do so.
* That calls the actual `fork` implementation _whilst still holding_ `vm->waitpid_lock`
* Before yielding the lock, it inserts an entry in `vm->waiting_pids` saying that mjit is waiting for the just-created child.
* Since direct waits for pids always take precedence over pid-groups, this ensures that mjit will always reap its own children.

I believe this mechanism can be extended and generalised to power the proposed API, and mjit could itself use that rather than having mjit-specific handling in `process.c`.

## POC implementation

I sketched out a _very_ rough POC to see if what I said above would be possible, and I think it is:

https://github.com/ruby/ruby/commit/6009c564b16862001535f2b561f1a12f6e7e0c57

The following script behaves how I expect with this patch:

```ruby
pid, h = Process.spawn_private "/bin/sh", "-c", "sleep 1; exit 69"
puts "pid -> #{pid}"
puts "h -> #{h}"

# should ESRCH.
sleep 2
begin
    Process.waitpid2 -1
rescue => e
    puts "waitpid err -> #{e}"
end
wpid, status = h.wait
puts "wpid -> #{wpid}"
puts "status -> #{status.inspect}"
```

```
ktsanaktsidis@lima-linux1 ruby % ./tool/runruby.rb -- ./tst1.rb
pid -> 1154105
h -> #<Process::PrivateHandle:0x0000ffff94014098>
waitpid err -> No child processes
wpid -> 1154105
status -> #<Process::Status: pid 1154105 exit 4>
```

The child process can be waited on with the handle, and the call to `waitpid2(-1)` finds nothing.

## Previous idea: OS-specific handles

My first version of this proposal involved a similar API, but powering it with platform-specific concepts available on Linux, Windows, and FreeBSD which offer richer control than just pids & the `wait` syscall. In particular, I had believed that we could use the `clone` syscall in Linux to create a child process which:

* Could be referred to by a unique file descriptor (a pidfd) which would be guaranteed never to be re-used (unlike a pid),
* Would not generate a signal when it exited (i.e. no SIGCHLD).
* Could not be waited on by an unspecting to `waitpid` (except if a special flag `__WCLONE` as passed).

Unfortunately, when I tried to implement this, I ran into a pretty serious snag. It is possible to create such a process - BUT, when the process exec's, it goes _back_ to "raise-SIGCHLD-on-exit" and "allow-waiting-without-__WCLONE" modes. I guess this functionality in the clone syscall is really designed to power threads in Linux, rather than being a general-purpose "hidden process" API.

So, I don't think we should use pidfds in this proposal.

## Motivation

My use-case for this is that I���m working on a perf-based profiling tool for Ruby. To get around some Linux capability issues, I want my profiler gem (or CRuby patch, whatever it winds up being!) to fork a privileged helper binary to do some eBPF twiddling. But, if you���re profiling e.g. a Unicorn master process, the result of that binary exiting might be caught by Unicorn itself, rather than my (gem | interpreter feature).

In my case, I'm so deep in linux specific stuff that just calling `clone(2)` from my extension is probably fine, but I had enough of a look at this process management stuff I thought it would be worth asking the question if this might be useful to other, more normal, gems.




-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/