first attempt at centralizing more global error logging state in ExceptionSink#6552
Conversation
stuhood
left a comment
There was a problem hiding this comment.
Thanks... I think the biggest blocker to getting that test passing will be the fact that the signal handler will not currently write to the exception log. Doing that will require a bit of refactoring.
| class ExceptionSinkError(Exception): pass | ||
|
|
||
| # TODO (in response to the above): we can avoid a single entry point and make everything work with | ||
| # great ease if we can ensure all set_* methods are idempotent! |
There was a problem hiding this comment.
I agree that making these methods idempotent would be good.
| def set_destination(cls, dir_path): | ||
| cls._destination = dir_path | ||
| cls._destination = cls._check_or_create_new_destination(dir_path) | ||
| cls._all_destinations.append(cls._destination) |
There was a problem hiding this comment.
I don't think a history of destinations is necessarily useful, so would drop this property.
There was a problem hiding this comment.
After thinking about the use case of RemotePantsRunner a bit, I think I agree, because if we're going to want to dive into any exception logging like we do there we'll probably know where to look instead of wanting to dredge up all our possible log files (?).
However, after thinking, I'm thinking now mostly about the situation which is completely unhandled now, which is that if something happens to call ExceptionSink.set_destination() after RemotePantsRunner does, we immediately lose any logs we might have wanted for the thin client's error message.
The _all_destinations bit doesn't really address that except by way of the commented-out incomplete method body which would just dredge up all the logs (not useful for the same/opposite reason), but that situation might be something we want to special-case.
There was a problem hiding this comment.
I think you addressed exactly this in your comment below re: set_pid, actually. Figuring out the log file in a signal handler part now though.
| if faulthandler.is_enabled(): | ||
| faulthandler.disable() | ||
| faulthandler.enable(trace_stream) | ||
| # This permits a non-fatal `kill -31 <pants pid>` for stacktrace retrieval. |
There was a problem hiding this comment.
This should likely be written to both the current trace_stream and the current exception log(s).
Doing the latter thing would require keeping an open file handle for the current exception log(s), which is a bit harder... but I think that getting the test I added to pass will require this, because killing the process with a signal will only result in this firing, rather than also logging an exception.
Additionally, (re)-opening the exception log for our current pid (on a best effort basis...) would likely require an explicit step, like set_pid: we'd call it as soon as we know we've forked.
There was a problem hiding this comment.
The reason I didn't consider this such a huge change when we talked was I think because the biggest change I see is just "having an actual signal handler for SIGUSR2 instead of using faulthandler", and the code I threw away was already doing that in order to handle SIGINT and SIGQUIT. I'm not sure now whether the handling of those signals should be done in this PR. We currently set handlers for those signals in RemotePantsRunner to send a signal to the pantsd-runner process on control-c.
There was a problem hiding this comment.
Well, the biggest change is figuring out where to write in a signal handler as described in your comment, but after that's figured out.
There was a problem hiding this comment.
having an actual signal handler for SIGUSR2 instead of using faulthandler
If you haven't already, take a look at the restrictions on signal handlers: https://docs.python.org/3/library/faulthandler.html ...
The fault handler is called on catastrophic cases and therefore can only use signal-safe functions (e.g. it cannot allocate memory on the heap). Because of this limitation traceback dumping is minimal compared to normal Python tracebacks:
There was a problem hiding this comment.
That's extremely helpful! As an aside, I have used the glibc docs on signal handlers before and found the bit on re-entrancy and signal handlers to be super illuminating to understand what's allowed in a signal handler and why and have found it applicable to other languages. I will peruse the python docs you linked because I was looking for that exactly and literally have a tab open to that page but just was only looking for method names and didn't read it. Thanks!
| safe_mkdir(destination) | ||
| except Exception as e: | ||
| # NB: When this class sets up excepthooks, raising this should be safe, because we always | ||
| # have a destination to log to (os.getcwd() if not otherwise set). |
There was a problem hiding this comment.
Now that this field defaults to None this comment isn't accurate.
But regardless: I think that if this is refactored to send the signal-handler output to the exception log, it won't matter, because we'll need to eagerly open the exception/signal log in order to register it with faulthandler.
There was a problem hiding this comment.
Now that this field defaults to None this comment isn't accurate.
Will update.
But regardless: I think that if this is refactored to send the signal-handler output to the exception log, it won't matter, because we'll need to eagerly open the exception/signal log in order to register it with faulthandler.
I'm understanding this as: we need to handle "opening the exception log in a signal/exception handler" anyway so in doing so we don't have to worry about the validity of the exception log in other methods (because it'd be handled specifically somewhere). Which is cool.
There was a problem hiding this comment.
"opening the exception log in a signal/exception handler" anyway
See the faulthandler docs linked above.
| # TODO: this is messy. Is it less surprising to have the Exiter modified during construction, or | ||
| # in the run() method we do here? | ||
| ExceptionSink.set_destination(self._global_options.pants_workdir) | ||
| self._exiter.should_print_backtrace = self._global_options.print_exception_stacktrace |
There was a problem hiding this comment.
Question I do not know the answer to: should the should_print_backtrace property be on the ExceptionSink instead? It applies to the current errstream, rather than to the Exiter, per-se.
There was a problem hiding this comment.
I moved this around many times, including letting the traceback formatter be None to denote not printing tracebacks. On master, should_print_backtrace applies to the exception message printed to stderr on exit only, not to the log message, so this code conforms to the current status, which also seems to make most sense to me (I was going to delete it, but obviously this is required for the specific case of exiting with error). We take an Exiter as a param in order to avoid subsuming exiting as well as fatal logging functionality. The current errstream is an implementation detail, and btw, also not used yet, I need to make ExceptionSink call its Exiter with the out arg provided.
I don't think should_print_backtrace should be a property of the errstream or the ExceptionSink as opposed to the Exiter unless there are more situations besides "we are printing an error message on an unexpected error, possibly to a terminal" that we would want should_print_backtrace to apply to (which I'm all ears for, but don't see right now).
Note to self to make ExceptionSink use the out arg when exiting.
| try: | ||
| PantsRunner(exiter, start_time=start_time).run() | ||
| except KeyboardInterrupt: | ||
| # TODO: ExceptionSink could probably take care of this too, as well as converting SIGINT into |
There was a problem hiding this comment.
Not sure about that one. Where possible, I feel like treating the Exiter as a high level concept associated with a particular run, and ExceptionSink as the "always available manager of exception logs (and maybe an err stream if it's available)" is a cleanish separation.
With regard to handling keyboard interrupts: those are pretty clearly a per-run concept, and so if possible, their Handling should be part of the Exiter.
There was a problem hiding this comment.
Hm, ok, defining it like that I think is better than what I had in my head, which was "ExceptionSink makes sure your logs are logged" (subsuming trapping as well as logging -- the first half of "ExceptionSink"). If we can move global state e.g. changing signal handlers to the Exiter, or to its file, I think that's a great move.
88df4d1 to
fd183ec
Compare
| if log_location.pid is None: | ||
| raise cls.ExceptionSinkError( | ||
| "No-op call to reset_log_location(): argument was {!r}." | ||
| .format(log_location)) |
There was a problem hiding this comment.
I found this message confusing. I don't think this is a noop. Maybe it should be "log_location must specify either a pid or a directory, but both were None"
Alternatively, you could change LogLocation#__new__ so that it throws when both are None. That would make it easier to trace where the LogLocation in question came from.
There was a problem hiding this comment.
The difficulty here is that I'm overloading LogLocation (it used to be called LogLocationRequest) for both setting state and getting state, and those should have different semantics, like you describe. I think the clear answer is to split the two out (one for exceptions_log_path and one for reset_log_location).
| # Ensure the directory is suitable for writing to, or raise. | ||
| cls._check_or_create_new_destination(log_location.log_dir) | ||
| if log_location.pid is None: | ||
| log_location = log_location.copy(pid=os.getpid()) |
There was a problem hiding this comment.
You could also default the pid in this way in LogLocation#__new__
There was a problem hiding this comment.
Will do as part of above.
20effeb to
a2cf6d7
Compare
| if exception_text: | ||
| yield exception_text | ||
|
|
||
| def broadcast_signal_to_client(self, signum): |
There was a problem hiding this comment.
IMO, we should refactor out all changes to Ctrl+C handling, because that sends signals from the nailgun client to the nailgun server. It should be easy to add later now that we have the pid and pgrp.
There was a problem hiding this comment.
Reverted the changes to this file.
| # process that the stacktrace output lands deterministically in a known place vs to a stray | ||
| # terminal window. | ||
| self._exiter.set_except_hook(sys.stderr) | ||
| ExceptionSink.reset_log_location(LogLocation(os.getpid(), os.getcwd())) |
There was a problem hiding this comment.
We should keep the location we have before this call, because otherwise we will switch away from the configured location to the cwd here.
We do need to set the pid though, because we just finished forking, and definitely have a new pid.
There was a problem hiding this comment.
I have removed the ability to do anything about the pid at all. This is still broken for a totally different reason but won't be in a bit.
| return super(GetLogLocationRequest, cls).__new__(cls, log_dir=log_dir, pid=pid) | ||
|
|
||
|
|
||
| class LogLocation(datatype([ |
There was a problem hiding this comment.
I think this datatype might be overkill for this case, especially since these properties can be change independently.
There was a problem hiding this comment.
This datatype is gone.
ity
left a comment
There was a problem hiding this comment.
lets address these and will re-review
| raise KeyboardInterrupt('User interrupted execution with control-c!') | ||
| # TODO: determine the appropriate signal-safe behavior here (to avoid writing to our file | ||
| # descriptors re-entrantly, which raises an IOError). This is a best-effort solution. | ||
| if cls._is_handling_signal: |
There was a problem hiding this comment.
a docstring here to explain what signals this is actually handling, would be great.
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class RunTrackingExiter(Exiter): |
There was a problem hiding this comment.
lets extract RunTrackingExiter changes out of local_pants_runner since its unrelated to this change and to keep this PR limited to pants_daemon_runner changes.
There was a problem hiding this comment.
The changes from this file have now also been reverted entirely!
| options_bootstrapper = OptionsBootstrapper(env=self._env, args=self._args) | ||
| bootstrap_options = options_bootstrapper.get_bootstrap_options() | ||
| ExceptionSink.set_destination(bootstrap_options.for_global_scope().pants_workdir) | ||
| ExceptionSink.reset_log_location(LogLocation.from_options_for_current_process( |
There was a problem hiding this comment.
as we discussed, separating out pid and log location between daemon_runner and pants_runner is
There was a problem hiding this comment.
This has been changed to just update the location and the exiter from the bootstrap options.
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class RemoteExiter(Exiter): |
There was a problem hiding this comment.
this may not be required at all, since server (pantsd) never sends anything to the client
There was a problem hiding this comment.
Indeed! This file has been reverted entirely!
| logger.fatal('lost active connection to pantsd!') | ||
| raise_with_traceback(self._extract_remote_exception(pantsd_handle.pid, e)) | ||
| error_log_msg = '\nlost active connection to pantsd!' | ||
| ExceptionSink.log_exception(error_log_msg) |
There was a problem hiding this comment.
this wont be required at all, since you are already handling this earlier
There was a problem hiding this comment.
This file has been reverted entirely!
|
|
||
| def run(self, args=None): | ||
| self._setup_logging() | ||
| # Redirect fatal error logging to the current workdir, set the stream to log stacktraces to on |
There was a problem hiding this comment.
all of this can be reverted since we are removing RemoteExiteretc from this change.
There was a problem hiding this comment.
This file has been reverted entirely!
| def _emit(self, s): | ||
| """Append content to the main report file.""" | ||
| if os.path.exists(self._html_dir): # Make sure we're not immediately after a clean-all. | ||
| report_file_probably_writable = ( |
There was a problem hiding this comment.
we can remove this too, as we are removing any RunTracker changes from this.
| def _trapped_signals(self, client): | ||
| """A contextmanager that overrides the SIGINT (control-c) and SIGQUIT (control-\) handlers | ||
| and handles them remotely.""" | ||
| # TODO: if this control-c is done at the wrong time (before the pantsd-runner process begins and |
There was a problem hiding this comment.
lets extract all the ctrl-c changes from this PR into a separate one.
There was a problem hiding this comment.
This file has been reverted entirely!
eddcea0 to
e2ac5a2
Compare
…ptionSink - make it handle SIGUSR2 and unhandled exceptions so we can move that logic out of Exiter
ity
left a comment
There was a problem hiding this comment.
nice work! this is looking great to me - will wait for your changes to +1.
| print(bold(cyan('\ncompleted in {} seconds'.format(elapsed)))) | ||
|
|
||
| # TODO: uncomment this and add an issue link! | ||
| run_count = self._run_count(workdir) |
898e34c to
f70e7c0
Compare
stuhood
left a comment
There was a problem hiding this comment.
Thanks, this looks good. It would be great to continue to (lightly) refine the relationship between the Exiter and the ExceptionSink to make their individual responsibilities clearer... but it looks logically correct from my end.
| except KeyboardInterrupt: | ||
| po.send_signal(signal.SIGINT) | ||
| # The process may still have exited, even if we were interrupted. | ||
| safe_kill(po.pid, signal.SIGINT) |
There was a problem hiding this comment.
You can leave this in this PR, but a reminder: please do not conflate independent changes in one PR like this in the future. If this line causes a bug, or if the rest of the PR contains a bug, they won't be able to be cherry-picked independently. And it also makes reviewing more challenging.
There was a problem hiding this comment.
This was necessary to pass testing. It may not be necessary now with the removal of the signal handlers.
There was a problem hiding this comment.
Again, I'm fine with leaving it... just a note for future PRs.
There was a problem hiding this comment.
If it's required to pass testing because of the no longer existing changes in signal handling, it would point to signal handling (correctly) being an instance of changing multiple things at once (along with the logging changes in this PR), which I agree is made very clear here. I don't understand why this specific change is something to avoid if we needed it to pass CI?
| # Create the directory if possible, or raise if not writable. | ||
| cls._check_or_create_new_destination(new_log_location) | ||
| else: | ||
| # Interpret a None argument to mean to just refresh the file handles for the current pid. |
There was a problem hiding this comment.
This method is never called with None, afaict, so would be good to remove this case to improve clarity.
There was a problem hiding this comment.
Yes, after talking with you last week about where we need to reset the logs (whenever we get bootstrap options or whenever we change pid) and we can remove that now.
| This is where the the error message on exit will be printed to as well. | ||
| """ | ||
| # NB: mutate process-global state! | ||
| # TODO: chain=True will log tracebacks to previous values of the trace stream as well -- what |
There was a problem hiding this comment.
The faulthandler docs indicate that you can call unregisteror disable to remove previous copies of handlers. Is that worth doing here?
There was a problem hiding this comment.
At first I had it doing that, but ended up just copying the current code (hence the TODO, because writing to maybe-closed maybe-unrelated streams after explicitly changing it sounds like the wrong thing to do). I think it would make more sense to unregister and disable here.
| out=cls._interactive_output_stream) | ||
|
|
||
| @classmethod | ||
| def _log_unhandled_exception_and_exit(cls, exc_class=None, exc=None, tb=None, add_newline=False): |
There was a problem hiding this comment.
This method is called from outside of the class, and should maybe be public.
It's also worth considering whether this should (mostly) stay in the Exiter, and just call back to the ExceptionSink for the purposes of writing to the exception logs...? I'm not very clear on that though, given the fact that they both call one another at various times.
There was a problem hiding this comment.
I think I like that idea a lot. I think it specifically should be done, along with the comment on the Exiter holding onto its output stream, in a followup. I will continue addressing your comments and seeing what can be done here without lots more code changes.
| faulthandler.register(signal.SIGUSR2, interactive_output_stream, chain=True) | ||
|
|
||
| # NB: mutate the class variables! | ||
| # We don't *necessarily* need to keep a reference to this, but we do here for clarity. |
There was a problem hiding this comment.
This field is used in _exit_with_failure, so this comment isn't accurate. As commented in _log_unhandled_exception_and_exit, maybe the Exiter should just hold onto the stream it wants to use to render a failure to the user, and the ExceptionSink should just handle the exception logs?
There was a problem hiding this comment.
That sounds like a fantastic idea.
| self.purge_metadata() | ||
| self.pre_fork(**pre_fork_opts or {}) | ||
| is_parent, is_child = fork_func() | ||
| # No arguments reuses the same log location, but with the new pid after forking. |
There was a problem hiding this comment.
I think this comment is in the wrong place?
| out = out or sys.stderr | ||
| if not is_fileobj_definitely_closed(out): | ||
| print(msg, file=out) | ||
| # NB: Ensure we write everything out in case it's not an unbuffered stream like stderr. |
There was a problem hiding this comment.
I think that f.close() implicitly flushes.
There was a problem hiding this comment.
If we close stderr here we may see an error when invoking pants as a pex because the pex executor will call sys.stderr.flush() directly (haven't checked yet, but that was the error we saw in our internal CI). pex writing directly to stderr is something that I believe could be configurable and which would be extremely useful. maybe. i'll make a followup ticket.
There was a problem hiding this comment.
As in, it calls that in the body of a finally: after the pex exits, and then raises an exception if stderr is closed.
There was a problem hiding this comment.
The code is now all the same, but wrapped in a try/catch which calls logger.exception(e). If it is almost definitely better to close it here, I can do it -- just not sure if there's a reason to do it over the flush alone. I've left a detailed TODO above the flush() call about this.
|
|
||
|
|
||
| def is_fileobj_definitely_closed(fileobj): | ||
| """Return whether fileobj points to a file object which is definitely closed. |
There was a problem hiding this comment.
This method is a bit too context-specific to put in util I think.
There was a problem hiding this comment.
Agreed, it's not that pythonic either.
| self.nailgun_client.send_control_c() | ||
| mock_kill.assert_not_called() | ||
|
|
||
| @mock.patch('os.kill', **PATCH_OPTS) |
There was a problem hiding this comment.
This test seems useful... would be good to preserve if possible. If it was broken beforehand, leaving it in as xfail would be good, because there is a race between getting a pid from the server and Ctrl+C arriving at the client that means it's important to be able to noop.
There was a problem hiding this comment.
That race is erased in #6579 with the pid/pgrp being required before the rest of the communication continues, btw. I have added these back, fixed the mock so they pass, and added an additional test for the new maybe_send_signal() method.
| self.assertRegexpMatches(waiter_run.stderr_data, """\ | ||
| Signal {signum} was raised\\. Exiting with failure\\. | ||
| \\(backtrace omitted\\) | ||
| """.format(pid=parent_runner_pid, signum=signal.SIGTERM)) |
There was a problem hiding this comment.
The pid isn't used here, but that's probably fine: telling someone the dead pid is less necessary than rendering the error.
There was a problem hiding this comment.
This is an artifact of a previous implementation which logged the pid with a timestamp to stderr, which is now only supposed to occur when writing to a log file, which is why this error stayed hidden. I think it can be removed.
There was a problem hiding this comment.
The pid has been removed here, replaced with a note about why we only need to check the stderr.
This reverts commit f70e7c0.
|
This is the only remaining failure -- this is what the |
this fixes the keyboardinterrupt integration test
### Problem This PR contains several deemed-necessary-by-me-personally fixes to fatal error logging which weren't put into #6552 because that diff was already massive. This should be considered the same PR as that one. ### Solution - Default the `logs/` directory to be `.pants.d/logs` so we don't litter the buildroot with a top-level `logs/` directory. - Add a timestamp to any fatal errors printed to the terminal to more easily debug race conditions. - Add the process title (from `setproctitle`) to the fatal error log message (this makes it more clear when the process is pantsd, compared to the `sys.argv`, which are arbitrary). - Remove any mention of exception stacktraces from Exiter. - Add `ExceptionSink.reset_should_print_backtrace()` (with a new mutable field `_should_print_backtrace`) to determine whether to print stacktraces to the terminal on exit. - Testing for all of the above.
…ptionSink (#6552) See #6530. This is bullet point 2 of that issue. - rename `ExceptionSink#set_destination()` to `reset_log_location()`. - (re)?acquire open file handles and persist them as `ExceptionSink` private fields so we can write to them in signal handlers. - use `faulthandler.enable()` to log fatal signals (`SIGABRT`, `SIGSEGV`, ...) and stacktraces to the process-specific error stream from #6539. - add `reset_exiter()` and delegate to the exiter to exit on an unhandled exception. - add `reset_interactive_output_stream()` to change which stream the Exiter is provided as the `out` argument, and to enable stacktraces to be sent to that stream when the process receives SIGUSR2 (on par with current functionality) - move everything besides exiting out of the Exiter class We can un-xfail [the integration test](https://github.com/pantsbuild/pants/blob/f70e7c0b7d55a389d31b9b3b706d0901bc9f655e/tests/python/pants_test/pantsd/test_pantsd_integration.py#L416) introduced in #6539. This makes it significantly easier to diagnose pantsd communication errors and their causes, both locally and in CI.
### Problem This PR contains several deemed-necessary-by-me-personally fixes to fatal error logging which weren't put into #6552 because that diff was already massive. This should be considered the same PR as that one. ### Solution - Default the `logs/` directory to be `.pants.d/logs` so we don't litter the buildroot with a top-level `logs/` directory. - Add a timestamp to any fatal errors printed to the terminal to more easily debug race conditions. - Add the process title (from `setproctitle`) to the fatal error log message (this makes it more clear when the process is pantsd, compared to the `sys.argv`, which are arbitrary). - Remove any mention of exception stacktraces from Exiter. - Add `ExceptionSink.reset_should_print_backtrace()` (with a new mutable field `_should_print_backtrace`) to determine whether to print stacktraces to the terminal on exit. - Testing for all of the above.
### Problem #6552 has some changes around signal handling. This was an attempt to move those out, and to gain greater visibility into and control over when pants errors out due to a signal (helping to address difficulties in flaky tests around exiting, such as #6708). Resolves #7014, resolves #6708, resolves #6847, resolves #7199. ### Solution - Create a very simple `SignalHandler` class which can be subclassed to set pants's behavior upon receiving different signals in different execution contexts. - Add a `_signal_handler` field to `ExceptionSink`. - Apply `SignalHandler` everywhere that tries to modify pants's exception handling. - Add the `--pantsd-pailgun-quit-timeout` bootstrap option and implement a timeout strategy to continue to forward output from the remote `pantsd-runner` process for a brief period of time upon receiving a signal, then killing the remote process if it's still alive at the end. - Unskip some flaky tests that we hopefully won't have to reskip. ### Result It's much more clear when and where signal handlers are set, and the flow of control is made more clear. It is easier to extend signal handlers in a hygienic way, allowing for the resolution of multiple flaky tests.
Problem
See #6530. This is bullet point 2 of that issue.
Solution
ExceptionSink#set_destination()toreset_log_location().ExceptionSinkprivate fields so we can write to them in signal handlers.faulthandler.enable()to log fatal signals (SIGABRT,SIGSEGV, ...) and stacktraces to the process-specific error stream from pantsd client logs exceptions from server processes #6539.reset_exiter()and delegate to the exiter to exit on an unhandled exception.reset_interactive_output_stream()to change which stream the Exiter is provided as theoutargument, and to enable stacktraces to be sent to that stream when the process receives SIGUSR2 (on par with current functionality)Result
We can un-xfail the integration test introduced in #6539. This makes it significantly easier to diagnose pantsd communication errors and their causes, both locally and in CI.