Skip to content

first attempt at centralizing more global error logging state in ExceptionSink#6552

Merged
cosmicexplorer merged 69 commits into
pantsbuild:masterfrom
cosmicexplorer:catch-everything
Oct 9, 2018
Merged

first attempt at centralizing more global error logging state in ExceptionSink#6552
cosmicexplorer merged 69 commits into
pantsbuild:masterfrom
cosmicexplorer:catch-everything

Conversation

@cosmicexplorer

@cosmicexplorer cosmicexplorer commented Sep 24, 2018

Copy link
Copy Markdown
Contributor

Problem

See #6530. This is bullet point 2 of that issue.

Solution

  • rename ExceptionSink#set_destination() to reset_log_location().
  • (re)?acquire open file handles and persist them as ExceptionSink private fields so we can write to them in signal handlers.
  • add reset_exiter() and delegate to the exiter to exit on an unhandled exception.
  • add reset_interactive_output_stream() to change which stream the Exiter is provided as the out argument, and to enable stacktraces to be sent to that stream when the process receives SIGUSR2 (on par with current functionality)
  • move everything besides exiting out of the Exiter class

Result

We can un-xfail the integration test introduced in #6539. This makes it significantly easier to diagnose pantsd communication errors and their causes, both locally and in CI.

@stuhood stuhood left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks... I think the biggest blocker to getting that test passing will be the fact that the signal handler will not currently write to the exception log. Doing that will require a bit of refactoring.

Comment thread src/python/pants/base/exception_sink.py Outdated
class ExceptionSinkError(Exception): pass

# TODO (in response to the above): we can avoid a single entry point and make everything work with
# great ease if we can ensure all set_* methods are idempotent!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that making these methods idempotent would be good.

Comment thread src/python/pants/base/exception_sink.py Outdated
def set_destination(cls, dir_path):
cls._destination = dir_path
cls._destination = cls._check_or_create_new_destination(dir_path)
cls._all_destinations.append(cls._destination)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a history of destinations is necessarily useful, so would drop this property.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about the use case of RemotePantsRunner a bit, I think I agree, because if we're going to want to dive into any exception logging like we do there we'll probably know where to look instead of wanting to dredge up all our possible log files (?).

However, after thinking, I'm thinking now mostly about the situation which is completely unhandled now, which is that if something happens to call ExceptionSink.set_destination() after RemotePantsRunner does, we immediately lose any logs we might have wanted for the thin client's error message.

The _all_destinations bit doesn't really address that except by way of the commented-out incomplete method body which would just dredge up all the logs (not useful for the same/opposite reason), but that situation might be something we want to special-case.

@cosmicexplorer cosmicexplorer Sep 25, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you addressed exactly this in your comment below re: set_pid, actually. Figuring out the log file in a signal handler part now though.

Comment thread src/python/pants/base/exception_sink.py Outdated
if faulthandler.is_enabled():
faulthandler.disable()
faulthandler.enable(trace_stream)
# This permits a non-fatal `kill -31 <pants pid>` for stacktrace retrieval.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely be written to both the current trace_stream and the current exception log(s).

Doing the latter thing would require keeping an open file handle for the current exception log(s), which is a bit harder... but I think that getting the test I added to pass will require this, because killing the process with a signal will only result in this firing, rather than also logging an exception.

Additionally, (re)-opening the exception log for our current pid (on a best effort basis...) would likely require an explicit step, like set_pid: we'd call it as soon as we know we've forked.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn't consider this such a huge change when we talked was I think because the biggest change I see is just "having an actual signal handler for SIGUSR2 instead of using faulthandler", and the code I threw away was already doing that in order to handle SIGINT and SIGQUIT. I'm not sure now whether the handling of those signals should be done in this PR. We currently set handlers for those signals in RemotePantsRunner to send a signal to the pantsd-runner process on control-c.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the biggest change is figuring out where to write in a signal handler as described in your comment, but after that's figured out.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having an actual signal handler for SIGUSR2 instead of using faulthandler

If you haven't already, take a look at the restrictions on signal handlers: https://docs.python.org/3/library/faulthandler.html ...

The fault handler is called on catastrophic cases and therefore can only use signal-safe functions (e.g. it cannot allocate memory on the heap). Because of this limitation traceback dumping is minimal compared to normal Python tracebacks:

@cosmicexplorer cosmicexplorer Sep 25, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's extremely helpful! As an aside, I have used the glibc docs on signal handlers before and found the bit on re-entrancy and signal handlers to be super illuminating to understand what's allowed in a signal handler and why and have found it applicable to other languages. I will peruse the python docs you linked because I was looking for that exactly and literally have a tab open to that page but just was only looking for method names and didn't read it. Thanks!

Comment thread src/python/pants/base/exception_sink.py Outdated
safe_mkdir(destination)
except Exception as e:
# NB: When this class sets up excepthooks, raising this should be safe, because we always
# have a destination to log to (os.getcwd() if not otherwise set).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this field defaults to None this comment isn't accurate.

But regardless: I think that if this is refactored to send the signal-handler output to the exception log, it won't matter, because we'll need to eagerly open the exception/signal log in order to register it with faulthandler.

@cosmicexplorer cosmicexplorer Sep 25, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this field defaults to None this comment isn't accurate.

Will update.

But regardless: I think that if this is refactored to send the signal-handler output to the exception log, it won't matter, because we'll need to eagerly open the exception/signal log in order to register it with faulthandler.

I'm understanding this as: we need to handle "opening the exception log in a signal/exception handler" anyway so in doing so we don't have to worry about the validity of the exception log in other methods (because it'd be handled specifically somewhere). Which is cool.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"opening the exception log in a signal/exception handler" anyway

See the faulthandler docs linked above.

# TODO: this is messy. Is it less surprising to have the Exiter modified during construction, or
# in the run() method we do here?
ExceptionSink.set_destination(self._global_options.pants_workdir)
self._exiter.should_print_backtrace = self._global_options.print_exception_stacktrace

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question I do not know the answer to: should the should_print_backtrace property be on the ExceptionSink instead? It applies to the current errstream, rather than to the Exiter, per-se.

@cosmicexplorer cosmicexplorer Sep 25, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this around many times, including letting the traceback formatter be None to denote not printing tracebacks. On master, should_print_backtrace applies to the exception message printed to stderr on exit only, not to the log message, so this code conforms to the current status, which also seems to make most sense to me (I was going to delete it, but obviously this is required for the specific case of exiting with error). We take an Exiter as a param in order to avoid subsuming exiting as well as fatal logging functionality. The current errstream is an implementation detail, and btw, also not used yet, I need to make ExceptionSink call its Exiter with the out arg provided.

I don't think should_print_backtrace should be a property of the errstream or the ExceptionSink as opposed to the Exiter unless there are more situations besides "we are printing an error message on an unexpected error, possibly to a terminal" that we would want should_print_backtrace to apply to (which I'm all ears for, but don't see right now).

Note to self to make ExceptionSink use the out arg when exiting.

Comment thread src/python/pants/bin/pants_exe.py Outdated
try:
PantsRunner(exiter, start_time=start_time).run()
except KeyboardInterrupt:
# TODO: ExceptionSink could probably take care of this too, as well as converting SIGINT into

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about that one. Where possible, I feel like treating the Exiter as a high level concept associated with a particular run, and ExceptionSink as the "always available manager of exception logs (and maybe an err stream if it's available)" is a cleanish separation.

With regard to handling keyboard interrupts: those are pretty clearly a per-run concept, and so if possible, their Handling should be part of the Exiter.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, ok, defining it like that I think is better than what I had in my head, which was "ExceptionSink makes sure your logs are logged" (subsuming trapping as well as logging -- the first half of "ExceptionSink"). If we can move global state e.g. changing signal handlers to the Exiter, or to its file, I think that's a great move.

Comment thread src/python/pants/base/exception_sink.py Outdated
if log_location.pid is None:
raise cls.ExceptionSinkError(
"No-op call to reset_log_location(): argument was {!r}."
.format(log_location))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this message confusing. I don't think this is a noop. Maybe it should be "log_location must specify either a pid or a directory, but both were None"

Alternatively, you could change LogLocation#__new__ so that it throws when both are None. That would make it easier to trace where the LogLocation in question came from.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty here is that I'm overloading LogLocation (it used to be called LogLocationRequest) for both setting state and getting state, and those should have different semantics, like you describe. I think the clear answer is to split the two out (one for exceptions_log_path and one for reset_log_location).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment thread src/python/pants/base/exception_sink.py Outdated
# Ensure the directory is suitable for writing to, or raise.
cls._check_or_create_new_destination(log_location.log_dir)
if log_location.pid is None:
log_location = log_location.copy(pid=os.getpid())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also default the pid in this way in LogLocation#__new__

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do as part of above.

@cosmicexplorer cosmicexplorer force-pushed the catch-everything branch 5 times, most recently from 20effeb to a2cf6d7 Compare September 28, 2018 16:27
if exception_text:
yield exception_text

def broadcast_signal_to_client(self, signum):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should refactor out all changes to Ctrl+C handling, because that sends signals from the nailgun client to the nailgun server. It should be easy to add later now that we have the pid and pgrp.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the changes to this file.

# process that the stacktrace output lands deterministically in a known place vs to a stray
# terminal window.
self._exiter.set_except_hook(sys.stderr)
ExceptionSink.reset_log_location(LogLocation(os.getpid(), os.getcwd()))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep the location we have before this call, because otherwise we will switch away from the configured location to the cwd here.

We do need to set the pid though, because we just finished forking, and definitely have a new pid.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the ability to do anything about the pid at all. This is still broken for a totally different reason but won't be in a bit.

Comment thread src/python/pants/base/exception_sink.py Outdated
return super(GetLogLocationRequest, cls).__new__(cls, log_dir=log_dir, pid=pid)


class LogLocation(datatype([

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this datatype might be overkill for this case, especially since these properties can be change independently.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This datatype is gone.

@ity ity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets address these and will re-review

Comment thread src/python/pants/base/exception_sink.py Outdated
raise KeyboardInterrupt('User interrupted execution with control-c!')
# TODO: determine the appropriate signal-safe behavior here (to avoid writing to our file
# descriptors re-entrantly, which raises an IOError). This is a best-effort solution.
if cls._is_handling_signal:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a docstring here to explain what signals this is actually handling, would be great.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done!)

logger = logging.getLogger(__name__)


class RunTrackingExiter(Exiter):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets extract RunTrackingExiter changes out of local_pants_runner since its unrelated to this change and to keep this PR limited to pants_daemon_runner changes.

@cosmicexplorer cosmicexplorer Sep 29, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from this file have now also been reverted entirely!

Comment thread src/python/pants/bin/pants_runner.py Outdated
options_bootstrapper = OptionsBootstrapper(env=self._env, args=self._args)
bootstrap_options = options_bootstrapper.get_bootstrap_options()
ExceptionSink.set_destination(bootstrap_options.for_global_scope().pants_workdir)
ExceptionSink.reset_log_location(LogLocation.from_options_for_current_process(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed, separating out pid and log location between daemon_runner and pants_runner is

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been changed to just update the location and the exiter from the bootstrap options.

logger = logging.getLogger(__name__)


class RemoteExiter(Exiter):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may not be required at all, since server (pantsd) never sends anything to the client

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! This file has been reverted entirely!

logger.fatal('lost active connection to pantsd!')
raise_with_traceback(self._extract_remote_exception(pantsd_handle.pid, e))
error_log_msg = '\nlost active connection to pantsd!'
ExceptionSink.log_exception(error_log_msg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wont be required at all, since you are already handling this earlier

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been reverted entirely!


def run(self, args=None):
self._setup_logging()
# Redirect fatal error logging to the current workdir, set the stream to log stacktraces to on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this can be reverted since we are removing RemoteExiteretc from this change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been reverted entirely!

def _emit(self, s):
"""Append content to the main report file."""
if os.path.exists(self._html_dir): # Make sure we're not immediately after a clean-all.
report_file_probably_writable = (

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this too, as we are removing any RunTracker changes from this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted!

def _trapped_signals(self, client):
"""A contextmanager that overrides the SIGINT (control-c) and SIGQUIT (control-\) handlers
and handles them remotely."""
# TODO: if this control-c is done at the wrong time (before the pantsd-runner process begins and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets extract all the ctrl-c changes from this PR into a separate one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been reverted entirely!

@cosmicexplorer cosmicexplorer force-pushed the catch-everything branch 3 times, most recently from eddcea0 to e2ac5a2 Compare September 30, 2018 02:03

@ity ity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work! this is looking great to me - will wait for your changes to +1.

print(bold(cyan('\ncompleted in {} seconds'.format(elapsed))))

# TODO: uncomment this and add an issue link!
run_count = self._run_count(workdir)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the culprit ;)

@cosmicexplorer

cosmicexplorer commented Oct 2, 2018

Copy link
Copy Markdown
Contributor Author

Just tried cherry-picking, and remembered we'll probably need #6533 and #6539 first, trying that now (EDIT: that worked, so this should be fine to hotfix onto 1.10.x).

@stuhood stuhood left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good. It would be great to continue to (lightly) refine the relationship between the Exiter and the ExceptionSink to make their individual responsibilities clearer... but it looks logically correct from my end.

except KeyboardInterrupt:
po.send_signal(signal.SIGINT)
# The process may still have exited, even if we were interrupted.
safe_kill(po.pid, signal.SIGINT)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave this in this PR, but a reminder: please do not conflate independent changes in one PR like this in the future. If this line causes a bug, or if the rest of the PR contains a bug, they won't be able to be cherry-picked independently. And it also makes reviewing more challenging.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was necessary to pass testing. It may not be necessary now with the removal of the signal handlers.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm fine with leaving it... just a note for future PRs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's required to pass testing because of the no longer existing changes in signal handling, it would point to signal handling (correctly) being an instance of changing multiple things at once (along with the logging changes in this PR), which I agree is made very clear here. I don't understand why this specific change is something to avoid if we needed it to pass CI?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted!

Comment thread src/python/pants/base/exception_sink.py Outdated
# Create the directory if possible, or raise if not writable.
cls._check_or_create_new_destination(new_log_location)
else:
# Interpret a None argument to mean to just refresh the file handles for the current pid.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is never called with None, afaict, so would be good to remove this case to improve clarity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, after talking with you last week about where we need to reset the logs (whenever we get bootstrap options or whenever we change pid) and we can remove that now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

Comment thread src/python/pants/base/exception_sink.py Outdated
This is where the the error message on exit will be printed to as well.
"""
# NB: mutate process-global state!
# TODO: chain=True will log tracebacks to previous values of the trace stream as well -- what

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The faulthandler docs indicate that you can call unregisteror disable to remove previous copies of handlers. Is that worth doing here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I had it doing that, but ended up just copying the current code (hence the TODO, because writing to maybe-closed maybe-unrelated streams after explicitly changing it sounds like the wrong thing to do). I think it would make more sense to unregister and disable here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

out=cls._interactive_output_stream)

@classmethod
def _log_unhandled_exception_and_exit(cls, exc_class=None, exc=None, tb=None, add_newline=False):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is called from outside of the class, and should maybe be public.

It's also worth considering whether this should (mostly) stay in the Exiter, and just call back to the ExceptionSink for the purposes of writing to the exception logs...? I'm not very clear on that though, given the fact that they both call one another at various times.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like that idea a lot. I think it specifically should be done, along with the comment on the Exiter holding onto its output stream, in a followup. I will continue addressing your comments and seeing what can be done here without lots more code changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broke out #6596!

faulthandler.register(signal.SIGUSR2, interactive_output_stream, chain=True)

# NB: mutate the class variables!
# We don't *necessarily* need to keep a reference to this, but we do here for clarity.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is used in _exit_with_failure, so this comment isn't accurate. As commented in _log_unhandled_exception_and_exit, maybe the Exiter should just hold onto the stream it wants to use to render a failure to the user, and the ExceptionSink should just handle the exception logs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a fantastic idea.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broke out #6596!

self.purge_metadata()
self.pre_fork(**pre_fork_opts or {})
is_parent, is_child = fork_func()
# No arguments reuses the same log location, but with the new pid after forking.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is in the wrong place?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted!

Comment thread src/python/pants/base/exiter.py Outdated
out = out or sys.stderr
if not is_fileobj_definitely_closed(out):
print(msg, file=out)
# NB: Ensure we write everything out in case it's not an unbuffered stream like stderr.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that f.close() implicitly flushes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we close stderr here we may see an error when invoking pants as a pex because the pex executor will call sys.stderr.flush() directly (haven't checked yet, but that was the error we saw in our internal CI). pex writing directly to stderr is something that I believe could be configurable and which would be extremely useful. maybe. i'll make a followup ticket.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, it calls that in the body of a finally: after the pex exits, and then raises an exception if stderr is closed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is now all the same, but wrapped in a try/catch which calls logger.exception(e). If it is almost definitely better to close it here, I can do it -- just not sure if there's a reason to do it over the flush alone. I've left a detailed TODO above the flush() call about this.

Comment thread src/python/pants/util/fileutil.py Outdated


def is_fileobj_definitely_closed(fileobj):
"""Return whether fileobj points to a file object which is definitely closed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is a bit too context-specific to put in util I think.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's not that pythonic either.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted!

self.nailgun_client.send_control_c()
mock_kill.assert_not_called()

@mock.patch('os.kill', **PATCH_OPTS)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems useful... would be good to preserve if possible. If it was broken beforehand, leaving it in as xfail would be good, because there is a race between getting a pid from the server and Ctrl+C arriving at the client that means it's important to be able to noop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That race is erased in #6579 with the pid/pgrp being required before the rest of the communication continues, btw. I have added these back, fixed the mock so they pass, and added an additional test for the new maybe_send_signal() method.

self.assertRegexpMatches(waiter_run.stderr_data, """\
Signal {signum} was raised\\. Exiting with failure\\.
\\(backtrace omitted\\)
""".format(pid=parent_runner_pid, signum=signal.SIGTERM))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pid isn't used here, but that's probably fine: telling someone the dead pid is less necessary than rendering the error.

@cosmicexplorer cosmicexplorer Oct 4, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an artifact of a previous implementation which logged the pid with a timestamp to stderr, which is now only supposed to occur when writing to a log file, which is why this error stayed hidden. I think it can be removed.

@cosmicexplorer cosmicexplorer Oct 5, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pid has been removed here, replaced with a note about why we only need to check the stderr.

@cosmicexplorer cosmicexplorer changed the title [WIP] first attempt at centralizing more global error logging state in ExceptionSink first attempt at centralizing more global error logging state in ExceptionSink Oct 4, 2018

@stuhood stuhood left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit re-xfails the test that this was supposed to be fixing. We need that test to be passing here. Let's discuss when you're available.

@stuhood stuhood left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Danny.

@cosmicexplorer

Copy link
Copy Markdown
Contributor Author

This is the only remaining failure -- this is what the safe_kill() call was for earlier. I have a list of quick followups (separate from #6596, which doesn't need to happen immediately) but given the size of this PR I'm just going to fix the failure, land this, and then make the followup.

this fixes the keyboardinterrupt integration test
@cosmicexplorer cosmicexplorer merged commit 295ea60 into pantsbuild:master Oct 9, 2018
cosmicexplorer added a commit that referenced this pull request Oct 10, 2018
### Problem

This PR contains several deemed-necessary-by-me-personally fixes to fatal error logging which weren't put into #6552 because that diff was already massive. This should be considered the same PR as that one.

### Solution

- Default the `logs/` directory to be `.pants.d/logs` so we don't litter the buildroot with a top-level `logs/` directory.
- Add a timestamp to any fatal errors printed to the terminal to more easily debug race conditions.
- Add the process title (from `setproctitle`) to the fatal error log message (this makes it more clear when the process is pantsd, compared to the `sys.argv`, which are arbitrary).
- Remove any mention of exception stacktraces from Exiter. 
- Add `ExceptionSink.reset_should_print_backtrace()` (with a new mutable field `_should_print_backtrace`) to determine whether to print stacktraces to the terminal on exit.
- Testing for all of the above.
cosmicexplorer added a commit that referenced this pull request Oct 10, 2018
…ptionSink (#6552)

See #6530. This is bullet point 2 of that issue.

- rename `ExceptionSink#set_destination()` to `reset_log_location()`.
- (re)?acquire open file handles and persist them as `ExceptionSink` private fields so we can write to them in signal handlers.
  - use `faulthandler.enable()` to log fatal signals (`SIGABRT`, `SIGSEGV`, ...) and stacktraces to the process-specific error stream from #6539.
- add `reset_exiter()` and delegate to the exiter to exit on an unhandled exception.
- add `reset_interactive_output_stream()` to change which stream the Exiter is provided as the `out` argument, and to enable stacktraces to be sent to that stream when the process receives SIGUSR2 (on par with current functionality)
- move everything besides exiting out of the Exiter class

We can un-xfail [the integration test](https://github.com/pantsbuild/pants/blob/f70e7c0b7d55a389d31b9b3b706d0901bc9f655e/tests/python/pants_test/pantsd/test_pantsd_integration.py#L416) introduced in #6539. This makes it significantly easier to diagnose pantsd communication errors and their causes, both locally and in CI.
cosmicexplorer added a commit that referenced this pull request Oct 10, 2018
### Problem

This PR contains several deemed-necessary-by-me-personally fixes to fatal error logging which weren't put into #6552 because that diff was already massive. This should be considered the same PR as that one.

### Solution

- Default the `logs/` directory to be `.pants.d/logs` so we don't litter the buildroot with a top-level `logs/` directory.
- Add a timestamp to any fatal errors printed to the terminal to more easily debug race conditions.
- Add the process title (from `setproctitle`) to the fatal error log message (this makes it more clear when the process is pantsd, compared to the `sys.argv`, which are arbitrary).
- Remove any mention of exception stacktraces from Exiter. 
- Add `ExceptionSink.reset_should_print_backtrace()` (with a new mutable field `_should_print_backtrace`) to determine whether to print stacktraces to the terminal on exit.
- Testing for all of the above.
cosmicexplorer added a commit that referenced this pull request Mar 22, 2019
### Problem

#6552 has some changes around signal handling. This was an attempt to move those out, and to gain greater visibility into and control over when pants errors out due to a signal (helping to address difficulties in flaky tests around exiting, such as #6708).

Resolves #7014, resolves #6708, resolves #6847, resolves #7199.

### Solution

- Create a very simple `SignalHandler` class which can be subclassed to set pants's behavior upon receiving different signals in different execution contexts.
- Add a `_signal_handler` field to `ExceptionSink`.
- Apply `SignalHandler` everywhere that tries to modify pants's exception handling.
- Add the `--pantsd-pailgun-quit-timeout` bootstrap option and implement a timeout strategy to continue to forward output from the remote `pantsd-runner` process for a brief period of time upon receiving a signal, then killing the remote process if it's still alive at the end.
- Unskip some flaky tests that we hopefully won't have to reskip.

### Result

It's much more clear when and where signal handlers are set, and the flow of control is made more clear. It is easier to extend signal handlers in a hygienic way, allowing for the resolution of multiple flaky tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants