Blame - docs/subtle_threading_bugs.md - chromium/src

blob: 5ab7cf92765e4fea364718b494359f93f0d0f7dc [file] [log] [blame] [view]

justincarlson	efe31a6	2017-05-17 00:37:25	[diff] [blame]	1	# Subtle Threading Bugs and Patterns to avoid them
				2
				3	[TOC]
				4
				5	## The Problem
				6	We were using a number of patterns that were problematic:
				7
				8	1. Using `BrowserThread::GetMessageLoop` This isn't safe, since it could return
				9	a valid pointer, but since the caller isn't holding on to a lock anymore, the
				10	target MessageLoop could be destructed while it's being used.
				11
				12	1. Caching of MessageLoop pointers in order to use them later for PostTask and friends
				13
				14	* This was more efficient previously (more on that later) since using
				15	`BrowserThread::GetMessageLoop` involved a lock.
				16
				17	* But it spread logic about the order of thread destruction all over the
				18	code. Code that moved from the IO thread to the file thread and back, in
				19	order to avoid doing disk access on the IO thread, ended up having to do an
				20	extra hop through the UI thread on the way back to the IO thread since the
				21	file thread outlives the IO thread. Of course, most code learnt this the
				22	hard way, after doing the straight forward IO->file->IO thread hop and
				23	updating the code after seeing reliability or user crashes.
				24
				25	* It made the browser shutdown fragile and hence difficult to update.
				26
				27	1. File thread hops using RefCountedThreadSafe objects which have non-trivial
				28	destructors
				29
				30	* To reduce jank, frequently an object on the UI or IO thread would execute
				31	some code on the file thread, then post the result back to the original
				32	thread. We make this easy using `base::Callback` and
				33	`RefCountedThreadSafe`, so this pattern happened often, but it's not always
				34	safe: base::Callback holds an extra reference on the object to ensure that
				35	it doesn't invoke a method on a deleted object. But it's quite possible
				36	that before the file thread's stack unwinds and it releases the extra
				37	reference, that the response task on the original thread executed and
				38	released its own additional reference. The file thread is then left with
				39	the remaining reference and the object gets destructed there. While for
				40	most objects this is ok, many have non-trivial destructors, with the worst
				41	being ones that register with the UI-thread NotificationService. Dangling
				42	pointers would be left behind and tracking these crashes from ChromeBot or
				43	the crash dumps has wasted several days at least for me.
				44
				45	1. Having browser code take different code paths if a thread didn't exist
				46
				47	* This could be either deceptively harmless (i.e. execute synchronously when
				48	it was normally asynchronous), when in fact it makes shutdown slower
				49	because disk access is moved to the UI thread.
				50
				51	* It could lead to data loss, if tasks are silently not posted because the
				52	code assumes this only happens in unit tests, when it could occur on
				53	browser shutdown as well.
				54
				55	## The Solution
				56
				57	* 1+2: Where possible, use `BrowserThread::PostTask`. Everywhere else, use
				58	`scoped_refptr<MessageLoopProxy>`.
				59
				60	`BrowserThread::PostTask` and friends (i.e. `PostDelayedTask`, `DeleteSoon`,
				61	`ReleaseSoon`) are safe and efficient: no locks are grabbed if the target
				62	thread is known to outlive the current thread. The four static methods have
				63	the same signature as the ones from `MessageLoop`, with the addition of the
				64	first parameter to indicate the target thread:
				65
				66	`BrowserThread::PostTask(BrowserThread::FILE, FROM_HERE, task);`
				67
				68	Similarly, `MessageLoopProxy` has (non-static) methods with the same signature
				69	as the `MessageLoop` counterparts, but is safe for caching a [reference
				70	counting] pointer to. You can obtain a `MessageLoopProxy` via
				71	`Thread::message_loop_proxy()`, `BrowserThread::GetMessageLoopProxy()`, or
				72	`MessageLoopProxy::CreateForCurrentThread()`.
				73
				74	* 3: If you want to execute a method on another thread and jump back to the
				75	original thread, use `PostTaskAndReply()`:
				76
				77	BrowserThread::PostTaskAndReply(BrowserThread::FILE, FROM_HERE,
				78	base::Bind(&Foo::DoStuffOnFileThread, this),
				79	base::Bind(&Foo::StuffDone, this));
				80
				81	`PostTaskAndReply()` will make sure that both tasks are destroyed on the
				82	thread they were created on, so if they hold the last reference to the object,
				83	it will be destroyed on the originating thread. You can also use
				84	`PostTaskAndReplyWithResult()` to return a value from the first task and pass
				85	it into the second task.
				86
				87	Alternatively, if your object must be destructed on a specific thread, you can
				88	use a trait from BrowserThread:
				89
				90	class Foo : public base::RefCountedThreadSafe<Foo, BrowserThread::DeleteOnIOThread>
				91
				92	* 4: I've removed all the special casing and always made the objects in the
				93	browser code behave in one way. If you're writing a unit test and need to use
				94	an object that goes to a file thread (where before it would proceed
				95	synchronously), you just need:
				96
				97	BrowserThread file_thread(BrowserThread::FILE, MessageLoop::current());
				98	foo->StartAsyncTaskOnFileThread();
				99	MessageLoop::current()->RunAllPending();
				100
				101	There are plenty of examples now in the tests.
				102
				103	## Gotchas
				104	Even when using `BrowseThread` or `MessageLoopProxy`, you will still likely have
				105	messages lost (usually resulting in memory leaks) when the target thread is in
				106	the process of shutting down: the benefit over `MessageLoop` is primarily one of
				107	avoiding crashing in unpredictable ways. (See this thread for debate.)
				108
				109	`BrowseThread` and `MessageLoopProxy::PostTask` will silently delete a task if
				110	the thread doesn't exist. This is done to avoid having all the code that uses
				111	it have special cases if the target thread exists or not, and to make Valgrind
				112	happy. As usual, the task for `DeleteSoon/ReleaseSoon` doesn't do anything in
				113	its destructor, so this won't cause unexpected behavior with them. But be wary
				114	of posted `Task` objects which have non-trivial destructors or smart pointers as
				115	members. I'm still on the fence about this, since while the latter is
				116	theoretical now, it could lead to problems in the future. I might change it so
				117	that the tasks are not deleted when I'm ready for more Valgrind fun.
				118
				119	If you absolutely must know if a task was posted or not, you can check the
				120	return value of `PostTask` and friends. But note that even if the task was
				121	posted successfully, there's no guarantee that it will run because the target
				122	thread could already have a `QuitTask` queued up, or be in the early stages of
				123	quitting.
				124
				125	`g_browser->io_thread()` and `file_thread()` are still around (the former for
				126	IPC code, and the latter for Linux proxy code which is in net and so can't use
				127	`BrowserThread`). Don't use them unless absolutely necessary.
				128
				129
				130	### More information
				131
				132	* https://bugs.chromium.org/p/chromium/issues/detail?id=25354
				133