justincarlson | efe31a6 | 2017-05-17 00:37:25 | [diff] [blame] | 1 | # Subtle Threading Bugs and Patterns to avoid them |
| 2 | |
| 3 | [TOC] |
| 4 | |
| 5 | ## The Problem |
| 6 | We were using a number of patterns that were problematic: |
| 7 | |
| 8 | 1. Using `BrowserThread::GetMessageLoop` This isn't safe, since it could return |
| 9 | a valid pointer, but since the caller isn't holding on to a lock anymore, the |
| 10 | target MessageLoop could be destructed while it's being used. |
| 11 | |
| 12 | 1. Caching of MessageLoop pointers in order to use them later for PostTask and friends |
| 13 | |
| 14 | * This was more efficient previously (more on that later) since using |
| 15 | `BrowserThread::GetMessageLoop` involved a lock. |
| 16 | |
| 17 | * But it spread logic about the order of thread destruction all over the |
| 18 | code. Code that moved from the IO thread to the file thread and back, in |
| 19 | order to avoid doing disk access on the IO thread, ended up having to do an |
| 20 | extra hop through the UI thread on the way back to the IO thread since the |
| 21 | file thread outlives the IO thread. Of course, most code learnt this the |
| 22 | hard way, after doing the straight forward IO->file->IO thread hop and |
| 23 | updating the code after seeing reliability or user crashes. |
| 24 | |
| 25 | * It made the browser shutdown fragile and hence difficult to update. |
| 26 | |
| 27 | 1. File thread hops using RefCountedThreadSafe objects which have non-trivial |
| 28 | destructors |
| 29 | |
| 30 | * To reduce jank, frequently an object on the UI or IO thread would execute |
| 31 | some code on the file thread, then post the result back to the original |
| 32 | thread. We make this easy using `base::Callback` and |
| 33 | `RefCountedThreadSafe`, so this pattern happened often, but it's not always |
| 34 | safe: base::Callback holds an extra reference on the object to ensure that |
| 35 | it doesn't invoke a method on a deleted object. But it's quite possible |
| 36 | that before the file thread's stack unwinds and it releases the extra |
| 37 | reference, that the response task on the original thread executed and |
| 38 | released its own additional reference. The file thread is then left with |
| 39 | the remaining reference and the object gets destructed there. While for |
| 40 | most objects this is ok, many have non-trivial destructors, with the worst |
| 41 | being ones that register with the UI-thread NotificationService. Dangling |
| 42 | pointers would be left behind and tracking these crashes from ChromeBot or |
| 43 | the crash dumps has wasted several days at least for me. |
| 44 | |
| 45 | 1. Having browser code take different code paths if a thread didn't exist |
| 46 | |
| 47 | * This could be either deceptively harmless (i.e. execute synchronously when |
| 48 | it was normally asynchronous), when in fact it makes shutdown slower |
| 49 | because disk access is moved to the UI thread. |
| 50 | |
| 51 | * It could lead to data loss, if tasks are silently not posted because the |
| 52 | code assumes this only happens in unit tests, when it could occur on |
| 53 | browser shutdown as well. |
| 54 | |
| 55 | ## The Solution |
| 56 | |
| 57 | * 1+2: Where possible, use `BrowserThread::PostTask`. Everywhere else, use |
| 58 | `scoped_refptr<MessageLoopProxy>`. |
| 59 | |
| 60 | `BrowserThread::PostTask` and friends (i.e. `PostDelayedTask`, `DeleteSoon`, |
| 61 | `ReleaseSoon`) are safe and efficient: no locks are grabbed if the target |
| 62 | thread is known to outlive the current thread. The four static methods have |
| 63 | the same signature as the ones from `MessageLoop`, with the addition of the |
| 64 | first parameter to indicate the target thread: |
| 65 | |
| 66 | `BrowserThread::PostTask(BrowserThread::FILE, FROM_HERE, task);` |
| 67 | |
| 68 | Similarly, `MessageLoopProxy` has (non-static) methods with the same signature |
| 69 | as the `MessageLoop` counterparts, but is safe for caching a [reference |
| 70 | counting] pointer to. You can obtain a `MessageLoopProxy` via |
| 71 | `Thread::message_loop_proxy()`, `BrowserThread::GetMessageLoopProxy()`, or |
| 72 | `MessageLoopProxy::CreateForCurrentThread()`. |
| 73 | |
| 74 | * 3: If you want to execute a method on another thread and jump back to the |
| 75 | original thread, use `PostTaskAndReply()`: |
| 76 | |
| 77 | BrowserThread::PostTaskAndReply(BrowserThread::FILE, FROM_HERE, |
| 78 | base::Bind(&Foo::DoStuffOnFileThread, this), |
| 79 | base::Bind(&Foo::StuffDone, this)); |
| 80 | |
| 81 | `PostTaskAndReply()` will make sure that both tasks are destroyed on the |
| 82 | thread they were created on, so if they hold the last reference to the object, |
| 83 | it will be destroyed on the originating thread. You can also use |
| 84 | `PostTaskAndReplyWithResult()` to return a value from the first task and pass |
| 85 | it into the second task. |
| 86 | |
| 87 | Alternatively, if your object must be destructed on a specific thread, you can |
| 88 | use a trait from BrowserThread: |
| 89 | |
| 90 | class Foo : public base::RefCountedThreadSafe<Foo, BrowserThread::DeleteOnIOThread> |
| 91 | |
| 92 | * 4: I've removed all the special casing and always made the objects in the |
| 93 | browser code behave in one way. If you're writing a unit test and need to use |
| 94 | an object that goes to a file thread (where before it would proceed |
| 95 | synchronously), you just need: |
| 96 | |
| 97 | BrowserThread file_thread(BrowserThread::FILE, MessageLoop::current()); |
| 98 | foo->StartAsyncTaskOnFileThread(); |
| 99 | MessageLoop::current()->RunAllPending(); |
| 100 | |
| 101 | There are plenty of examples now in the tests. |
| 102 | |
| 103 | ## Gotchas |
| 104 | Even when using `BrowseThread` or `MessageLoopProxy`, you will still likely have |
| 105 | messages lost (usually resulting in memory leaks) when the target thread is in |
| 106 | the process of shutting down: the benefit over `MessageLoop` is primarily one of |
| 107 | avoiding crashing in unpredictable ways. (See this thread for debate.) |
| 108 | |
| 109 | `BrowseThread` and `MessageLoopProxy::PostTask` will silently delete a task if |
| 110 | the thread doesn't exist. This is done to avoid having all the code that uses |
| 111 | it have special cases if the target thread exists or not, and to make Valgrind |
| 112 | happy. As usual, the task for `DeleteSoon/ReleaseSoon` doesn't do anything in |
| 113 | its destructor, so this won't cause unexpected behavior with them. But be wary |
| 114 | of posted `Task` objects which have non-trivial destructors or smart pointers as |
| 115 | members. I'm still on the fence about this, since while the latter is |
| 116 | theoretical now, it could lead to problems in the future. I might change it so |
| 117 | that the tasks are not deleted when I'm ready for more Valgrind fun. |
| 118 | |
| 119 | If you absolutely must know if a task was posted or not, you can check the |
| 120 | return value of `PostTask` and friends. But note that even if the task was |
| 121 | posted successfully, there's no guarantee that it will run because the target |
| 122 | thread could already have a `QuitTask` queued up, or be in the early stages of |
| 123 | quitting. |
| 124 | |
| 125 | `g_browser->io_thread()` and `file_thread()` are still around (the former for |
| 126 | IPC code, and the latter for Linux proxy code which is in net and so can't use |
| 127 | `BrowserThread`). Don't use them unless absolutely necessary. |
| 128 | |
| 129 | |
| 130 | ### More information |
| 131 | |
| 132 | * https://bugs.chromium.org/p/chromium/issues/detail?id=25354 |
| 133 | |