summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* workqueue: re-add lockdep dependencies for flushingJohannes Berg2018-08-221-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In flush_work(), we need to create a lockdep dependency so that the following scenario is appropriately tagged as a problem: work_function() { mutex_lock(&mutex); ... } other_function() { mutex_lock(&mutex); flush_work(&work); // or cancel_work_sync(&work); } This is a problem since the work might be running and be blocked on trying to acquire the mutex. Similarly, in flush_workqueue(). These were removed after cross-release partially caught these problems, but now cross-release was reverted anyway. IMHO the removal was erroneous anyway though, since lockdep should be able to catch potential problems, not just actual ones, and cross-release would only have caught the problem when actually invoking wait_for_completion(). Fixes: fd1a5b04dfb8 ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: skip lockdep wq dependency in cancel_work_sync()Johannes Berg2018-08-221-15/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In cancel_work_sync(), we can only have one of two cases, even with an ordered workqueue: * the work isn't running, just cancelled before it started * the work is running, but then nothing else can be on the workqueue before it Thus, we need to skip the lockdep workqueue dependency handling, otherwise we get false positive reports from lockdep saying that we have a potential deadlock when the workqueue also has other work items with locking, e.g. work1_function() { mutex_lock(&mutex); ... } work2_function() { /* nothing */ } other_function() { queue_work(ordered_wq, &work1); queue_work(ordered_wq, &work2); mutex_lock(&mutex); cancel_work_sync(&work2); } As described above, this isn't a problem, but lockdep will currently flag it as if cancel_work_sync() was flush_work(), which *is* a problem. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: move function definitions within CONFIG_SMP blockMathieu Malaterre2018-05-231-2/+2
| | | | | | | | | | | | | | | | | In commit 7ee681b25284 ("workqueue: Convert to state machine callbacks"), three new function definitions were added: ‘workqueue_prepare_cpu’, ‘workqueue_online_cpu’ and ‘workqueue_offline_cpu’. Move these function definitions within a CONFIG_SMP block since they are not used outside of it. This will match function declarations in header <include/linux/workqueue.h>, and silence the following gcc warning (W=1): kernel/workqueue.c:4743:5: warning: no previous prototype for ‘workqueue_prepare_cpu’ [-Wmissing-prototypes] kernel/workqueue.c:4756:5: warning: no previous prototype for ‘workqueue_online_cpu’ [-Wmissing-prototypes] kernel/workqueue.c:4783:5: warning: no previous prototype for ‘workqueue_offline_cpu’ [-Wmissing-prototypes] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Make sure struct worker is accessible for wq_worker_comm()Tejun Heo2018-05-211-24/+34
| | | | | | | | | | | | The worker struct could already be freed when wq_worker_comm() tries to access it for reporting. This patch protects PF_WQ_WORKER modifications with wq_pool_attach_mutex and makes wq_worker_comm() test the flag before dereferencing worker from kthread_data(), which ensures that it only dereferences when the worker struct is valid. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Lai Jiangshan <jiangshanlai@gmail.com> Fixes: 6b59808bfe48 ("workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}")
* workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}Tejun Heo2018-05-183-2/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There can be a lot of workqueue workers and they all show up with the cryptic kworker/* names making it difficult to understand which is doing what and how they came to be. # ps -ef | grep kworker root 4 2 0 Feb25 ? 00:00:00 [kworker/0:0H] root 6 2 0 Feb25 ? 00:00:00 [kworker/u112:0] root 19 2 0 Feb25 ? 00:00:00 [kworker/1:0H] root 25 2 0 Feb25 ? 00:00:00 [kworker/2:0H] root 31 2 0 Feb25 ? 00:00:00 [kworker/3:0H] ... This patch makes workqueue workers report the latest workqueue it was executing for through /proc/PID/{comm,stat,status}. The extra information is appended to the kthread name with intervening '+' if currently executing, otherwise '-'. # cat /proc/25/comm kworker/2:0-events_power_efficient # cat /proc/25/stat 25 (kworker/2:0-events_power_efficient) I 2 0 0 0 -1 69238880 0 0... # grep Name /proc/25/status Name: kworker/2:0-events_power_efficient Unfortunately, ps(1) truncates comm to 15 characters, # ps 25 PID TTY STAT TIME COMMAND 25 ? I 0:00 [kworker/2:0-eve] making it a lot less useful; however, this should be an easy fix from ps(1) side. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Craig Small <csmall@enc.com.au>
* proc: Consolidate task->comm formatting into proc_task_name()Tejun Heo2018-05-183-14/+19
| | | | | | | | | | | proc shows task->comm in three places - comm, stat, status - and each is fetching and formatting task->comm slighly differently. This patch renames task_name() to proc_task_name(), makes it more generic, and updates all three paths to use it. This will enable expanding comm reporting for workqueue workers. Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Set worker->desc to workqueue name by defaultTejun Heo2018-05-182-12/+10
| | | | | | | | | | | | | | | | | Work functions can use set_worker_desc() to improve the visibility of what the worker task is doing. Currently, the desc field is unset at the beginning of each execution and there is a separate field to track the field is set during the current execution. Instead of leaving empty till desc is set, worker->desc can be used to remember the last workqueue the worker worked on by default and users that use set_worker_desc() can override it to something more informative as necessary. This simplifies desc handling and helps tracking the last workqueue that the worker exected on to improve visibility. Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Make worker_attach/detach_pool() update worker->poolTejun Heo2018-05-182-9/+9
| | | | | | | | | | | | | | | For historical reasons, the worker attach/detach functions don't currently manage worker->pool and the callers are manually and inconsistently updating it. This patch moves worker->pool updates into the worker attach/detach functions. This makes worker->pool consistent and clearly defines how worker->pool updates are synchronized. This will help later workqueue visibility improvements by allowing safe access to workqueue information from worker->task. Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutexTejun Heo2018-05-181-21/+20
| | | | | | | | | | | | | | To improve workqueue visibility, we want to be able to access workqueue information from worker tasks. The per-pool attach mutex makes that difficult because there's no way of stabilizing task -> worker pool association without knowing the pool first. Worker attach/detach is a slow path and there's no need for different pools to be able to perform them concurrently. This patch replaces the per-pool attach_mutex with global wq_pool_attach_mutex to prepare for visibility improvement changes. Signed-off-by: Tejun Heo <tj@kernel.org>
* Merge tag 'trace-v4.17-rc4-2' of ↵Linus Torvalds2018-05-173-22/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Some of the ftrace internal events use a zero for a data size of a field event. This is increasingly important for the histogram trigger work that is being extended. While auditing trace events, I found that a couple of the xen events were used as just marking that a function was called, by creating a static array of size zero. This can play havoc with the tracing features if these events are used, because a zero size of a static array is denoted as a special nul terminated dynamic array (this is what the trace_marker code uses). But since the xen events have no size, they are not nul terminated, and unexpected results may occur. As trace events were never intended on being a marker to denote that a function was hit or not, especially since function tracing and kprobes can trivially do the same, the best course of action is to simply remove these events" * tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
| * tracing/x86/xen: Remove zero data size trace events ↵Steven Rostedt (VMware)2018-05-143-22/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | trace_xen_mmu_flush_tlb{_all} Doing an audit of trace events, I discovered two trace events in the xen subsystem that use a hack to create zero data size trace events. This is not what trace events are for. Trace events add memory footprint overhead, and if all you need to do is see if a function is hit or not, simply make that function noinline and use function tracer filtering. Worse yet, the hack used was: __array(char, x, 0) Which creates a static string of zero in length. There's assumptions about such constructs in ftrace that this is a dynamic string that is nul terminated. This is not the case with these tracepoints and can cause problems in various parts of ftrace. Nuke the trace events! Link: http://lkml.kernel.org/r/20180509144605.5a220327@gandalf.local.home Cc: stable@vger.kernel.org Fixes: 95a7d76897c1e ("xen/mmu: Use Xen specific TLB flush instead of the generic one.") Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* | Merge tag 'trace-v4.17-rc5-vsprintf' of ↵Linus Torvalds2018-05-161-11/+15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull memory barrier for from Steven Rostedt: "The memory barrier usage in updating the random ptr hash for %p in vsprintf is incorrect. Instead of adding the read memory barrier into vsprintf() which will cause a slight degradation to a commonly used function in the kernel just to solve a very unlikely race condition that can only happen at boot up, change the code from using a variable branch to a static_branch. Not only does this solve the race condition, it actually will improve the performance of vsprintf() by removing the conditional branch that is only needed at boot" * tag 'trace-v4.17-rc5-vsprintf' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: vsprintf: Replace memory barrier with static_key for random_ptr_key update
| * | vsprintf: Replace memory barrier with static_key for random_ptr_key updateSteven Rostedt (VMware)2018-05-161-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewing Tobin's patches for getting pointers out early before entropy has been established, I noticed that there's a lone smp_mb() in the code. As with most lone memory barriers, this one appears to be incorrectly used. We currently basically have this: get_random_bytes(&ptr_key, sizeof(ptr_key)); /* * have_filled_random_ptr_key==true is dependent on get_random_bytes(). * ptr_to_id() needs to see have_filled_random_ptr_key==true * after get_random_bytes() returns. */ smp_mb(); WRITE_ONCE(have_filled_random_ptr_key, true); And later we have: if (unlikely(!have_filled_random_ptr_key)) return string(buf, end, "(ptrval)", spec); /* Missing memory barrier here. */ hashval = (unsigned long)siphash_1u64((u64)ptr, &ptr_key); As the CPU can perform speculative loads, we could have a situation with the following: CPU0 CPU1 ---- ---- load ptr_key = 0 store ptr_key = random smp_mb() store have_filled_random_ptr_key load have_filled_random_ptr_key = true BAD BAD BAD! (you're so bad!) Because nothing prevents CPU1 from loading ptr_key before loading have_filled_random_ptr_key. But this race is very unlikely, but we can't keep an incorrect smp_mb() in place. Instead, replace the have_filled_random_ptr_key with a static_branch not_filled_random_ptr_key, that is initialized to true and changed to false when we get enough entropy. If the update happens in early boot, the static_key is updated immediately, otherwise it will have to wait till entropy is filled and this happens in an interrupt handler which can't enable a static_key, as that requires a preemptible context. In that case, a work_queue is used to enable it, as entropy already took too long to establish in the first place waiting a little more shouldn't hurt anything. The benefit of using the static key is that the unlikely branch in vsprintf() now becomes a nop. Link: http://lkml.kernel.org/r/20180515100558.21df515e@gandalf.local.home Cc: stable@vger.kernel.org Fixes: ad67b74d2469d ("printk: hash addresses printed with %p") Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* | | Merge tag 'afs-fixes-20180514' of ↵Linus Torvalds2018-05-1517-168/+266
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "Here's a set of patches that fix a number of bugs in the in-kernel AFS client, including: - Fix directory locking to not use individual page locks for directory reading/scanning but rather to use a semaphore on the afs_vnode struct as the directory contents must be read in a single blob and data from different reads must not be mixed as the entire contents may be shuffled about between reads. - Fix address list parsing to handle port specifiers correctly. - Only give up callback records on a server if we actually talked to that server (we might not be able to access a server). - Fix some callback handling bugs, including refcounting, whole-volume callbacks and when callbacks actually get broken in response to a CB.CallBack op. - Fix some server/address rotation bugs, including giving up if we can't probe a server; giving up if a server says it doesn't have a volume, but there are more servers to try. - Fix the decoding of fetched statuses to be OpenAFS compatible. - Fix the handling of server lookups in Cache Manager ops (such as CB.InitCallBackState3) to use a UUID if possible and to handle no server being found. - Fix a bug in server lookup where not all addresses are compared. - Fix the non-encryption of calls that prevents some servers from being accessed (this also requires an AF_RXRPC patch that has already gone in through the net tree). There's also a patch that adds tracepoints to log Cache Manager ops that don't find a matching server, either by UUID or by address" * tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix the non-encryption of calls afs: Fix CB.CallBack handling afs: Fix whole-volume callback handling afs: Fix afs_find_server search loop afs: Fix the handling of an unfound server in CM operations afs: Add a tracepoint to record callbacks from unlisted servers afs: Fix the handling of CB.InitCallBackState3 to find the server by UUID afs: Fix VNOVOL handling in address rotation afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility afs: Fix server rotation's handling of fileserver probe failure afs: Fix refcounting in callback registration afs: Fix giving up callbacks on server destruction afs: Fix address list parsing afs: Fix directory page locking
| * | | afs: Fix the non-encryption of callsDavid Howells2018-05-141-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some AFS servers refuse to accept unencrypted traffic, so can't be accessed with kAFS. Set the AF_RXRPC security level to encrypt client calls to deal with this. Note that incoming service calls are set by the remote client and so aren't affected by this. This requires an AF_RXRPC patch to pass the value set by setsockopt to calls begun by the kernel. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix CB.CallBack handlingDavid Howells2018-05-141-28/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The handling of CB.CallBack messages sent by the fileserver to the client is broken in that they are currently being processed after the reply has been transmitted. This is not what the fileserver expects, however. It holds up change visibility until the reply comes so as to maintain cache coherency, and so expects the client to have to refetch the state on the affected files. Fix CB.CallBack handling to perform the callback break before sending the reply. The fileserver is free to hold up status fetches issued by other threads on the same client that occur in reponse to the callback until any pending changes have been committed. Fixes: d001648ec7cf ("rxrpc: Don't expose skbs to in-kernel users [ver #2]") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix whole-volume callback handlingDavid Howells2018-05-1410-32/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible for an AFS file server to issue a whole-volume notification that callbacks on all the vnodes in the file have been broken. This is done for R/O and backup volumes (which don't have per-file callbacks) and for things like a volume being taken offline. Fix callback handling to detect whole-volume notifications, to track it across operations and to check it during inode validation. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix afs_find_server search loopMarc Dionne2018-05-141-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code that looks up servers by addresses makes the assumption that the list of addresses for a server is sorted. It exits the loop if it finds that the target address is larger than the current candidate. As the list is not currently sorted, this can lead to a failure to find a matching server, which can cause callbacks from that server to be ignored. Remove the early exit case so that the complete list is searched. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix the handling of an unfound server in CM operationsDavid Howells2018-05-142-27/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the client cache manager operations that need the server record (CB.Callback, CB.InitCallBackState, and CB.InitCallBackState3) can't find the server record, they abort the call from the file server with RX_CALL_DEAD when they should return okay. Fixes: c35eccb1f614 ("[AFS]: Implement the CB.InitCallBackState3 operation.") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Add a tracepoint to record callbacks from unlisted serversDavid Howells2018-05-142-3/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tracepoint to record callbacks from servers for which we don't have a record. Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix the handling of CB.InitCallBackState3 to find the server by UUIDDavid Howells2018-05-141-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the handling of the CB.InitCallBackState3 service call to find the record of a server that we're using by looking it up by the UUID passed as the parameter rather than by its address (of which it might have many, and which may change). Fixes: c35eccb1f614 ("[AFS]: Implement the CB.InitCallBackState3 operation.") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix VNOVOL handling in address rotationDavid Howells2018-05-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a volume location record lists multiple file servers for a volume, then it's possible that due to a misconfiguration or a changing configuration that one of the file servers doesn't know about it yet and will abort VNOVOL. Currently, the rotation algorithm will stop with EREMOTEIO. Fix this by moving on to try the next server if VNOVOL is returned. Once all the servers have been tried and the record rechecked, the algorithm will stop with EREMOTEIO or ENOMEDIUM. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibilityDavid Howells2018-05-141-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The OpenAFS server's RXAFS_InlineBulkStatus implementation has a bug whereby if an error occurs on one of the vnodes being queried, then the errorCode field is set correctly in the corresponding status, but the interfaceVersion field is left unset. Fix kAFS to deal with this by evaluating the AFSFetchStatus blob against the following cases when called from FS.InlineBulkStatus delivery: (1) If InterfaceVersion == 0 then: (a) If errorCode != 0 then it indicates the abort code for the corresponding vnode. (b) If errorCode == 0 then the status record is invalid. (2) If InterfaceVersion == 1 then: (a) If errorCode != 0 then it indicates the abort code for the corresponding vnode. (b) If errorCode == 0 then the status record is valid and can be parsed. (3) If InterfaceVersion is anything else then the status record is invalid. Fixes: dd9fbcb8e103 ("afs: Rearrange status mapping") Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix server rotation's handling of fileserver probe failureDavid Howells2018-05-141-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The server rotation algorithm just gives up if it fails to probe a fileserver. Fix this by rotating to the next fileserver instead. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix refcounting in callback registrationDavid Howells2018-05-144-22/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The refcounting on afs_cb_interest struct objects in afs_register_server_cb_interest() is wrong as it uses the server list entry's call back interest pointer without regard for the fact that it might be replaced at any time and the object thrown away. Fix this by: (1) Put a lock on the afs_server_list struct that can be used to mediate access to the callback interest pointers in the servers array. (2) Keep a ref on the callback interest that we get from the entry. (3) Dropping the old reference held by vnode->cb_interest if we replace the pointer. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix giving up callbacks on server destructionDavid Howells2018-05-143-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a server record is destroyed, we want to send a message to the server telling it that we're giving up all the callbacks it has promised us. Apply two fixes to this: (1) Only send the FS.GiveUpAllCallBacks message if we actually got a callback from that server. We assume this to be the case if we performed at least one successful FS operation on that server. (2) Send it to the address last used for that server rather than always picking the first address in the list (which might be unreachable). Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix address list parsingDavid Howells2018-05-141-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The parsing of port specifiers in the address list obtained from the DNS resolution upcall doesn't work as in4_pton() and in6_pton() will fail on encountering an unexpected delimiter (in this case, the '+' marking the port number). However, in*_pton() can't be given multiple specifiers. Fix this by finding the delimiter in advance and not relying on in*_pton() to find the end of the address for us. Fixes: 8b2a464ced77 ("afs: Add an address list concept") Signed-off-by: David Howells <dhowells@redhat.com>
| * | | afs: Fix directory page lockingDavid Howells2018-05-144-24/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The afs directory loading code (primarily afs_read_dir()) locks all the pages that hold a directory's content blob to defend against getdents/getdents races and getdents/lookup races where the competitors issue conflicting reads on the same data. As the reads will complete consecutively, they may retrieve different versions of the data and one may overwrite the data that the other is busy parsing. Fix this by not locking the pages at all, but rather by turning the validation lock into an rwsem and getting an exclusive lock on it whilst reading the data or validating the attributes and a shared lock whilst parsing the data. Sharing the attribute validation lock should be fine as the data fetch will retrieve the attributes also. The individual page locks aren't needed at all as the only place they're being used is to serialise data loading. Without this patch, the: if (!test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) { ... } part of afs_read_dir() may be skipped, leaving the pages unlocked when we hit the success: clause - in which case we try to unlock the not-locked pages, leading to the following oops: page:ffffe38b405b4300 count:3 mapcount:0 mapping:ffff98156c83a978 index:0x0 flags: 0xfffe000001004(referenced|private) raw: 000fffe000001004 ffff98156c83a978 0000000000000000 00000003ffffffff raw: dead000000000100 dead000000000200 0000000000000001 ffff98156b27c000 page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffff98156b27c000 ------------[ cut here ]------------ kernel BUG at mm/filemap.c:1205! ... RIP: 0010:unlock_page+0x43/0x50 ... Call Trace: afs_dir_iterate+0x789/0x8f0 [kafs] ? _cond_resched+0x15/0x30 ? kmem_cache_alloc_trace+0x166/0x1d0 ? afs_do_lookup+0x69/0x490 [kafs] ? afs_do_lookup+0x101/0x490 [kafs] ? key_default_cmp+0x20/0x20 ? request_key+0x3c/0x80 ? afs_lookup+0xf1/0x340 [kafs] ? __lookup_slow+0x97/0x150 ? lookup_slow+0x35/0x50 ? walk_component+0x1bf/0x490 ? path_lookupat.isra.52+0x75/0x200 ? filename_lookup.part.66+0xa0/0x170 ? afs_end_vnode_operation+0x41/0x60 [kafs] ? __check_object_size+0x9c/0x171 ? strncpy_from_user+0x4a/0x170 ? vfs_statx+0x73/0xe0 ? __do_sys_newlstat+0x39/0x70 ? __x64_sys_getdents+0xc9/0x140 ? __x64_sys_getdents+0x140/0x140 ? do_syscall_64+0x5b/0x160 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
* | | | Merge tag 'scsi-fixes' of ↵Linus Torvalds2018-05-152-5/+5
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Two small driver fixes: aacraid to fix an unknown IU type on task management functions which causes a firmware fault and vmw_pvscsi to change a return code to retry the operation instead of causing an immediate error" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: aacraid: Correct hba_send to include iu_type scsi: vmw-pvscsi: return DID_BUS_BUSY for adapter-initated aborts
| * | | | scsi: aacraid: Correct hba_send to include iu_typeDave Carroll2018-05-021-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b60710ec7d7a ("scsi: aacraid: enable sending of TMFs from aac_hba_send()") allows aac_hba_send() to send scsi commands, and TMF requests, but the existing code only updates the iu_type for scsi commands. For TMF requests we are sending an unknown iu_type to firmware, which causes a fault. Include iu_type prior to determining the validity of the command Reported-by: Noah Misner <nmisner@us.ibm.com> Fixes: b60710ec7d7ab ("aacraid: enable sending of TMFs from aac_hba_send()") Fixes: 423400e64d377 ("aacraid: Include HBA direct interface") Tested-by: Noah Misner <nmisner@us.ibm.com> cc: stable@vger.kernel.org Signed-off-by: Dave Carroll <david.carroll@microsemi.com> Reviewed-by: Raghava Aditya Renukunta <RaghavaAditya.Renukunta@microsemi.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| * | | | scsi: vmw-pvscsi: return DID_BUS_BUSY for adapter-initated abortsJim Gill2018-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vmw_pvscsi driver returns DID_ABORT for commands aborted internally by the adapter, leading to the filesystem going read-only. Change the result to DID_BUS_BUSY, causing the kernel to retry the command. Signed-off-by: Jim Gill <jgill@vmware.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* | | | | Merge tag 'drm-fixes-for-v4.17-rc6-urgent' of ↵Linus Torvalds2018-05-151-0/+1
|\ \ \ \ \ | |_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://people.freedesktop.org/~airlied/linux Pull drm fix from Dave Airlie: "This fixes the mmap regression reported to me on irc by an i686 kernel user today, he's tested the fix works, and I've audited all the drm drivers for the bad mmap usage and since we use the mmap offset as a lookup in a table we aren't inclined to have anything bad in there" [ See commit be83bbf80682 ("mmap: introduce sane default mmap limits") for details and the note on why the GPU drivers were expected to be a special case. - Linus ] * tag 'drm-fixes-for-v4.17-rc6-urgent' of git://people.freedesktop.org/~airlied/linux: drm: set FMODE_UNSIGNED_OFFSET for drm files
| * | | | drm: set FMODE_UNSIGNED_OFFSET for drm filesDave Airlie2018-05-151-0/+1
|/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we have the ttm and gem vma managers using a subset of the file address space for objects, and these start at 0x100000000 they will overflow the new mmap checks. I've checked all the mmap routines I could see for any bad behaviour but overall most people use GEM/TTM VMA managers even the legacy drivers have a hashtable. Reported-and-Tested-by: Arthur Marsh (amarsh04 on #radeon) Fixes: be83bbf8068 (mmap: introduce sane default mmap limits) Signed-off-by: Dave Airlie <airlied@redhat.com>
* | | | Linux 4.17-rc5v4.17-rc5Linus Torvalds2018-05-141-1/+1
| | | |
* | | | Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds2018-05-1314-78/+153
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/pti updates from Thomas Gleixner: "A mixed bag of fixes and updates for the ghosts which are hunting us. The scheduler fixes have been pulled into that branch to avoid conflicts. - A set of fixes to address a khread_parkme() race which caused lost wakeups and loss of state. - A deadlock fix for stop_machine() solved by moving the wakeups outside of the stopper_lock held region. - A set of Spectre V1 array access restrictions. The possible problematic spots were discuvered by Dan Carpenters new checks in smatch. - Removal of an unused file which was forgotten when the rest of that functionality was removed" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Remove unused file perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map() perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_* perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[] sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[] sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[] sched/core: Introduce set_special_state() kthread, sched/wait: Fix kthread_parkme() completion issue kthread, sched/wait: Fix kthread_parkme() wait-loop sched/fair: Fix the update of blocked load when newly idle stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
| * | | | x86/vdso: Remove unused fileJann Horn2018-05-051-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") left this file behind; nothing is using it anymore. Signed-off-by: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: luto@amacapital.net Link: http://lkml.kernel.org/r/20180504175935.104085-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msrPeter Zijlstra2018-05-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > arch/x86/events/intel/cstate.c:307 cstate_pmu_event_init() warn: potential spectre issue 'pkg_msr' (local cap) Userspace controls @attr, sanitize cfg (attr->config) before using it to index an array. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driverPeter Zijlstra2018-05-051-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > arch/x86/events/msr.c:178 msr_event_init() warn: potential spectre issue 'msr' (local cap) Userspace controls @attr, sanitize cfg (attr->config) before using it to index an array. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()Peter Zijlstra2018-05-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > arch/x86/events/intel/cstate.c:307 cstate_pmu_event_init() warn: potential spectre issue 'pkg_msr' (local cap) > arch/x86/events/intel/core.c:337 intel_pmu_event_map() warn: potential spectre issue 'intel_perfmon_event_map' > arch/x86/events/intel/knc.c:122 knc_pmu_event_map() warn: potential spectre issue 'knc_perfmon_event_map' > arch/x86/events/intel/p4.c:722 p4_pmu_event_map() warn: potential spectre issue 'p4_general_events' > arch/x86/events/intel/p6.c:116 p6_pmu_event_map() warn: potential spectre issue 'p6_perfmon_event_map' > arch/x86/events/amd/core.c:132 amd_pmu_event_map() warn: potential spectre issue 'amd_perfmon_event_map' Userspace controls @attr, sanitize @attr->config before passing it on to x86_pmu::event_map(). Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*Peter Zijlstra2018-05-051-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > arch/x86/events/core.c:319 set_ext_hw_attr() warn: potential spectre issue 'hw_cache_event_ids[cache_type]' (local cap) > arch/x86/events/core.c:319 set_ext_hw_attr() warn: potential spectre issue 'hw_cache_event_ids' (local cap) > arch/x86/events/core.c:328 set_ext_hw_attr() warn: potential spectre issue 'hw_cache_extra_regs[cache_type]' (local cap) > arch/x86/events/core.c:328 set_ext_hw_attr() warn: potential spectre issue 'hw_cache_extra_regs' (local cap) Userspace controls @config which contains 3 (byte) fields used for a 3 dimensional array deref. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]Peter Zijlstra2018-05-051-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages' Userspace controls @pgoff through the fault address. Sanitize the array index before doing the array dereference. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]Peter Zijlstra2018-05-051-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight' Userspace controls @nice, sanitize the array index. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]Peter Zijlstra2018-05-051-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > kernel/sched/core.c:6921 cpu_weight_nice_write_s64() warn: potential spectre issue 'sched_prio_to_weight' Userspace controls @nice, so sanitize the value before using it to index an array. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/core: Introduce set_special_state()Peter Zijlstra2018-05-044-24/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Gaurav reported a perceived problem with TASK_PARKED, which turned out to be a broken wait-loop pattern in __kthread_parkme(), but the reported issue can (and does) in fact happen for states that do not do condition based sleeps. When the 'current->state = TASK_RUNNING' store of a previous (concurrent) try_to_wake_up() collides with the setting of a 'special' sleep state, we can loose the sleep state. Normal condition based wait-loops are immune to this problem, but for sleep states that are not condition based are subject to this problem. There already is a fix for TASK_DEAD. Abstract that and also apply it to TASK_STOPPED and TASK_TRACED, both of which are also without condition based wait-loop. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | kthread, sched/wait: Fix kthread_parkme() completion issuePeter Zijlstra2018-05-033-35/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even with the wait-loop fixed, there is a further issue with kthread_parkme(). Upon hotplug, when we do takedown_cpu(), smpboot_park_threads() can return before all those threads are in fact blocked, due to the placement of the complete() in __kthread_parkme(). When that happens, sched_cpu_dying() -> migrate_tasks() can end up migrating such a still runnable task onto another CPU. Normally the task will have hit schedule() and gone to sleep by the time we do kthread_unpark(), which will then do __kthread_bind() to re-bind the task to the correct CPU. However, when we loose the initial TASK_PARKED store to the concurrent wakeup issue described previously, do the complete(), get migrated, it is possible to either: - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate the park and set TASK_RUNNING, or - __kthread_bind()'s wait_task_inactive() to observe the competing TASK_RUNNING store. Either way the WARN() in __kthread_bind() will trigger and fail to correctly set the CPU affinity. Fix this by only issuing the complete() when the kthread has scheduled out. This does away with all the icky 'still running' nonsense. The alternative is to promote TASK_PARKED to a special state, this guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING and we'll end up doing the right thing, but this preserves the whole icky business of potentially migating the still runnable thing. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | kthread, sched/wait: Fix kthread_parkme() wait-loopPeter Zijlstra2018-05-031-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Gaurav reported a problem with __kthread_parkme() where a concurrent try_to_wake_up() could result in competing stores to ->state which, when the TASK_PARKED store got lost bad things would happen. The comment near set_current_state() actually mentions this competing store, but only mentions the case against TASK_RUNNING. This same store, with different timing, can happen against a subsequent !RUNNING store. This normally is not a problem, because as per that same comment, the !RUNNING state store is inside a condition based wait-loop: for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (!need_sleep) break; schedule(); } __set_current_state(TASK_RUNNING); If we loose the (first) TASK_UNINTERRUPTIBLE store to a previous (concurrent) wakeup, the schedule() will NO-OP and we'll go around the loop once more. The problem here is that the TASK_PARKED store is not inside the KTHREAD_SHOULD_PARK condition wait-loop. There is a genuine issue with sleeps that do not have a condition; this is addressed in a subsequent patch. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/fair: Fix the update of blocked load when newly idleVincent Guittot2018-05-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit: 31e77c93e432 ("sched/fair: Update blocked load when newly idle") ... we release the rq->lock when updating blocked load of idle CPUs. This opens a time window during which another CPU can add a task to this CPU's cfs_rq. The check for newly added task of idle_balance() is not in the common path. Move the out label to include this check. Reported-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle") Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlockPeter Zijlstra2018-05-031-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Matt reported the following deadlock: CPU0 CPU1 schedule(.prev=migrate/0) <fault> pick_next_task() ... idle_balance() migrate_swap() active_balance() stop_two_cpus() spin_lock(stopper0->lock) spin_lock(stopper1->lock) ttwu(migrate/0) smp_cond_load_acquire() -- waits for schedule() stop_one_cpu(1) spin_lock(stopper1->lock) -- waits for stopper lock Fix this deadlock by taking the wakeups out from under stopper->lock. This allows the active_balance() to queue the stop work and finish the context switch, which in turn allows the wakeup from migrate_swap() to observe the context and complete the wakeup. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180420095005.GH4064@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2018-05-131-56/+1
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "Revert the new NUMA aware placement approach which turned out to create more problems than it solved" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
| * | | | | Revert "sched/numa: Delay retrying placement for automatic NUMA balance ↵Mel Gorman2018-05-121-56/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | after wake_affine()" This reverts commit 7347fc87dfe6b7315e74310ee1243dc222c68086. Srikar Dronamra pointed out that while the commit in question did show a performance improvement on ppc64, it did so at the cost of disabling active CPU migration by automatic NUMA balancing which was not the intent. The issue was that a serious flaw in the logic failed to ever active balance if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled, the logic is still bizarre and against the original intent. Investigation showed that fixing the patch in either the way he suggested, using the correct comparison for jiffies values or introducing a new numa_migrate_deferred variable in task_struct all perform similarly to a revert with a mix of gains and losses depending on the workload, machine and socket count. The original intent of the commit was to handle a problem whereby wake_affine, idle balancing and automatic NUMA balancing disagree on the appropriate placement for a task. This was particularly true for cases where a single task was a massive waker of tasks but where wake_wide logic did not apply. This was particularly noticeable when a futex (a barrier) woke all worker threads and tried pulling the wakees to the waker nodes. In that specific case, it could be handled by tuning MPI or openMP appropriately, but the behavior is not illogical and was worth attempting to fix. However, the approach was wrong. Given that we're at rc4 and a fix is not obvious, it's better to play safe, revert this commit and retry later. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: ggherdovich@suse.cz Cc: hpa@zytor.com Cc: matt@codeblueprint.co.uk Cc: mpe@ellerman.id.au Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>