path: kernel/workqueue.c
* lockdep: fix oops in processing workqueue (Peter Zijlstra, 2012-05-15; 1 file, -1/+3)

  Under memory load, on x86_64, with lockdep enabled, the workqueue's
  process_one_work() has been seen to oops in __lock_acquire(), barfing
  on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].

  Because it's permissible to free a work_struct from its callout
  function, the map used is an onstack copy of the map given in the
  work_struct: and that copy is made without any locking.

  Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
  "rep movsq" for that structure copy: which might race with a workqueue
  user's wait_on_work() doing lock_map_acquire() on the source of the
  copy, putting a pointer into the class_cache[], but only in time for
  the top half of that pointer to be copied to the destination map.
  Boom when process_one_work() subsequently does lock_map_acquire() on
  its onstack copy of the lockdep_map.

  Fix this, and a similar instance in call_timer_fn(), with a
  lockdep_copy_map() function which additionally NULLs the
  class_cache[].

  Note: this oops was actually seen on 3.4-next, where flush_work()
  newly does the racing lock_map_acquire(); but Tejun points out that
  3.4 and earlier are already vulnerable to the same through
  wait_on_work().

  * Patch originally from Peter.  Hugh modified it a bit and wrote the
    description.

  Signed-off-by: Peter Zijlstra <peterz@infradead.org>
  Reported-by: Hugh Dickins <hughd@google.com>
  LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
  Signed-off-by: Tejun Heo <tj@kernel.org>
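  A minimal sketch of the copy helper described above, close to what was
  merged into include/linux/lockdep.h (shown for illustration):

      /* Copy a lockdep_map for use on the local stack, clearing the
       * class cache so a torn pointer can never be observed. */
      static inline void lockdep_copy_map(struct lockdep_map *d,
                                          struct lockdep_map *s)
      {
              int i;

              *d = *s;
              /*
               * class_cache[] can be written concurrently; with a
               * 32-bit-at-a-time copy we could observe half a pointer,
               * so clear the cache and take the lookup hit instead.
               */
              for (i = 0; i < NR_LOCKDEP_CACHING_CLASSES; i++)
                      d->class_cache[i] = NULL;
      }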
* workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active (Tejun Heo, 2012-05-15; 1 file, -2/+7)

  worker_enter_idle() has a WARN_ON_ONCE() which triggers if nr_running
  isn't zero when every worker is idle.  This can trigger spuriously
  while a cpu is going down due to the way the trustee sets
  %WORKER_ROGUE and zaps nr_running.

  It first sets %WORKER_ROGUE on all workers without updating
  nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and
  then zaps nr_running.  If the last running worker enters idle in
  between, it would see stale nr_running which hasn't been zapped yet
  and trigger the WARN_ON_ONCE().

  Fix it by performing the sanity check iff the trustee is idle.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
  Cc: stable@vger.kernel.org
* workqueue: Catch more locking problems with flush_work() (Stephen Boyd, 2012-04-23; 1 file, -0/+3)

  If a workqueue is flushed with flush_work(), lockdep checking can be
  circumvented.  For example:

      static DEFINE_MUTEX(mutex);

      static void my_work(struct work_struct *w)
      {
              mutex_lock(&mutex);
              mutex_unlock(&mutex);
      }

      static DECLARE_WORK(work, my_work);

      static int __init start_test_module(void)
      {
              schedule_work(&work);
              return 0;
      }
      module_init(start_test_module);

      static void __exit stop_test_module(void)
      {
              mutex_lock(&mutex);
              flush_work(&work);
              mutex_unlock(&mutex);
      }
      module_exit(stop_test_module);

  would not always print a warning when flush_work() was called.  In
  this trivial example nothing could go wrong since we are guaranteed
  module_init() and module_exit() don't run concurrently, but if the
  work item is scheduled asynchronously we could have a scenario where
  the work item is running just at the time flush_work() is called,
  resulting in a classic ABBA locking problem.

  Add a lockdep hint by acquiring and releasing the work item
  lockdep_map in flush_work() so that we always catch this potential
  deadlock scenario.

  Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
  Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
  Signed-off-by: Tejun Heo <tj@kernel.org>
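  The hint amounts to a no-op acquire/release pair at the top of
  flush_work(), roughly (a sketch of the merged change):

      bool flush_work(struct work_struct *work)
      {
              /*
               * Tell lockdep we "take" the work's pseudo-lock before
               * flushing, so flushing under a lock that the work item
               * itself takes is reported even when the work isn't
               * actually running at that moment.
               */
              lock_map_acquire(&work->lockdep_map);
              lock_map_release(&work->lockdep_map);

              /* ... existing flush logic follows unchanged ... */
      }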
* workqueue: change BUG_ON() to WARN_ON() (Dan Carpenter, 2012-04-16; 1 file, -1/+4)

  This BUG_ON() can be triggered if you call schedule_work() before
  calling INIT_WORK().  It is definitely a bug, but it's nicer to just
  print a stack trace and return.

  Reported-by: Matt Renzelmann <mjr@cs.wisc.edu>
  Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
  Signed-off-by: Tejun Heo <tj@kernel.org>
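  The shape of the change in the queueing path, sketched as a diff
  (illustrative; the exact context in __queue_work() may differ):

      -       BUG_ON(!list_empty(&work->entry));
      +       if (WARN_ON(!list_empty(&work->entry))) {
      +               spin_unlock_irqrestore(&gcwq->lock, flags);
      +               return;
      +       }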
* Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq (Linus Torvalds, 2012-03-21; 1 file, -19/+3)

  Pull workqueue changes from Tejun Heo: "This contains only one commit
  which cleans up UP allocation path a bit."

  * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
      workqueue: use percpu allocator for cwq on UP
  * workqueue: use percpu allocator for cwq on UP (Lai Jiangshan, 2012-03-12; 1 file, -19/+3)

    I noticed that commit bbddff makes the percpu allocator work on UP,
    so we don't need the magic way for UP.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
* Block: use a freezable workqueue for disk-event polling (Alan Stern, 2012-03-02; 1 file, -1/+6)

  This patch (as1519) fixes a bug in the block layer's disk-events
  polling.  The polling is done by a work routine queued on the
  system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
  polling continues even in the middle of a system sleep transition.

  Obviously, polling a suspended drive for media changes and such isn't
  a good thing to do; in the case of USB mass-storage devices it can
  lead to real problems requiring device resets and even re-enumeration.

  The patch fixes things by creating a new system-wide, non-reentrant,
  freezable workqueue and using it for disk-events polling.

  Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
  CC: <stable@kernel.org>
  Acked-by: Tejun Heo <tj@kernel.org>
  Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
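  Creating such a workqueue in init_workqueues() looks roughly like this
  (a sketch; the system_nrt_freezable_wq name is assumed from the
  description above):

      system_nrt_freezable_wq = alloc_workqueue("events_nrt_freezable",
                                                WQ_NON_REENTRANT |
                                                WQ_FREEZABLE, 0);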
* workqueue: make alloc_workqueue() take printf fmt and args for name (Tejun Heo, 2012-01-11; 1 file, -10/+22)

  alloc_workqueue() currently expects the passed-in @name pointer to
  remain accessible.  This is inconvenient and a bit silly given that
  the whole wq is being dynamically allocated.  This patch updates
  alloc_workqueue() and friends to take a printf format string with
  matching varargs at the end instead of an opaque string.  The name is
  allocated together with the wq and formatted.

  alloc_ordered_workqueue() is converted to a macro to unify varargs
  handling with alloc_workqueue(), and, while at it, a comment is added
  to alloc_workqueue().

  None of the current in-kernel users pass in a string with '%' as a
  constant name, so this change shouldn't cause any problem.

  [akpm@linux-foundation.org: use __printf]
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Suggested-by: Christoph Hellwig <hch@infradead.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
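  After this change a caller can fold an instance id into the name
  directly; for example (hypothetical driver code, shown only to
  illustrate the new signature):

      struct workqueue_struct *wq;

      /* the name is formatted and stored with the wq itself */
      wq = alloc_workqueue("mydrv/%s", WQ_MEM_RECLAIM, 1,
                           dev_name(&pdev->dev));
      if (!wq)
              return -ENOMEM;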
* kernel: Map most files to use export.h instead of module.h (Paul Gortmaker, 2011-10-31; 1 file, -1/+1)

  The changed files were only including linux/module.h for the
  EXPORT_SYMBOL infrastructure, and nothing else.  Revector them onto
  the isolated export header for faster compile times.

  Nothing to see here but a whole lot of instances of:

      -#include <linux/module.h>
      +#include <linux/export.h>

  This commit is only changing the kernel dir; next targets will
  probably be mm, fs, the arch dirs, etc.

  Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* workqueue: lock cwq access in drain_workqueue (Thomas Tuttle, 2011-09-15; 1 file, -1/+6)

  Take cwq->gcwq->lock to avoid racing between drain_workqueue checking
  to make sure the workqueues are empty and cwq_dec_nr_in_flight
  decrementing and then incrementing nr_active when it activates a
  delayed work.

  We discovered this when a corner case in one of our drivers resulted
  in us trying to destroy a workqueue in which the remaining work would
  always requeue itself again in the same workqueue.  We would hit this
  race condition and trip the BUG_ON on workqueue.c:3080.

  Signed-off-by: Thomas Tuttle <ttuttle@chromium.org>
  Acked-by: Tejun Heo <tj@kernel.org>
  Cc: <stable@kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq (Linus Torvalds, 2011-07-23; 1 file, -28/+53)

  * 'for-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
      workqueue: separate out drain_workqueue() from destroy_workqueue()
      workqueue: remove cancel_rearming_delayed_work[queue]()
  * workqueue: separate out drain_workqueue() from destroy_workqueue() (Tejun Heo, 2011-05-20; 1 file, -28/+53)

    There are users which want to drain workqueues without destroying
    them.  Separate out the drain functionality from destroy_workqueue()
    into drain_workqueue() and make it accessible to workqueue users.

    To guarantee forward progress, only chain queueing is allowed while
    drain is in progress.  If a new work item which isn't chained from
    the running or pending work items is queued while draining is in
    progress, WARN_ON_ONCE() is triggered.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
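    Typical use, sketched (mydev->wq is an illustrative field name):

        /* quiesce: wait until every queued and chained work item has
         * finished, but keep the workqueue usable afterwards */
        drain_workqueue(mydev->wq);
        /* the wq is now empty but still alive, unlike after
         * destroy_workqueue(mydev->wq) */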
* Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu (Linus Torvalds, 2011-05-24; 1 file, -3/+1)

  * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
      percpu: Unify input section names
      percpu: Avoid extra NOP in percpu_cmpxchg16b_double
      percpu: Cast away printk format warning
      percpu: Always align percpu output section to PAGE_SIZE

  Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per
  Tejun.
  * percpu: Always align percpu output section to PAGE_SIZE (Tejun Heo, 2011-03-24; 1 file, -3/+1)

    The percpu allocator honors alignment requests up to PAGE_SIZE, and
    both the percpu addresses in the percpu address space and the
    translated kernel addresses should be aligned accordingly.  The
    calculation of the former depends on the alignment of the percpu
    output section in the kernel image.

    The linker script macros PERCPU_VADDR() and PERCPU() are used to
    define this output section, and the latter takes an @align
    parameter.  Several architectures are using @align smaller than
    PAGE_SIZE, breaking percpu memory alignment.

    This patch removes the @align parameter from PERCPU(), renames it to
    PERCPU_SECTION() and makes it always align to PAGE_SIZE.  While at
    it, add PCPU_SETUP_BUG_ON() checks such that alignment problems are
    reliably detected, and remove the percpu alignment comment recently
    added in workqueue.c, as the condition would trigger BUG way before
    reaching there.

    For um, this patch raises the alignment of the percpu area.  As the
    area is in .init, there shouldn't be any noticeable difference.

    This problem was discovered by David Howells while debugging boot
    failure on mn10300.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Mike Frysinger <vapier@gentoo.org>
    Cc: uclinux-dist-devel@blackfin.uclinux.org
    Cc: David Howells <dhowells@redhat.com>
    Cc: Jeff Dike <jdike@addtoit.com>
    Cc: user-mode-linux-devel@lists.sourceforge.net
* workqueue: fix deadlock in worker_maybe_bind_and_lock() (Tejun Heo, 2011-04-29; 1 file, -1/+7)

  If a rescuer and stop_machine() bringing down a CPU race with each
  other, they may deadlock on a non-preemptive kernel.  The CPU won't
  accept a new task, so the rescuer can't migrate to the target CPU,
  while stop_machine() can't proceed because the rescuer is holding one
  of the CPUs, retrying migration.  GCWQ_DISASSOCIATED is never cleared
  and worker_maybe_bind_and_lock() retries indefinitely.

  This problem can be reproduced semi-reliably while the system is
  entering suspend.

      http://thread.gmane.org/gmane.linux.kernel/1122051

  A lot of kudos to Thilo-Alexander for reporting this tricky issue and
  painstaking testing.

  stable: This affects all kernels with cmwq, so all kernels since and
  including v2.6.36 need this fix.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
  Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
  Cc: stable@kernel.org
* kthread: use kthread_create_on_node() (Eric Dumazet, 2011-03-23; 1 file, -2/+4)

  ksoftirqd, kworker, migration, and pktgend kthreads can be created
  with kthread_create_on_node(), to get proper NUMA affinities for
  their stack and task_struct.

  Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
  Acked-by: David S. Miller <davem@davemloft.net>
  Reviewed-by: Andi Kleen <ak@linux.intel.com>
  Acked-by: Rusty Russell <rusty@rustcorp.com.au>
  Acked-by: Tejun Heo <tj@kernel.org>
  Cc: Tony Luck <tony.luck@intel.com>
  Cc: Fenghua Yu <fenghua.yu@intel.com>
  Cc: David Howells <dhowells@redhat.com>
  Cc: <linux-arch@vger.kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
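  For workqueue.c this means create_worker() spawns its kthread on the
  gcwq's node, roughly (a sketch of the converted call):

      worker->task = kthread_create_on_node(worker_thread, worker,
                                            cpu_to_node(gcwq->cpu),
                                            "kworker/%u:%d",
                                            gcwq->cpu, id);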
* Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq (Linus Torvalds, 2011-03-16; 1 file, -1/+5)

  * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
      workqueue: fix build failure introduced by s/freezeable/freezable/
      workqueue: add system_freezeable_wq
      rds/ib: use system_wq instead of rds_ib_fmr_wq
      net/9p: replace p9_poll_task with a work
      net/9p: use system_wq instead of p9_mux_wq
      xfs: convert to alloc_workqueue()
      reiserfs: make commit_wq use the default concurrency level
      ocfs2: use system_wq instead of ocfs2_quota_wq
      ext4: convert to alloc_workqueue()
      scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
      scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
      misc/iwmc3200top: use system_wq instead of dedicated workqueues
      i2o: use alloc_workqueue() instead of create_workqueue()
      acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
      fs/aio: aio_wq isn't used in memory reclaim path
      input/tps6507x-ts: use system_wq instead of dedicated workqueue
      cpufreq: use system_wq instead of dedicated workqueues
      wireless/ipw2x00: use system_wq instead of dedicated workqueues
      arm/omap: use system_wq in mailbox
      workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
  * workqueue: fix build failure introduced by s/freezeable/freezable/ (Tejun Heo, 2011-02-21; 1 file, -5/+5)

    wq:fixes-2.6.38 does s/WQ_FREEZEABLE/WQ_FREEZABLE and wq:for-2.6.39
    adds new usage of the flag.  The combination of the two creates a
    build failure after merge.  Fix it by renaming all freezeables to
    freezables.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * Merge branch 'master' into for-2.6.39 (Tejun Heo, 2011-02-21; 1 file, -13/+24)
  * workqueue: add system_freezeable_wq (Tejun Heo, 2011-02-09; 1 file, -1/+5)

    Add a system-wide freezeable workqueue.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
* debugobjects: Add hint for better object identification (Stanislaw Gruszka, 2011-03-08; 1 file, -0/+6)

  In complex subsystems like mac80211, structures can contain several
  timers and work structs, so identifying a specific instance from the
  call trace and object type output of debugobjects can be hard.

  Allow the subsystems which support debugobjects to provide a hint
  function.  This function returns a pointer to a kernel address
  (preferably the object's callback function) which is printed along
  with the debugobjects type.

  Add hint methods for timer_list, work_struct and hrtimer.

  [ tglx: Massaged changelog, made it compile ]

  Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
  LKML-Reference: <20110307085809.GA9334@redhat.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
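  For work_struct the hint is simply the work's callback; a sketch of
  the hook added to kernel/workqueue.c:

      static void *work_debug_hint(void *addr)
      {
              /* printed by debugobjects next to the object type */
              return ((struct work_struct *)addr)->func;
      }

      static struct debug_obj_descr work_debug_descr = {
              .name           = "work_struct",
              .debug_hint     = work_debug_hint,
              /* existing fixup callbacks unchanged */
      };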
* workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long (Tejun Heo, 2011-02-16; 1 file, -1/+3)

  MAYDAY_INITIAL_TIMEOUT is defined as HZ / 100 and depending on
  configuration may end up 0 or 1.  Even when it's 1, depending on when
  the mayday timer is added in the current jiffy interval, it may
  expire way before a jiffy has passed.

  Make sure MAYDAY_INITIAL_TIMEOUT is at least two to guarantee that at
  least a full jiffy has passed before calling rescuers.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: Ray Jui <rjui@broadcom.com>
  Cc: stable@kernel.org
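  The resulting constant, roughly as merged (a sketch):

      MAYDAY_INITIAL_TIMEOUT  = HZ / 100 >= 2 ? HZ / 100 : 2,
                                /* call for help after 10ms
                                   (min two ticks) */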
* workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable' (Tejun Heo, 2011-02-16; 1 file, -12/+12)

  There are two spellings in use for 'freeze' + 'able' - 'freezable'
  and 'freezeable'.  The former is the more prominent one.  The latter
  is mostly used by workqueue and in a few other odd places.  Unify the
  spelling to 'freezable'.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: Alan Stern <stern@rowland.harvard.edu>
  Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
  Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
  Acked-by: Dmitry Torokhov <dtor@mail.ru>
  Cc: David Woodhouse <dwmw2@infradead.org>
  Cc: Alex Dubov <oakad@yahoo.com>
  Cc: "David S. Miller" <davem@davemloft.net>
  Cc: Steven Whitehouse <swhiteho@redhat.com>
* workqueue: wake up a worker when a rescuer is leaving a gcwq (Tejun Heo, 2011-02-14; 1 file, -0/+9)

  After executing the matching works, a rescuer leaves the gcwq whether
  there are more pending works or not.  This may decrease the
  concurrency level to zero and stall execution until a new work item
  is queued on the gcwq.

  Make the rescuer wake up a regular worker when it leaves a gcwq if
  there are more works to execute, so that execution isn't stalled.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: Ray Jui <rjui@broadcom.com>
  Cc: stable@kernel.org
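  The fix is a short check in rescuer_thread() before it drops
  gcwq->lock, roughly (a sketch):

      /*
       * Leaving this gcwq: if work is still pending, kick a regular
       * worker so concurrency doesn't drop to zero.
       */
      if (keep_working(gcwq))
              wake_up_worker(gcwq);

      spin_unlock_irq(&gcwq->lock);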
* workqueue: note the nested NOT_RUNNING test in worker_clr_flags() isn't a noop (Tejun Heo, 2011-01-11; 1 file, -1/+5)

  The nested NOT_RUNNING test in worker_clr_flags() is slightly
  misleading in that if NOT_RUNNING were a single flag the nested test
  would always be %true and thus a noop.  Add a comment noting that the
  test isn't a noop.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Hillf Danton <dhillf@gmail.com>
  Cc: Andrew Morton <akpm@linux-foundation.org>
* workqueue: relax lockdep annotation on flush_work() (Tejun Heo, 2011-01-11; 1 file, -2/+12)

  Currently, the lockdep annotation in flush_work() requires exclusive
  access on the workqueue the target work is queued on, and triggers a
  warning if a work is trying to flush another work on the same
  workqueue; however, this is no longer true as workqueues can now
  execute multiple works concurrently.

  This patch adds lock_map_acquire_read() and makes process_one_work()
  hold read access to the workqueue while executing a work, and
  start_flush_work() check for write access if the concurrency level is
  one or the workqueue has a rescuer (as only one execution resource -
  the rescuer - is guaranteed to be available under memory pressure),
  and read access if higher.

  This better represents what's going on and removes spurious lockdep
  warnings which are triggered by a fake dependency chain created
  through flush_work().

  * Peter pointed out that flushing another work from a WQ_MEM_RECLAIM
    wq breaks the forward-progress guarantee under memory pressure.
    Condition check updated accordingly.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
  Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: stable@kernel.org
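  In start_flush_work() the annotation choice comes down to one branch,
  roughly (a sketch):

      /*
       * A single execution resource (max_active == 1, or only the
       * rescuer under memory pressure) behaves like an exclusive
       * lock; otherwise concurrent execution is possible and read
       * access suffices.
       */
      if (cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER)
              lock_map_acquire(&cwq->wq->lockdep_map);
      else
              lock_map_acquire_read(&cwq->wq->lockdep_map);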
* workqueue: allow chained queueing during destruction (Tejun Heo, 2010-12-20; 1 file, -1/+59)

  Currently, destroy_workqueue() makes the workqueue deny all new
  queueing by setting WQ_DYING and flushes the workqueue once before
  proceeding with destruction; however, there are cases where work
  items queue more related work items.  Currently, such users need to
  explicitly flush the workqueue multiple times depending on the
  possible depth of such chained queueing.

  This patch updates the queueing path such that a work item can queue
  further work items on the same workqueue even when WQ_DYING is set.
  The flush on destruction is automatically retried until the workqueue
  is empty.  This guarantees that the workqueue is empty on destruction
  while allowing chained queueing.

  The flush retry logic whines if it takes too many retries to drain
  the workqueue.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
* workqueue: It is likely that WORKER_NOT_RUNNING is true (Steven Rostedt, 2010-12-14; 1 file, -2/+2)

  Running the annotate branch profiler on three boxes, including my
  main box that runs firefox, evolution, xchat, and is part of the
  distcc farm, showed this with the likelys in the workqueue code:

      correct incorrect  %        Function             File         Line
      ------- ---------  -        --------             ----         ----
           96    996253 99 wq_worker_sleeping    workqueue.c    703
           96    996247 99 wq_worker_waking_up   workqueue.c    677

  The likely()s in this case were assuming that WORKER_NOT_RUNNING will
  most likely be false.  But this is not the case.  The reason is (and
  shown by adding trace_printks and testing it) that most of the time
  WORKER_PREP is set.

  In worker_thread() we have:

      worker_clr_flags(worker, WORKER_PREP);

      [ do work stuff ]

      worker_set_flags(worker, WORKER_PREP, false);

  (that 'false' means not to wake up an idle worker)

  The wq_worker_sleeping() is called from schedule when a worker thread
  is putting itself to sleep, which happens most of the time outside of
  that [ do work stuff ].  The wq_worker_waking_up is called by the
  wakeup worker code, which is also called outside that [ do work
  stuff ].

  Thus, the likely and unlikely used by those two functions are
  actually backwards.  Remove the annotation and let gcc figure it out.

  Acked-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: check the allocation of system_unbound_wq (Hitoshi Mitake, 2010-11-26; 1 file, -1/+2)

  I found a trivial bug in the initialization of workqueues.  The
  current init_workqueues() doesn't check the result of the allocation
  of system_unbound_wq; this should be checked like the other queues.

  Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
  Cc: Arjan van de Ven <arjan@linux.intel.com>
  Cc: David Howells <dhowells@redhat.com>
  Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueues: s/ON_STACK/ONSTACK/ (Andrew Morton, 2010-10-27; 1 file, -1/+1)

  Silly though it is, completions and wait_queue_heads use foo_ONSTACK
  (COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK,
  __WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK),
  so I guess workqueues should do the same thing.

      s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/
      s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/

  Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Acked-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* MN10300: Fix the PERCPU() alignment to allow for workqueues (David Howells, 2010-10-26; 1 file, -1/+3)

  In the MN10300 arch, we occasionally see an assertion being tripped
  in alloc_cwqs() at the following line:

      /* just in case, make sure it's actually aligned */
      ---> BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
      return wq->cpu_wq.v ? 0 : -ENOMEM;

  The values are:

      wq->cpu_wq.v => 0x902776e0
      align        => 0x100

  and align is calculated by the following:

      const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
                                 __alignof__(unsigned long long));

  This is because the pointer in question (wq->cpu_wq.v) loses some of
  its lower bits to control flags, and so the object it points to must
  be sufficiently aligned to avoid the need to use those bits for
  pointing to things.

  Currently, 4 control bits and 4 colour bits are used in normal
  circumstances, plus a debugging bit if debugging is set.  This
  requires the cpu_workqueue_struct struct to be at least 256 bytes
  aligned (or 512 bytes aligned with debugging).

  PERCPU() alignment on MN10300, however, is only 32 bytes as set in
  vmlinux.lds.S.  So we set this to PAGE_SIZE (4096) to match most
  other arches and stick a comment in alloc_cwqs() for anyone else who
  triggers the assertion.

  Reported-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
  Signed-off-by: David Howells <dhowells@redhat.com>
  Acked-by: Mark Salter <msalter@redhat.com>
  Cc: Tejun Heo <tj@kernel.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* workqueue: remove in_workqueue_context() (Tejun Heo, 2010-10-19; 1 file, -15/+0)

  Commit a25909a4 (lockdep: Add an in_workqueue_context() lockdep-based
  test function) added in_workqueue_context(), but there hasn't been
  any in-kernel user, and the lockdep annotation in workqueue is
  scheduled to change.  Remove the unused function.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* workqueue: Clarify that schedule_on_each_cpu is synchronous (Tejun Heo, 2010-10-19; 1 file, -4/+6)

  The documentation for schedule_on_each_cpu() states that it calls a
  function on each online CPU from keventd.  This can easily be
  interpreted as an asynchronous call because the description does not
  mention that flush_work is called.  Clarify that it is synchronous.

  tj: rephrased a bit

  Signed-off-by: Mel Gorman <mel@csn.ul.ie>
  Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: add and use WQ_MEM_RECLAIM flag (Tejun Heo, 2010-10-11; 1 file, -0/+7)

  Add a WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark
  WQ_RESCUER as internal and replace all external WQ_RESCUER usages
  with WQ_MEM_RECLAIM.

  This makes the API users express the intent of the workqueue instead
  of indicating the internal mechanism used to guarantee forward
  progress.  This is also to make it cleaner to add more semantics to
  WQ_MEM_RECLAIM.  For example, if deemed necessary, memory reclaim
  workqueues can be made highpri.

  This patch doesn't introduce any functional change.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Jeff Garzik <jgarzik@pobox.com>
  Cc: Dave Chinner <david@fromorbit.com>
  Cc: Steven Whitehouse <swhiteho@redhat.com>
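  Callers now state intent rather than mechanism; for example
  (hypothetical filesystem workqueue, to illustrate the flag):

      /* this wq sits on the memory-reclaim path, so it must be able
       * to make forward progress under memory pressure */
      wq = alloc_workqueue("myfs_io", WQ_MEM_RECLAIM, 1);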
* workqueue: fix HIGHPRI handling in keep_working() (Tejun Heo, 2010-10-11; 1 file, -1/+3)

  The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING
  and could return %false with highpri work pending.  This could lead
  to late execution of a highpri work which was delayed due to
  @max_active throttling if other works are actively consuming CPU
  cycles.

  For example, the following could happen.

  1. Work W0 which burns CPU cycles.

  2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1.

  3. W1 starts executing and W2 is put to delayed queue.  W0 and W1 are
     both runnable.

  4. W1 finishes which puts W2 to pending queue but keep_working()
     incorrectly returns %false and the worker goes to sleep.

  5. W0 finishes and W2 starts execution.

  With this patch applied, W2 starts execution as soon as W1 finishes.

  Signed-off-by: Tejun Heo <tj@kernel.org>
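  The fixed predicate, roughly as merged (a sketch):

      static bool keep_working(struct global_cwq *gcwq)
      {
              atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);

              /* keep going if work is pending and either concurrency
               * is low or highpri work is waiting */
              return !list_empty(&gcwq->worklist) &&
                      (atomic_read(nr_running) <= 1 ||
                       gcwq->flags & GCWQ_HIGHPRI_PENDING);
      }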
* workqueue: add queue_work and activate_work trace points (Tejun Heo, 2010-10-05; 1 file, -0/+3)

  These two tracepoints allow tracking when and how a work is queued
  and activated.  This patch is based on Frederic's patch to add a
  queue_work trace point.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Frederic Weisbecker <fweisbec@gmail.com>
* workqueue: prepare for more tracepoints (Tejun Heo, 2010-10-05; 1 file, -3/+3)

  Define a workqueue_work event class and use it for the
  workqueue_execute_end trace point.  Also, move the
  trace/events/workqueue.h include downwards such that all struct
  definitions are visible to it.  This is to prepare for more
  tracepoints and doesn't cause any functional change.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Frederic Weisbecker <fweisbec@gmail.com>
* workqueue: implement flush[_delayed]_work_sync() (Tejun Heo, 2010-09-19; 1 file, -0/+56)

  Implement flush[_delayed]_work_sync().  These are flush functions
  which also make sure no CPU is still executing the target work from
  earlier queueing instances.  These are similar to
  cancel[_delayed]_work_sync() except that the target work item is
  flushed instead of cancelled.

  Signed-off-by: Tejun Heo <tj@kernel.org>
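  Usage mirrors the cancel variants (dev->... fields are illustrative):

      /* wait for the last queued instance *and* any instance still
       * running on another CPU from an earlier queueing */
      flush_work_sync(&dev->reset_work);
      flush_delayed_work_sync(&dev->poll_work);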
* workqueue: factor out start_flush_work() (Tejun Heo, 2010-09-19; 1 file, -27/+37)

  Factor out start_flush_work() from flush_work().  start_flush_work()
  has a @wait_executing argument which controls whether the barrier is
  queued only if the work is pending or also if executing.  As
  flush_work() needs to wait for execution too, it uses %true.

  This commit doesn't cause any behavior difference.  start_flush_work()
  will be used to implement flush_work_sync().

  Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: cleanup flush/cancel functions (Tejun Heo, 2010-09-19; 1 file, -81/+94)

  Make the following cleanup changes.

  * Relocate flush/cancel function prototypes and definitions.

  * Relocate wait_on_cpu_work() and wait_on_work() before
    try_to_grab_pending().  These will be used to implement
    flush_work_sync().

  * Make all flush/cancel functions return bool instead of int.

  * Update wait_on_cpu_work() and wait_on_work() to return %true if
    they actually waited.

  * Add / update comments.

  This patch doesn't cause any functional changes.

  Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: add documentation (Tejun Heo, 2010-09-13; 1 file, -10/+17)

  Update the copyright notice and add Documentation/workqueue.txt.

  Randy Dunlap, Dave Chinner: misc fixes.

  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reviewed-By: Florian Mickler <florian@mickler.org>
  Cc: Ingo Molnar <mingo@redhat.com>
  Cc: Christoph Lameter <cl@linux-foundation.org>
  Cc: Randy Dunlap <randy.dunlap@oracle.com>
  Cc: Dave Chinner <david@fromorbit.com>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq (Linus Torvalds, 2010-09-07; 1 file, -15/+38)

  * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
      workqueue: use zalloc_cpumask_var() for gcwq->mayday_mask
      workqueue: fix GCWQ_DISASSOCIATED initialization
      workqueue: Add a workqueue chapter to the tracepoint docbook
      workqueue: fix cwq->nr_active underflow
      workqueue: improve destroy_workqueue() debuggability
      workqueue: mark lock acquisition on worker_maybe_bind_and_lock()
      workqueue: annotate lock context change
      workqueue: free rescuer on destroy_workqueue
  * workqueue: use zalloc_cpumask_var() for gcwq->mayday_mask (Tejun Heo, 2010-08-31; 1 file, -1/+1)

    alloc_mayday_mask() was using alloc_cpumask_var(), making
    gcwq->mayday_mask contain garbage after initialization on
    CONFIG_CPUMASK_OFFSTACK=y configurations.  This, combined with the
    previously fixed GCWQ_DISASSOCIATED initialization bug, could make
    rescuers fall into an infinite loop trying to bind to an offline
    cpu.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: CAI Qian <caiqian@redhat.com>
  * workqueue: fix GCWQ_DISASSOCIATED initialization (Tejun Heo, 2010-08-31; 1 file, -2/+3)

    init_workqueues() incorrectly marks workqueues for all possible
    CPUs as associated.  Combined with the mayday_mask initialization
    bug, this can make rescuers keep trying to bind to an offline gcwq
    indefinitely.  Fix init_workqueues() such that only online CPUs
    have GCWQ_DISASSOCIATED cleared on their gcwqs.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: CAI Qian <caiqian@redhat.com>
  * workqueue: fix cwq->nr_active underflow (Tejun Heo, 2010-08-25; 1 file, -10/+20)

    cwq->nr_active is used to keep track of how many work items are
    active for the cpu workqueue, where 'active' is defined as either
    pending on the global worklist or executing.  This is used to
    implement the max_active limit and workqueue freezing.  If a work
    item is queued after nr_active has already reached max_active, the
    work item doesn't increment nr_active and is put on the delayed
    queue, and gets activated later as previous active work items
    retire.

    try_to_grab_pending(), which is used in the cancellation path,
    unconditionally decremented nr_active whether the work item being
    cancelled is currently active or delayed, so cancelling a delayed
    work item makes nr_active underflow.  This breaks max_active
    enforcement and triggers BUG_ON() in destroy_workqueue() later on.

    This patch fixes this bug by adding a flag WORK_STRUCT_DELAYED,
    which is set while a work item is on the delayed list, and making
    try_to_grab_pending() decrement nr_active iff the work item is
    currently active.

    The addition of the flag enlarges cwq alignment to 256 bytes which
    is getting a bit too large.  It's scheduled to be reduced back to
    128 bytes by merging WORK_STRUCT_PENDING and WORK_STRUCT_CWQ in the
    next devel cycle.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Johannes Berg <johannes@sipsolutions.net>
  * workqueue: improve destroy_workqueue() debuggability (Tejun Heo, 2010-08-24; 1 file, -1/+6)

    Now that the worklist is global, having works pending after wq
    destruction can easily lead to an oops, and destroy_workqueue() has
    several BUG_ON()s to catch these cases.  Unfortunately, BUG_ON()
    doesn't tell much about how the work became pending after the final
    flush_workqueue().

    This patch adds WQ_DYING which is set before the final flush
    begins.  If a work is requested to be queued on a dying workqueue,
    WARN_ON_ONCE() is triggered and the request is ignored.  This
    clearly indicates which caller is trying to queue a work on a dying
    workqueue and keeps the system working in most cases.

    The locking rule comment is updated such that the 'I' rule includes
    modifying the field from the destruction path.

    Signed-off-by: Tejun Heo <tj@kernel.org>
  * workqueue: mark lock acquisition on worker_maybe_bind_and_lock() (Namhyung Kim, 2010-08-23; 1 file, -0/+1)

    worker_maybe_bind_and_lock() actually grabs gcwq->lock but was
    missing the proper annotation.  Add it.  This patch removes the
    following sparse warnings:

        kernel/workqueue.c:1214:13: warning: context imbalance in 'worker_maybe_bind_and_lock' - wrong count at exit
        arch/x86/include/asm/irqflags.h:44:9: warning: context imbalance in 'worker_rebind_fn' - unexpected unlock
        kernel/workqueue.c:1991:17: warning: context imbalance in 'rescuer_thread' - unexpected unlock

    Signed-off-by: Namhyung Kim <namhyung@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
  * workqueue: annotate lock context change (Namhyung Kim, 2010-08-23; 1 file, -0/+6)

    Some of the internal functions called within gcwq->lock context
    release and regrab the lock but were missing proper annotations.
    Add them.

    Signed-off-by: Namhyung Kim <namhyung@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
  * workqueue: free rescuer on destroy_workqueue (Xiaotian Feng, 2010-08-16; 1 file, -1/+1)

    wq->rescuer is not freed when the wq is destroyed, which leads to a
    memory leak.  This patch also removes a redundant line.

    Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
* workqueue: Add basic tracepoints to track workqueue execution (Arjan van de Ven, 2010-08-21; 1 file, -0/+9)

  With the introduction of the new unified work queue thread pools, we
  lost one feature: it's no longer possible to know which worker is
  causing the CPU to wake out of idle.  The result is that PowerTOP now
  reports a lot of "kworker/a:b" instead of more readable results.

  This patch adds a pair of tracepoints to the new workqueue code,
  similar in style to the timer/hrtimer tracepoints.

  With this pair of tracepoints, the next PowerTOP can correctly report
  which work item caused the wakeup (and how long it took):

      Interrupt (43)   i915                    time   3.51ms  wakeups 141
      Work    ieee80211_iface_work            time   0.81ms  wakeups  29
      Work    do_dbs_timer                    time   0.55ms  wakeups  24
      Process Xorg                            time  21.36ms  wakeups   4
      Timer   sched_rt_period_timer           time   0.01ms  wakeups   1

  Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
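  The pair brackets the callback invocation in process_one_work(),
  roughly (a sketch of the merged hooks):

      trace_workqueue_execute_start(work);
      f(work);
      /*
       * While we must be careful to not use "work" after this, the
       * trace point will only record its address.
       */
      trace_workqueue_execute_end(work);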