summaryrefslogtreecommitdiffstats
path: root/block (follow)
Commit message (Collapse)AuthorAgeFilesLines
* blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()Tejun Heo2015-08-191-4/+1
| | | | | | | | | | blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy enabled are guaranteed to return non-NULL and the counterpart in blk-throttle doesn't have these checks either. Remove the spurious NULL checks. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: reduce stack usage of blkg_rwstat_recursive_sum()Tejun Heo2015-08-191-6/+4
| | | | | | | | | | | | | | | | | | | The recent percpu conversion of blkg_rwstat triggered the following warning in certain configurations. block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes This is because blkg_rwstat now contains four percpu_counter which can be pretty big depending on debug options although it shouldn't be a problem in production configs. This patch removes one of the two local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to reduce stack usage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835 Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: remove cfqg_stats->sectorsTejun Heo2015-08-191-19/+36
| | | | | | | | | | | | | | | | | | cfq_stats->sectors is a blkg_stat which keeps track of the total number of sectors serviced; however, this can be trivially calculated from blkcg_gq->stat_bytes. The only thing necessary is adding up READs and WRITEs and then dividing by sector size. Remove cfqg_stats->sectors and make cfq print "sectors" and "sectors_recursive" from stat_bytes. While this is a bit more code, it removes duplicate stat allocations and updates and ensures that the reported stats stay in tune with each other. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: move io_service_bytes and io_serviced stats into blkcg_gqTejun Heo2015-08-193-90/+113
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, both cfq-iosched and blk-throttle keep track of io_service_bytes and io_serviced stats. While keeping track of them separately may be useful during development, it doesn't make much sense otherwise. Also, blk-throttle was counting bio's as IOs while cfq-iosched request's, which is more confusing than informative. This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq), removes the counterparts from cfq-iosched and blk-throttle and let them print from the common blkg counters. The common counters are incremented during bio issue in blkcg_bio_issue_check(). The outputs are still filtered by whether the policy has blkg_policy_data on a given blkg, so cfq's output won't show up if it has never been used for a given blkg. The only times when the outputs would differ significantly are when policies are attached on the fly or elevators are switched back and forth. Those are quite exceptional operations and I don't think they warrant keeping separate counters. v3: Update blkio-controller.txt accordingly. v2: Account IOs during bio issues instead of request completions so that bio-based drivers can be handled the same way. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gqTejun Heo2015-08-192-31/+46
| | | | | | | | | | | | | | | | Currently, blkg_[rw]stat_recursive_sum() assume that the target counter is located in pd (blkg_policy_data); however, some counters are planned to be moved to blkg (blkcg_gq). This patch updates blkg_[rw]stat_recursive_sum() to take blkg and blkg_policy pointers instead of pd. If policy is NULL, it indexes into blkg. If non-NULL, into the blkg's pd of the policy. The existing usages are updated to maintain the current behaviors. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: make blkcg_[rw]stat per-cpuTejun Heo2015-08-193-83/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | blkcg_[rw]stat are used as stat counters for blkcg policies. It isn't per-cpu by itself and blk-throttle makes it per-cpu by wrapping around it. This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc per-cpu wrapping in blk-throttle. * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct percpu_counter. This makes syncp unnecessary as remote accesses are handled by percpu_counter itself. * blkg_[rw]stat_init() can now fail due to percpu allocation failure and thus are updated to return int. * percpu_counters need explicit freeing. blkg_[rw]stat_exit() added. * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading and summing results are stored in ->aux_cnt[] instead. * Custom per-cpu stat implementation in blk-throttle is removed. This makes all blkcg stat counters per-cpu without complicating policy implmentations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with itTejun Heo2015-08-192-53/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup stats are local to each cgroup and doesn't propagate to ancestors by default. When recursive stats are necessary, the sum is calculated over all the descendants. This initially was for backward compatibility to support both group-local and recursive stats but this mode of operation makes general sense as stat update is much hotter thafn reporting those stats. This however ends up losing recursive stats when a child is removed. To work around this, cfq-iosched adds its stats to its parent cfq_group->dead_stats which is summed up together when calculating recursive stats. It's planned that the core stats will be moved to blkcg_gq, so we want to move the mechanism for keeping track of the stats of dead children from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which are atomic64_t's keeping track of auxiliary counts which are excluded when reading local counts but included for recursive. blkg_[rw]stat_merge() which were used by cfq to implement dead_stats are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of a dead cgroup to the aux counts of parent->stats instead of separate ->dead_stats. This will also help making blkg_[rw]stats per-cpu. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: consolidate blkg creation in blkcg_bio_issue_check()Tejun Heo2015-08-195-97/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkg (blkcg_gq) currently is created by blkcg policies invoking blkg_lookup_create() which ends up repeating about the same code in different policies. Theoretically, this can avoid the overhead of looking and/or creating blkg's if blkcg is enabled but no policy is in use; however, the cost of blkg lookup / creation is very low especially if only the root blkcg is in use which is highly likely if no blkcg policy is in active use - it boils down to a single very predictable conditional and surrounding RCU protection. This patch consolidates blkg creation to a new function blkcg_bio_issue_check() which is called during bio issue from generic_make_request_checks(). blkcg_bio_issue_check() is now the only function which tries to create missing blkg's. The subsequent policy and request_list operations just perform blkg_lookup() and if missing falls back to the root. * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup() instead of blkg_lookup_create(). * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu read locked and blkg already looked up. Both throtl_lookup_tg() and throtl_lookup_create_tg() are dropped. * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with cfq_lookup_cfqg()which uses blkg_lookup(). This consolidates blkg handling and avoids unnecessary blkg creation retries under memory pressure. In addition, this provides a common bio entry point into blkcg where things like common accounting can be performed. v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and !CONFIG_BLK_DEV_THROTTLING. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-throttle: improve queue bypass handlingTejun Heo2015-08-191-3/+4
| | | | | | | | | | | | | | | | | If a queue is bypassing, all blkcg policies should become noops but blk-throttle wasn't. It only became noop if the queue was dying. While this wouldn't lead to an oops as falling back to the root blkg is safe in this case, this can be a bit surprising - a bypassing queue could still be applying throttle limits. Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg() and testing blk_queue_bypass() in blk_throtl_bio() and bypassing before doing anything else. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: move root blkg lookup optimization from throtl_lookup_tg() to ↵Tejun Heo2015-08-191-7/+0
| | | | | | | | | | | | | | | __blkg_lookup() Currently, both throttle and cfq policies implement their own root blkg (blkcg_gq) lookup fast path. This patch moves root blkg optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched currently doesn't use blkg_lookup() but will be converted and drop the optimization too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: inline [__]blkg_lookup()Tejun Heo2015-08-191-36/+2
| | | | | | | | | | | | | | | | | blkg_lookup() checks whether the target queue is bypassing and, if not, calls __blkg_lookup() which first checks the lookup hint and then performs radix tree walk. The operations upto hint checking are trivial and there are many users of this function. This patch inlines blkg_lookup() and the fast path part of __blkg_lookup(). The radix tree lookup and hint update are now in blkg_lookup_slowpath(). This will help consolidating blkg handling by easing moving root blkcg short-circuit to inlined lookup fast path. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methodsTejun Heo2015-08-192-16/+42
| | | | | | | | | | | | | | | | | | | | | | | | Each active policy has a cpd (blkcg_policy_data) on each blkcg. The cpd's were allocated by blkcg core and each policy could request to allocate extra space at the end by setting blkcg_policy->cpd_size larger than the size of cpd. This is a bit unusual but blkg (blkcg_gq) policy data used to be handled this way too so it made sense to be consistent; however, blkg policy data switched to alloc/free callbacks. This patch makes similar changes to cpd handling. blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As cpd allocation is now done from policy side, it can simply allocate a larger area which embeds cpd at the beginning. As ->cpd_alloc_fn() may be able to perform all necessary initializations, this patch makes ->cpd_init_fn() optional. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: minor updates around blkcg_policy_dataTejun Heo2015-08-192-17/+18
| | | | | | | | | | | | | | | | | | | | * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used for blkcg_policy_data. * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of blkcg. This makes it consistent with blkg_policy_data methods and to-be-added cpd alloc/free methods. * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that cpd_init_fn() can determine the associated blkcg from blkcg_policy_data. v2: blkcg_policy_data->blkcg initializations were missing. Added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: make blkcg_policy methods take a pointer to blkcg_policy_dataTejun Heo2015-08-193-23/+22
| | | | | | | | | | | | | | | | | The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd (blkg_policy_data) while the older ones use blkg (blkcg_gq). As using blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd can always be mapped to blkg and given that these are policy-specific methods, it makes sense to converge on pd. This patch makes all methods deal with pd instead of blkg. Most conversions are trivial. In blk-cgroup.c, a couple method invocation sites now test whether pd exists instead of policy state for consistency. This shouldn't cause any behavioral differences. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-throttle: clean up blkg_policy_data alloc/init/exit/free methodsTejun Heo2015-08-193-52/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | With the recent addition of alloc and free methods, things became messier. This patch reorganizes them according to the followings. * ->pd_alloc_fn() Responsible for allocation and static initializations - the ones which can be done independent of where the pd might be attached. * ->pd_init_fn() Initializations which require the knowledge of where the pd is attached. * ->pd_free_fn() The counter part of pd_alloc_fn(). Static de-init and freeing. This leaves ->pd_exit_fn() without any users. Removed. While at it, collapse an one liner function throtl_pd_exit(), which has only one user, into its user. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-throttle: remove asynchrnous percpu stats allocation mechanismTejun Heo2015-08-191-87/+25
| | | | | | | | | | | | | | | | | | | Because percpu allocator couldn't do non-blocking allocations, blk-throttle was forced to implement an ad-hoc asynchronous allocation mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are allocated from an IO path without sleepable context. Now that percpu allocator can handle gfp_mask and blkg_policy_data alloc / free are handled by policy methods, the ad-hoc asynchronous allocation mechanism can be replaced with direct allocation from tg_stats_alloc_fn(). Rit it out. This ensures that an active throtl_grp always has valid non-NULL ->stats_cpu. Remove checks on it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methodsTejun Heo2015-08-193-12/+35
| | | | | | | | | | | | | | | | | | | | | | | A blkg (blkcg_gq) represents the relationship between a cgroup and request_queue. Each active policy has a pd (blkg_policy_data) on each blkg. The pd's were allocated by blkcg core and each policy could request to allocate extra space at the end by setting blkcg_policy->pd_size larger than the size of pd. This is a bit unusual but was done this way mostly to simplify error handling and all the existing use cases could be handled this way; however, this is becoming too restrictive now that percpu memory can be allocated without blocking. This introduces two new mandatory blkcg_policy methods - pd_alloc_fn() and pd_free_fn() - which are used to allocate and release pd for a given policy. As pd allocation is now done from policy side, it can simply allocate a larger area which embeds pd at the beginning. This change makes ->pd_size pointless. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fnTejun Heo2015-08-191-1/+2
| | | | | | | | | | | blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy() doesn't. As both in-kernel policies implement ->pd_init_fn, it currently doesn't break anything. Update blkcg_activate_policy() so that its behavior is consistent with blkg_create(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy()Tejun Heo2015-08-191-34/+21
| | | | | | | | | | | | | | | | | | When a policy gets activated, it needs to allocate and install its policy data on all existing blkg's (blkcg_gq's). Because blkg iteration is protected by a spinlock, it currently counts the total number of blkg's in the system, allocates the matching number of policy data on a list and installs them during a single iteration. This can be simplified by using speculative GFP_NOWAIT allocations while iterating and falling back to a preallocated policy data on failure. If the preallocated one has already been consumed, it releases the lock, preallocate with GFP_KERNEL and then restarts the iteration. This can be a bit more expensive than before but policy activation is a very cold path and shouldn't matter. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: remove unnecessary blkcg_root handling from css_alloc/free pathsTejun Heo2015-08-191-15/+10
| | | | | | | | | | | | | | | | blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free() bypasses policy data and blkcg freeing for blkcg_root. There's no reason to to treat policy data any differently for blkcg_root. If the root css gets allocated after policies are registered, policy registration path will add policy data; otherwise, the alloc path will. The free path isn't never invoked for root csses. This patch removes the unnecessary special handling of blkcg_root from css_alloc/free paths. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: use blkg_free() in blkcg_init_queue() failure pathTejun Heo2015-08-191-2/+3
| | | | | | | | | | | | | | | When blkcg_init_queue() fails midway after creating a new blkg, it performs kfree() directly; however, this doesn't free the policy data areas. Make it use blkg_free() instead. In turn, blkg_free() is updated to handle root request_list special case. While this fixes a possible memory leak, it's on an unlikely failure path of an already cold path and the size leaked per occurrence is miniscule too. I don't think it needs to be tagged for -stable. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: charge async IOs to the appropriate blkcg's instead of the rootTejun Heo2015-08-191-43/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up until now, all async IOs were queued to async queues which are shared across the whole request_queue, which means that blkcg resource control is completely void on async IOs including all writeback IOs. It was done this way because writeback didn't support writeback and there was no way of telling which writeback IO belonged to which cgroup; however, writeback recently became cgroup aware and writeback bio's are sent down properly tagged with the blkcg's to charge them against. This patch makes async cfq_queues per-cfq_cgroup instead of per-cfq_data so that each async IO is charged to the blkcg that it was tagged for instead of unconditionally attributing it to root. * cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group and alloc / destroy paths are updated accordingly. * cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async queues. * check_blkcg_changed() now also invalidates async queues as they no longer stay the same across cgroups. After this patch, cfq's proportional IO control through blkio.weight works correctly when cgroup writeback is in use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: fold cfq_find_alloc_queue() into cfq_get_queue()Tejun Heo2015-08-191-34/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | cfq_find_alloc_queue() checks whether a queue actually needs to be allocated, which is unnecessary as its sole caller, cfq_get_queue(), only calls it if so. Also, the oom queue fallback logic is scattered between cfq_get_queue() and cfq_find_alloc_queue(). There really isn't much going on in the latter and things can be made simpler by folding it into cfq_get_queue(). This patch collapses cfq_find_alloc_queue() into cfq_get_queue(). The change is fairly straight-forward with one exception - async_cfqq is now initialized to NULL and the "!is_sync" test in the last if conditional is replaced with "async_cfqq" test. This is because gcc (5.1.1) gets confused for some reason and warns that async_cfqq may be used uninitialized otherwise. Oh well, the code isn't necessarily worse this way. This patch doesn't cause any functional difference. v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: move cfq_group determination from cfq_find_alloc_queue() to ↵Tejun Heo2015-08-191-16/+12
| | | | | | | | | | | | | | | | | cfq_get_queue() This is necessary for making async cfq_cgroups per-cfq_group instead of per-cfq_data. While this change makes cfq_get_queue() perform RCU locking and look up cfq_group even when it reuses async queue, the extra overhead is extremely unlikely to be noticeable given that this is already sitting behind cic->cfqq[] cache and the overall cost of cfq operation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue()Tejun Heo2015-08-191-35/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | Even when allocations fail, cfq_find_alloc_queue() always returns a valid cfq_queue by falling back to the oom cfq_queue. As such, there isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT is set. GFP_NOWAIT allocations don't fail often and even when they do the degraded behavior is acceptable and temporary. After all, the only reason get_request(), which ultimately determines the gfp_mask, cares about __GFP_WAIT is to guarantee request allocation, assuming IO forward progress, for callers which are willing to wait. There's no reason for cfq_find_alloc_queue() to behave differently on __GFP_WAIT when it already has a fallback mechanism. Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes to its callers. This simplifies the function quite a bit and will help making async queues per-cfq_group. v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical ↵Tejun Heo2015-08-192-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | allocations blkcg performs several allocations to track IOs per cgroup and enforce resource control. Most of these allocations are performed lazily on demand in the IO path and thus can't involve reclaim path. Currently, these allocations use GFP_ATOMIC; however, blkcg can gracefully deal with occassional failures of these allocations by punting IOs to the root cgroup and there's no reason to reach into the emergency reserve. This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following allocations. * bdi_writeback_congested and blkcg_gq allocations in blkg_create(). * radix tree node allocations for blkcg->blkg_tree. * cfq_queue allocation on ioprio changes. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Suggested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: minor cleanupsTejun Heo2015-08-191-15/+11
| | | | | | | | | | | | | | | | * Some were accessing cic->cfqq[] directly. Always use cic_to_cfqq() and cic_set_cfqq(). * check_ioprio_changed() doesn't need to verify cfq_get_queue()'s return for NULL. It's always non-NULL. Simplify accordingly. This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request()Tejun Heo2015-08-191-0/+2
| | | | | | | | | | | | | | | | | If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request() replaces it by invoking cfq_get_queue() again without putting the oom queue leaking the reference it was holding. While oom queues are not released through reference counting, they're still reference counted and this can theoretically lead to the reference count overflowing and incorrectly invoke the usual release path on it. Fix it by making cfq_set_request() put the ref it was holding. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: fix async oom queue handlingTejun Heo2015-08-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | Async cfqq's (cfq_queue's) are shared across cfq_data. When cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it stashes the pointer in cfq_data and reuses it from then on; however, the function doesn't consider that cfq_find_alloc_queue() may return the oom_cfqq under memory pressure and installs the returned queue unconditionally. If the oom_cfqq is installed as an async cfqq, cfq_set_request() will continue calling cfq_get_queue() hoping to replace it with a proper queue; however, cfq_get_queue() will keep returning the cached queue for the slot - the oom_cfqq. Fix it by skipping caching if the queue is the oom one. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: simplify control flow in cfq_get_queue()Tejun Heo2015-08-191-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cfq_get_queue()'s control flow looks like the following. async_cfqq = NULL; cfqq = NULL; if (!is_sync) { ... async_cfqq = ...; cfqq = *async_cfqq; } if (!cfqq) cfqq = ...; if (!is_sync && !(*async_cfqq)) ...; The only thing the local variable init, the second if, and the async_cfqq test in the third if achieves is to skip cfqq creation and installation if *async_cfqq was already non-NULL. This is needlessly complicated with different tests examining the same condition. Simplify it to the following. if (!is_sync) { ... async_cfqq = ...; cfqq = *async_cfqq; if (cfqq) goto out; } cfqq = ...; if (!is_sync) ...; out: Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Do a full clone when splitting discard biosMartin K. Petersen2015-07-241-3/+12
| | | | | | | | | | | | | | | | | | | | | | This fixes a data corruption bug when using discard on top of MD linear, raid0 and raid10 personalities. Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing the bio_vec between the two resulting bios. That is fine for read/write requests where the bio_vec is immutable. For discards, however, we need to be able to attach a payload and update the bio_vec so the page can get mapped to a scatterlist entry. Therefore the bio_vec can not be shared when splitting discards and we must do a full clone. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Seunguk Shin <seunguk.shin@samsung.com> Tested-by: Seunguk Shin <seunguk.shin@samsung.com> Cc: Seunguk Shin <seunguk.shin@samsung.com> Cc: Jens Axboe <axboe@fb.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: <stable@vger.kernel.org> # v3.14+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: export bio_associate_*() and wbc_account_io()Tejun Heo2015-07-231-0/+2
| | | | | | | | | | bio_associate_blkcg(), bio_associate_current() and wbc_account_io() are used to implement cgroup writeback support for filesystems and thus need to be exported. Export them. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: fix gendisk reference leak in blkg_conf_prep()Tejun Heo2015-07-231-1/+5
| | | | | | | | | | When a blkcg configuration is targeted to a partition rather than a whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it forgets to put the gendisk ref in that case. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: set default timeout as 30 secondsMing Lei2015-07-161-1/+1
| | | | | | | | | | It is reasonable to set default timeout of request as 30 seconds instead of 30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64 based systems may choose 100 HZ. Signed-off-by: Ming Lei <ming.lei@canonical.com> Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()" Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: fix blkcg_policy_data allocation bugTejun Heo2015-07-091-36/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") updated per-blkcg policy data to be dynamically allocated. When a policy is registered, its policy data aren't created. Instead, when the policy is activated on a queue, the policy data are allocated if there are blkg's (blkcg_gq's) which are attached to a given blkcg. This is buggy. Consider the following scenario. 1. A blkcg is created. No blkg's attached yet. 2. The policy is registered. No policy data is allocated. 3. The policy is activated on a queue. As the above blkcg doesn't have any blkg's, it won't allocate the matching blkcg_policy_data. 4. An IO is issued from the blkcg and blkg is created and the blkcg still doesn't have the matching policy data allocated. With cfq-iosched, this leads to an oops. It also doesn't free policy data on policy unregistration assuming that freeing of all policy data on blkcg destruction should take care of it; however, this also is incorrect. 1. A blkcg has policy data. 2. The policy gets unregistered but the policy data remains. 3. Another policy gets registered on the same slot. 4. Later, the new policy tries to allocate policy data on the previous blkcg but the slot is already occupied and gets skipped. The policy ends up operating on the policy data of the previous policy. There's no reason to manage blkcg_policy_data lazily. The reason we do lazy allocation of blkg's is that the number of all possible blkg's is the product of cgroups and block devices which can reach a surprising level. blkcg_policy_data is contrained by the number of cgroups and shouldn't be a problem. This patch makes blkcg_policy_data to be allocated for all existing blkcg's on policy registration and freed on unregistration and removes blkcg_policy_data handling from policy [de]activation paths. This makes that blkcg_policy_data are created and removed with the policy they belong to and fixes the above described problems. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: implement all_blkcgs listTejun Heo2015-07-091-5/+12
| | | | | | | | | | | Add all_blkcgs list goes through blkcg->all_blkcgs_node and is protected by blkcg_pol_mutex. This will be used to fix blkcg_policy_data allocation bug. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating ↵Tejun Heo2015-07-091-0/+4
| | | | | | | | | | | | | | | | | | | blkcg_policy[] An entry in blkcg_policy[] is stable while there are non-bypassing in-flight IOs on a request_queue which has the policy activated. This is why most derefs of blkcg_policy[] don't need explicit locking; however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't have this protection and may race policies being added and removed. Fix it by adding explicit blkcg_pol_mutex protection around blkcg_policy[] iteration in blkcg_css_alloc(). Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methodsTejun Heo2015-07-091-19/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkcg_pol_mutex primarily protects the blkcg_policy array. It also protects cgroup file type [un]registration during policy addition / removal. This puts blkcg_pol_mutex outside cgroup internal synchronization and in turn makes it impossible to grab from blkcg's cgroup methods as that leads to cyclic dependency. Another problematic dependency arising from this is through cgroup interface file deactivation. Removing a cftype requires removing all files of the type which in turn involves draining all on-going invocations of the file methods. This means that an interface file implementation can't grab blkcg_pol_mutex as draining can lead to AA deadlock. blkcg_reset_stats() is already in this situation. It currently trylocks blkcg_pol_mutex and then unwinds and retries the whole operation on failure, which is cumbersome at best. It has a lengthy comment explaining how cgroup internal synchronization is involved and expected to be updated but as explained above this doesn't need cgroup internal locking to deadlock. It's a self-contained AA deadlock. The described circular dependencies can be easily broken by moving cftype [un]registration out of blkcg_pol_mutex and protect them with an outer mutex. This patch introduces blkcg_pol_register_mutex which wraps entire policy [un]registration including cftype operations and shrinks blkcg_pol_mutex critical section. This also makes the trylock dancing in blkcg_reset_stats() unnecessary. Removed. This patch is necessary for the following blkcg_policy_data allocation bug fixes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block/blk-cgroup.c: free per-blkcg data when freeing the blkcgArianna Avanzini2015-07-071-3/+6
| | | | | | | | | | | | | | | Currently, per-blkcg data is freed each time a policy is deactivated, that is also upon scheduler switch. However, when switching from a scheduler implementing a policy which requires per-blkcg data to another one, that same policy might be active on other devices, and therefore those same per-blkcg data could be still in use. This commit lets per-blkcg data be freed when the blkcg is freed instead of on policy deactivation. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Reported-and-tested-by: Michael Kaminsky <kaminsky@cs.cmu.edu> Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data") Signed-off-by: Jens Axboe <axboe@fb.com>
* block: use FIELD_SIZEOF to calculate size of a fieldManinder Singh2015-07-071-1/+1
| | | | | | | use FIELD_SIZEOF instead of open coding Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* bio integrity: do not assume bio_integrity_pool exists if bioset existsMike Snitzer2015-07-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio_integrity_alloc() and bio_integrity_free() assume that if a bio was allocated from a bioset that that bioset also had its bio_integrity_pool allocated using bioset_integrity_create(). This is a very bad assumption given that bioset_create() and bioset_integrity_create() are completely disjoint. Not all callers of bioset_create() have been trained to also call bioset_integrity_create() -- and they may not care to be. Fix this by falling back to kmalloc'ing 'struct bio_integrity_payload' rather than force all bioset consumers to (wastefully) preallocate a bio_integrity_pool that they very likely won't actually need (given the niche nature of the current block integrity support). Otherwise, a NULL pointer "Kernel BUG" with a trace like the following will be observed (as seen on s390x using zfcp storage) because dm-io doesn't use bioset_integrity_create() when creating its bioset: [ 791.643338] Call Trace: [ 791.643339] ([<00000003df98b848>] 0x3df98b848) [ 791.643341] [<00000000002c5de8>] bio_integrity_alloc+0x48/0xf8 [ 791.643348] [<00000000002c6486>] bio_integrity_prep+0xae/0x2f0 [ 791.643349] [<0000000000371e38>] blk_queue_bio+0x1c8/0x3d8 [ 791.643355] [<000000000036f8d0>] generic_make_request+0xc0/0x100 [ 791.643357] [<000000000036f9b2>] submit_bio+0xa2/0x198 [ 791.643406] [<000003ff801f9774>] dispatch_io+0x15c/0x3b0 [dm_mod] [ 791.643419] [<000003ff801f9b3e>] dm_io+0x176/0x2f0 [dm_mod] [ 791.643423] [<000003ff8074b28a>] do_reads+0x13a/0x1a8 [dm_mirror] [ 791.643425] [<000003ff8074b43a>] do_mirror+0x142/0x298 [dm_mirror] [ 791.643428] [<0000000000154fca>] process_one_work+0x18a/0x3f8 [ 791.643432] [<000000000015598a>] worker_thread+0x132/0x3b0 [ 791.643435] [<000000000015d49a>] kthread+0xd2/0xd8 [ 791.643438] [<00000000005bc0ca>] kernel_thread_starter+0x6/0xc [ 791.643446] [<00000000005bc0c4>] kernel_thread_starter+0x0/0xc Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* block: fix bogus EFAULT error from SG_IO ioctlPaolo Bonzini2015-06-271-2/+2
| | | | | | | | | | | | | | | Whenever blk_fill_sghdr_rq fails, its errno code is ignored and changed to EFAULT. This can cause very confusing errors: $ sg_persist -k /dev/sda persistent reservation in: pass through os error: Bad address The fix is trivial, just propagate the return value from blk_fill_sghdr_rq. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge tag 'dm-4.2-fixes' of ↵Linus Torvalds2015-06-261-11/+83
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Apologies for not pressing this request-based DM partial completion issue further, it was an oversight on my part. We'll have to get it fixed up properly and revisit for a future release. - Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made)" * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: fix "default" version to be 1.4.0 dm: bump the ioctl version to 4.32.0 Revert "block, dm: don't copy bios for request clones" Revert "dm: do not allocate any mempools for blk-mq request-based DM"
| * Revert "block, dm: don't copy bios for request clones"Mike Snitzer2015-06-261-11/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38. Justification for revert as reported in this dm-devel post: https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html this change should not be pushed to mainline yet. Firstly, Christoph has a newer version of the patch that fixes silent data corruption problem: https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html And the new version still depends on LLDDs to always complete requests to the end when error happens, while block API doesn't enforce such a requirement. If the assumption is ever broken, the inconsistency between request and bio (e.g. rq->__sector and rq->bio) will cause silent data corruption: https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2Mike Snitzer2015-05-298-212/+208
| |\
* | \ Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds2015-06-2611-728/+139
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
| * | | writeback, blkcg: propagate non-root blkcg congestion stateTejun Heo2015-06-021-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that bdi layer can handle per-blkcg bdi_writeback_congested state, blk_{set|clear}_congested() can propagate non-root blkcg congestion state to them. This can be easily achieved by disabling the root_rl tests in blk_{set|clear}_congested(). Note that we still need those tests when !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg wb's congestion state for events happening on other blkcgs. v2: Updated for bdi_writeback_congested. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | writeback, blkcg: restructure blk_{set|clear}_queue_congested()Tejun Heo2015-06-021-25/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_{set|clear}_queue_congested() take @q and set or clear, respectively, the congestion state of its bdi's root wb. Because bdi used to be able to handle congestion state only on the root wb, the callers of those functions tested whether the congestion is on the root blkcg and skipped if not. This is cumbersome and makes implementation of per cgroup bdi_writeback congestion state propagation difficult. This patch renames blk_{set|clear}_queue_congested() to blk_{set|clear}_congested(), and makes them take request_list instead of request_queue and test whether the specified request_list is the root one before updating bdi_writeback congestion state. This makes the tests in the callers unnecessary and simplifies them. As there are no external users of these functions, the definitions are moved from include/linux/blkdev.h to block/blk-core.c. This patch doesn't introduce any noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | writeback, blkcg: associate each blkcg_gq with the corresponding ↵Tejun Heo2015-06-021-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdi_writeback_congested A blkg (blkcg_gq) can be congested and decongested independently from other blkgs on the same request_queue. Accordingly, for cgroup writeback support, the congestion status at bdi (backing_dev_info) should be split and updated separately from matching blkg's. This patch prepares by adding blkg->wb_congested and associating a blkg with its matching per-blkcg bdi_writeback_congested on creation. v2: Updated to associate bdi_writeback_congested instead of bdi_writeback. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | writeback: make backing_dev_info host cgroup-specific bdi_writebacksTejun Heo2015-06-021-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>