summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* block: genhd: add 'groups' argument to device_add_diskHannes Reinecke2018-09-2828-33/+43
| | | | | | | | | | | | | | Update device_add_disk() to take an 'groups' argument so that individual drivers can register a device with additional sysfs attributes. This avoids race condition the driver would otherwise have if these groups were to be created with sysfs_add_groups(). Signed-off-by: Martin Wilck <martin.wilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kyber: add tracepointsOmar Sandoval2018-09-282-18/+130
| | | | | | | | | | When debugging Kyber, it's really useful to know what latencies we've been having, how the domain depths have been adjusted, and if we've actually been throttling. Add three tracepoints, kyber_latency, kyber_adjust, and kyber_throttled, to record that. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kyber: implement improved heuristicsOmar Sandoval2018-09-281-218/+279
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kyber's current heuristics have a few flaws: - It's based on the mean latency, but p99 latency tends to be more meaningful to anyone who cares about latency. The mean can also be skewed by rare outliers that the scheduler can't do anything about. - The statistics calculations are purely time-based with a short window. This works for steady, high load, but is more sensitive to outliers with bursty workloads. - It only considers the latency once an I/O has been submitted to the device, but the user cares about the time spent in the kernel, as well. These are shortcomings of the generic blk-stat code which doesn't quite fit the ideal use case for Kyber. So, this replaces the statistics with a histogram used to calculate percentiles of total latency and I/O latency, which we then use to adjust depths in a slightly more intelligent manner: - Sync and async writes are now the same domain. - Discards are a separate domain. - Domain queue depths are scaled by the ratio of the p99 total latency to the target latency (e.g., if the p99 latency is double the target latency, we will double the queue depth; if the p99 latency is half of the target latency, we can halve the queue depth). - We use the I/O latency to determine whether we should scale queue depths down: we will only scale down if any domain's I/O latency exceeds the target latency, which is an indicator of congestion in the device. These new heuristics are just as scalable as the heuristics they replace. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kyber: don't make domain token sbitmap larger than necessaryOmar Sandoval2018-09-281-13/+2
| | | | | | | | | | | | | | The domain token sbitmaps are currently initialized to the device queue depth or 256, whichever is larger, and immediately resized to the maximum depth for that domain (256, 128, or 64 for read, write, and other, respectively). The sbitmap is never resized larger than that, so it's unnecessary to allocate a bitmap larger than the maximum depth. Let's just allocate it to the maximum depth to begin with. This will use marginally less memory, and more importantly, give us a more appropriate number of bits per sbitmap word. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: export blk_stat_enable_accounting()Omar Sandoval2018-09-281-0/+1
| | | | | | | Kyber will need this in a future change if it is built as a module. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move call of scheduler's ->completed_request() hookOmar Sandoval2018-09-284-8/+8
| | | | | | | | | | Commit 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()") consolidated some calls using ktime_get() so we'd only need to call it once. Kyber's ->completed_request() hook also calls ktime_get(), so let's move it to the same place, too. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Enable support for runtime power managementBart Van Assche2018-09-262-6/+2
| | | | | | | | | | | | | | | Now that the blk-mq core processes power management requests (marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power management for blk-mq. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Make blk_get_request() block for non-PM requests while suspendedBart Van Assche2018-09-262-34/+47
| | | | | | | | | | | | | | | | | | | | Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. For blk-mq, instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Allow unfreezing of a queue while requests are in progressBart Van Assche2018-09-261-1/+1
| | | | | | | | | | | | | | | | A later patch will call blk_freeze_queue_start() followed by blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop to zero. Make sure that this doesn't cause a kernel warning to appear by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The former namely requires that the refcount it operates on is zero. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* percpu-refcount: Introduce percpu_ref_resurrect()Bart Van Assche2018-09-262-2/+27
| | | | | | | | | | | | | | | | This function will be used in a later patch to switch the struct request_queue q_usage_counter from killed back to live. In contrast to percpu_ref_reinit(), this new function does not require that the refcount is zero. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Schedule runtime resume earlierBart Van Assche2018-09-262-2/+2
| | | | | | | | | | | | | | | | | | Instead of scheduling runtime resume of a request queue after a request has been queued, schedule asynchronous resume during request allocation. The new pm_request_resume() calls occur after blk_queue_enter() has increased the q_usage_counter request queue member. This change is needed for a later patch that will make request allocation block while the queue status is not RPM_ACTIVE. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Split blk_pm_add_request() and blk_pm_put_request()Bart Van Assche2018-09-263-5/+33
| | | | | | | | | | | | | | | | | Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into two new functions and thereby separate legacy block layer code from code that works for both the legacy block layer and blk-mq. A later patch will add calls to the new functions in the blk-mq code. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, scsi: Change the preempt-only flag into a counterBart Van Assche2018-09-264-27/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Move power management code into a new source fileBart Van Assche2018-09-2611-239/+264
| | | | | | | | | | | | | | | | | | | | Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of header file <linux/blkdev.h> and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xen: don't include <xen/xen.h> from <asm/io.h> and <asm/dma-mapping.h>Christoph Hellwig2018-09-269-7/+7
| | | | | | | | | Nothing Xen specific in these headers, which get included from a lot of code in the kernel. So prune the includes and move them to the Xen-specific files that actually use them instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove ARCH_BIOVEC_PHYS_MERGEABLEChristoph Hellwig2018-09-265-15/+3
| | | | | | | | Take the Xen check into the core code instead of delegating it to the architectures. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xen: provide a prototype for xen_biovec_phys_mergeable in xen.hChristoph Hellwig2018-09-264-10/+4
| | | | | | | | Having multiple externs in arch headers is not a good way to provide a common interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xen: remove the xen_biovec_phys_mergeable exportChristoph Hellwig2018-09-261-1/+0
| | | | | | | BIOVEC_PHYS_MERGEABLE is only called from core block code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* arm: remove the unused BIOVEC_MERGEABLE defineChristoph Hellwig2018-09-261-7/+0
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: don't include bug.h from bio.hChristoph Hellwig2018-09-241-1/+0
| | | | | | | No need to pull in the BUG() defintion. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: don't include io.h from bio.hChristoph Hellwig2018-09-241-3/+0
| | | | | | | | Now that we don't need an override for BIOVEC_PHYS_MERGEABLE there is no need to drag this header in. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove bvec_to_physChristoph Hellwig2018-09-243-8/+3
| | | | | | | | | | We only use it in biovec_phys_mergeable and a m68k paravirt driver, so just opencode it there. Also remove the pointless unsigned long cast for the offset in the opencoded instances. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: merge BIOVEC_SEG_BOUNDARY into biovec_phys_mergeableChristoph Hellwig2018-09-245-48/+17
| | | | | | | | These two checks should always be performed together, so merge them into a single helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add a missing BIOVEC_SEG_BOUNDARY check in bio_add_pc_pageChristoph Hellwig2018-09-241-1/+3
| | | | | | | | | The actual recaculation of segments in __blk_recalc_rq_segments will do this check, so there is no point in forcing it if we know it won't succeed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify BIOVEC_PHYS_MERGEABLEChristoph Hellwig2018-09-248-30/+28
| | | | | | | | | | | Turn the macro into an inline, move it to blk.h and simplify the arch hooks a bit. Also rename the function to biovec_phys_mergeable as there is no need to shout. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move req_gap_back_merge to blk.hChristoph Hellwig2018-09-242-19/+19
| | | | | | | No need to expose these helpers outside the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move req_gap_{back,front}_merge to blk-merge.cChristoph Hellwig2018-09-242-69/+65
| | | | | | | | Keep it close to the actual users instead of exposing the function to all drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move integrity_req_gap_{back,front}_merge to blk.hChristoph Hellwig2018-09-242-33/+33
| | | | | | | No need to expose these to drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Document the functions that iterate over requestsBart Van Assche2018-09-221-7/+64
| | | | | | | | | | | | | | Make it easier to understand the purpose of the functions that iterate over requests by documenting their purpose. Fix several minor spelling and grammer mistakes in comments in these functions. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: rename blkg_try_get to blkg_trygetDennis Zhou (Facebook)2018-09-224-12/+9
| | | | | | | | | | | blkg reference counting now uses percpu_ref rather than atomic_t. Let's make this consistent with css_tryget. This renames blkg_try_get to blkg_tryget and now returns a bool rather than the blkg or NULL. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: change blkg reference counting to use percpu_refDennis Zhou (Facebook)2018-09-222-35/+44
| | | | | | | | | | Now that every bio is associated with a blkg, this puts the use of blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over the refcnt in blkg to use percpu_ref. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: cleanup and make blk_get_rl use blkg_lookup_createDennis Zhou (Facebook)2018-09-221-13/+21
| | | | | | | | | | | | | | | | | | | blk_get_rl is responsible for identifying which request_list a request should be allocated to. Try get logic was added earlier, but semantically the logic was not changed. This patch makes better use of the bio already having a reference to the blkg in the hot path. The cold path uses a better fallback of blkg_lookup_create rather than just blkg_lookup and then falling back to the q->root_rl. If lookup_create fails with anything but -ENODEV, it falls back to q->root_rl. A clarifying comment is added to explain why q->root_rl is used rather than the root blkg's rl. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: remove additional reference to the cssDennis Zhou (Facebook)2018-09-224-83/+81
| | | | | | | | | | | | | | | | The previous patch in this series removed carrying around a pointer to the css in blkg. However, the blkg association logic still relied on taking a reference on the css to ensure we wouldn't fail in getting a reference for the blkg. Here the implicit dependency on the css is removed. The association continues to rely on the tryget logic walking up the blkg tree. This streamlines the three ways that association can happen: normal, swap, and writeback. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: remove bio->bi_css and instead use bio->bi_blkgDennis Zhou (Facebook)2018-09-228-62/+25
| | | | | | | | | | | | | | Prior patches ensured that all bios are now associated with some blkg. This now makes bio->bi_css unnecessary as blkg maintains a reference to the blkcg already. This patch removes the field bi_css and transfers corresponding uses to access via bi_blkg. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: associate writeback bios with a blkgDennis Zhou (Facebook)2018-09-224-11/+14
| | | | | | | | | | | | One of the goals of this series is to remove a separate reference to the css of the bio. This can and should be accessed via bio_blkcg. In this patch, the wbc_init_bio call is changed such that it must be called after a queue has been associated with the bio. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: associate a blkg for pages being evicted by swapDennis Zhou (Facebook)2018-09-223-28/+68
| | | | | | | | | | | | | A prior patch in this series added blkg association to bios issued by cgroups. There are two other paths that we want to attribute work back to the appropriate cgroup: swap and writeback. Here we modify the way swap tags bios to include the blkg. Writeback will be tackle in the next patch. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: consolidate bio_issue_init to be a part of coreDennis Zhou (Facebook)2018-09-225-10/+13
| | | | | | | | | | | bio_issue_init among other things initializes the timestamp for an IO. Rather than have this logic handled by policies, this consolidates it to be on the init paths (normal, clone, bounce clone). Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: always associate a bio with a blkgDennis Zhou (Facebook)2018-09-225-40/+46
| | | | | | | | | | | | | | | | | Previously, blkg's were only assigned as needed by blk-iolatency and blk-throttle. bio->css was also always being associated while blkg was being looked up and then thrown away in blkcg_bio_issue_check. This patch begins the cleanup of bio->css and bio->bi_blkg by always associating a blkg in blkcg_bio_issue_check. This tries to create the blkg, but if it is not possible, falls back to using the root_blkg of the request_queue. Therefore, a bio will always be associated with a blkg. The duplicate association logic is removed from blk-throttle and blk-iolatency. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: convert blkg_lookup_create to find closest blkgDennis Zhou (Facebook)2018-09-223-15/+41
| | | | | | | | | | | | | | | | There are several scenarios where blkg_lookup_create can fail. Examples include the blkcg dying, request_queue is dying, or simply being OOM. At the end of the day, most handle this by simply falling back to the q->root_blkg and calling it a day. This patch implements the notion of closest blkg. During blkg_lookup_create, if it fails to create, return the closest blkg found or the q->root_blkg. blkg_try_get_closest is introduced and used during association so a bio is always attached to a blkg. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: update blkg_lookup_create to do lockingDennis Zhou (Facebook)2018-09-223-5/+32
| | | | | | | | | | | | | | | | | | | To know when to create a blkg, the general pattern is to do a blkg_lookup and if that fails, lock and then do a lookup again and if that fails finally create. It doesn't make much sense for everyone who wants to do creation to write this themselves. This changes blkg_lookup_create to do locking and implement this pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create. If a call site wants to do its own error handling or already owns the queue lock, they can use __blkg_lookup_create. This will be used in upcoming patches. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: fix ref count issue with bio_blkcg using task_cssDennis Zhou (Facebook)2018-09-226-16/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The accessor function bio_blkcg either returns the blkcg associated with the bio or finds one in the current context. This can cause an issue when trying to associate a bio with a blkcg. Particularly, it's the third case that is problematic: return css_to_blkcg(task_css(current, io_cgrp_id)); As the above may race against task migration and the cgroup exiting, it is not always ok to take a reference on the blkcg returned from bio_blkcg. This patch adds association ahead of calling bio_blkcg rather than after. This makes association a required and explicit step along the code paths for calling bio_blkcg. blk_get_rl is modified as well to get a reference to the blkcg it may use and blk_put_rl will always put the reference back. Association is also moved above the bio_blkcg call to ensure it will not return NULL in blk-iolatency. BFQ and CFQ utilize this flaw, but due to the complexity, I do not want to address this in this series. I've created a private version of the function with notes not to use it describing the flaw. Hopefully soon, that code can be cleaned up. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Blk-throttle: update to use rbtree with leftmost node cachedLiu Bo2018-09-211-26/+15
| | | | | | | | As rbtree has native support of caching leftmost node, i.e. rb_root_cached, no need to do the caching by ourselves. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: use bio_add_page in bio_iov_iter_get_pagesChristoph Hellwig2018-09-201-21/+20
| | | | | | | | | | | | Replace a nasty hack with a different nasty hack to prepare for multipage bio_vecs. By moving the temporary page array as far up as possible in the space allocated for the bio_vec array we can iterate forward over it and thus use bio_add_page. Using bio_add_page means we'll be able to merge physically contiguous pages once support for multipath bio_vecs is merged. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blok, bfq: do not plug I/O if all queues are weight-raisedPaolo Valente2018-09-141-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | To reduce latency for interactive and soft real-time applications, bfq privileges the bfq_queues containing the I/O of these applications. These privileged queues, referred-to as weight-raised queues, get a much higher share of the device throughput w.r.t. non-privileged queues. To preserve this higher share, the I/O of any non-weight-raised queue must be plugged whenever a sync weight-raised queue, while being served, remains temporarily empty. To attain this goal, bfq simply plugs any I/O (from any queue), if a sync weight-raised queue remains empty while in service. Unfortunately, this plugging typically lowers throughput with random I/O, on devices with internal queueing (because it reduces the filling level of the internal queues of the device). This commit addresses this issue by restricting the cases where plugging is performed: if a sync weight-raised queue remains empty while in service, then I/O plugging is performed only if some of the active bfq_queues are *not* weight-raised (which is actually the only circumstance where plugging is needed to preserve the higher share of the throughput of weight-raised queues). This restriction proved able to boost throughput in really many use cases needing only maximum throughput. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: inject other-queue I/O into seeky idle queues on NCQ flashPaolo Valente2018-09-142-6/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Achilles' heel of BFQ is its failing to reach a high throughput with sync random I/O on flash storage with internal queueing, in case the processes doing I/O have differentiated weights. The cause of this failure is as follows. If at least two processes do sync I/O, and have a different weight from each other, then BFQ plugs I/O dispatching every time one of these processes, while it is being served, remains temporarily without pending I/O requests. This plugging is necessary to guarantee that every process enjoys a bandwidth proportional to its weight; but it empties the internal queue(s) of the drive. And this kills throughput with random I/O. So, if some processes have differentiated weights and do both sync and random I/O, the end result is a throughput collapse. This commit tries to counter this problem by injecting the service of other processes, in a controlled way, while the process in service happens to have no I/O. This injection is performed only if the medium is non rotational and performs internal queueing, and the process in service does random I/O (service injection might be beneficial for sequential I/O too, we'll work on that). As an example of the benefits of this commit, on a PLEXTOR PX-256M5S SSD, and with five processes having differentiated weights and doing sync random 4KB I/O, this commit makes the throughput with bfq grow by 400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than that reached with none. As some less random I/O is added to the mix, the throughput becomes equal to or higher than that with none. This commit is a very first attempt to recover throughput without losing control, and certainly has many limitations. One is, e.g., that the processes whose service is injected are not chosen so as to distribute the extra bandwidth they receive in accordance to their weights. Thus there might be loss of weighted fairness in some cases. Anyway, this loss concerns extra service, which would not have been received at all without this commit. Other limitations and issues will probably show up with usage. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: correctly charge and reset entity service in all casesPaolo Valente2018-09-141-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BFQ schedules entities (which represent either per-process queues or groups of queues) as a function of their timestamps. In particular, as a function of their (virtual) finish times. The finish time of an entity is computed as a function of the budget assigned to the entity, assuming, tentatively, that the entity, once in service, will receive an amount of service equal to its budget. Then, when the entity is expired because it finishes to be served, this finish time is updated as a function of the actual service received by the entity. This allows the entity to be correctly charged with only the service received, and then to be correctly re-scheduled. Yet an entity may receive service also while not being the entity in service (in the scheduling environment of its parent entity), for several reasons. If the entity remains with no backlog while receiving this 'unofficial' service, then it is expired. Also on such an expiration, the finish time of the entity should be updated to account for only the service actually received by the entity. Unfortunately, such an update is not performed for an entity expiring without being the entity in service. In a similar vein, the service counter of the entity in service is reset when the entity is expired, to be ready to be used for next service cycle. This reset too should be performed also in case an entity is expired because it remains empty after receiving service while not being the entity in service. But in this case the reset is not performed. This commit performs the above update of the finish time and reset of the service received, also for an entity expiring while not being the entity in service. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-iolatency: remove set but not used variables 'changed' and 'blkiolat'YueHaibing2018-09-141-5/+0
| | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: block/blk-iolatency.c: In function 'scale_change': block/blk-iolatency.c:301:7: warning: variable 'changed' set but not used [-Wunused-but-set-variable] block/blk-iolatency.c: In function 'iolatency_set_limit': block/blk-iolatency.c:765:24: warning: variable 'blkiolat' set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* rsxx: Remove unnecessary parenthesesNathan Chancellor2018-09-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Clang warns when more than one set of parentheses is used for a single conditional statement: drivers/block/rsxx/cregs.c:279:15: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if ((cmd->op == CREG_OP_READ)) { ~~~~~~~~^~~~~~~~~~~~~~~ drivers/block/rsxx/cregs.c:279:15: note: remove extraneous parentheses around the comparison to silence this warning if ((cmd->op == CREG_OP_READ)) { ~ ^ ~ drivers/block/rsxx/cregs.c:279:15: note: use '=' to turn this equality comparison into an assignment if ((cmd->op == CREG_OP_READ)) { ^~ = 1 warning generated. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: umem: replace spin_lock_bh with spin_lock in tasklet callbackjun qian2018-09-071-2/+2
| | | | | | | As you are already in a tasklet, it is unnecessary to call spin_lock_bh. Signed-off-by: jun qian <hangdianqj@163.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove bio_rewind_iter()Ming Lei2018-09-064-30/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | It is pointed that bio_rewind_iter() is one very bad API[1]: 1) bio size may not be restored after rewinding 2) it causes some bogus change, such as 5151842b9d8732 (block: reset bi_iter.bi_done after splitting bio) 3) rewinding really makes things complicated wrt. bio splitting 4) unnecessary updating of .bi_done in fast path [1] https://marc.info/?t=153549924200005&r=1&w=2 So this patch takes Kent's suggestion to restore one bio into its original state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(), given now bio_rewind_iter() is only used by bio integrity code. Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Hannes Reinecke <hare@suse.com> Suggested-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>