summaryrefslogtreecommitdiffstats
path: root/block (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-linus-20190726-2' of git://git.kernel.dk/linux-blockLinus Torvalds2019-07-271-1/+2
|\ | | | | | | | | | | | | | | Pull block DMA segment fix from Jens Axboe: "Here's the virtual boundary segment size fix" * tag 'for-linus-20190726-2' of git://git.kernel.dk/linux-block: block: fix max segment size handling in blk_queue_virt_boundary
| * block: fix max segment size handling in blk_queue_virt_boundaryChristoph Hellwig2019-07-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | We should only set the max segment size to unlimited if we actually have a virt boundary. Otherwise we accidentally clear that limit when called from the SCSI midlayer, which always calls blk_queue_virt_boundary, even if that mask is 0. Fixes: 7ad388d8e4c7 ("scsi: core: add a host / host template field for the virt boundary") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-blockLinus Torvalds2019-07-267-45/+62
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...
| * block: blk-mq: Remove blk_mq_sched_started_request and started_requestMarcos Paulo de Souza2019-07-232-11/+0
| | | | | | | | | | | | | | | | | | blk_mq_sched_completed_request is a function that checks if the elevator related to the request has started_request implemented, but currently, none of the available IO schedulers implement started_request, so remove both. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: allow REQ_NOWAIT to return an error inlineJens Axboe2019-07-221-2/+6
| | | | | | | | | | | | | | | | | | | | | | By default, if a caller sets REQ_NOWAIT and we need to block, we'll return -EAGAIN through the bio->bi_end_io() callback. For some use cases, this makes it hard to use. Allow a caller to ask for inline return of errors related to blocking by also setting REQ_NOWAIT_INLINE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * rq-qos: use a mb for got_tokenJosef Bacik2019-07-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | Oleg noticed that our checking of data.got_token is unsafe in the cleanup case, and should really use a memory barrier. Use a wmb on the write side, and a rmb() on the read side. We don't need one in the main loop since we're saved by set_current_state(). Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * rq-qos: set ourself TASK_UNINTERRUPTIBLE after we scheduleJosef Bacik2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | In case we get a spurious wakeup we need to make sure to re-set ourselves to TASK_UNINTERRUPTIBLE so we don't busy wait. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * rq-qos: don't reset has_sleepers on spurious wakeupsJosef Bacik2019-07-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If we raced with somebody else getting an inflight counter we could fail to get an inflight counter with no sleepers on the list, and thus need to go to sleep. In this case has_sleepers should be true because we are now relying on the waker to get our inflight counter for us. And in the case of spurious wakeups we'd still want this to be the case. So set has_sleepers to true if we went to sleep to make sure we're woken up the proper way. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * rq-qos: fix missed wake-ups in rq_qos_throttleJosef Bacik2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We saw a hang in production with WBT where there was only one waiter in the throttle path and no outstanding IO. This is because of the has_sleepers optimization that is used to make sure we don't steal an inflight counter for new submitters when there are people already on the list. We can race with our check to see if the waitqueue has any waiters (this is done locklessly) and the time we actually add ourselves to the waitqueue. If this happens we'll go to sleep and never be woken up because nobody is doing IO to wake us up. Fix this by checking if the waitqueue has a single sleeper on the list after we add ourselves, that way we have an uptodate view of the list. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block, bfq: check also in-flight I/O in dispatch pluggingPaolo Valente2019-07-181-24/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consider a sync bfq_queue Q that remains empty while in service, and suppose that, when this happens, there is a fair amount of already in-flight I/O not belonging to Q. In such a situation, I/O dispatching may need to be plugged (until new I/O arrives for Q), for the following reason. The drive may decide to serve in-flight non-Q's I/O requests before Q's ones, thereby delaying the arrival of new I/O requests for Q (recall that Q is sync). If I/O-dispatching is not plugged, then, while Q remains empty, a basically uncontrolled amount of I/O from other queues may be dispatched too, possibly causing the service of Q's I/O to be delayed even longer in the drive. This problem gets more and more serious as the speed and the queue depth of the drive grow, because, as these two quantities grow, the probability to find no queue busy but many requests in flight grows too. If Q has the same weight and priority as the other queues, then the above delay is unlikely to cause any issue, because all queues tend to undergo the same treatment. So, since not plugging I/O dispatching is convenient for throughput, it is better not to plug. Things change in case Q has a higher weight or priority than some other queue, because Q's service guarantees may simply be violated. For this reason, commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") does plug I/O in such an asymmetric scenario. Plugging minimizes the delay induced by already in-flight I/O, and enables Q to recover the bandwidth it may lose because of this delay. Yet the above commit does not cover the case of weight-raised queues, for efficiency concerns. For weight-raised queues, I/O-dispatch plugging is activated simply if not all bfq_queues are weight-raised. But this check does not handle the case of in-flight requests, because a bfq_queue may become non busy *before* all its in-flight requests are completed. This commit performs I/O-dispatch plugging for weight-raised queues if there are some in-flight requests. As a practical example of the resulting recover of control, under write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5 seconds after this fix, against 15 seconds before the fix (as a reference, gnome-terminal takes about 35 seconds to start with any of the other I/O schedulers). Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix sysfs module parameters directory path in commentAkinobu Mita2019-07-161-1/+1
| | | | | | | | | | | | | | | | | | The runtime configurable module parameter files are located under /sys/module/MODULENAME/parameters, not /sys/module/MODULENAME. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blkcg: allow blkcg_policy->pd_stat() to print non-debug info tooTejun Heo2019-07-162-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | Currently, ->pd_stat() is called only when moduleparam blkcg_debug_stats is set which prevents it from printing non-debug policy-specific statistics. Let's move debug testing down so that ->pd_stat() can print non-debug stat too. This patch doesn't cause any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'docs/v5.3-1' of ↵Linus Torvalds2019-07-168-9/+9
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull rst conversion of docs from Mauro Carvalho Chehab: "As agreed with Jon, I'm sending this big series directly to you, c/c him, as this series required a special care, in order to avoid conflicts with other trees" * tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits) docs: kbuild: fix build with pdf and fix some minor issues docs: block: fix pdf output docs: arm: fix a breakage with pdf output docs: don't use nested tables docs: gpio: add sysfs interface to the admin-guide docs: locking: add it to the main index docs: add some directories to the main documentation index docs: add SPDX tags to new index files docs: add a memory-devices subdir to driver-api docs: phy: place documentation under driver-api docs: serial: move it to the driver-api docs: driver-api: add remaining converted dirs to it docs: driver-api: add xilinx driver API documentation docs: driver-api: add a series of orphaned documents docs: admin-guide: add a series of orphaned documents docs: cgroup-v1: add it to the admin-guide book docs: aoe: add it to the driver-api book docs: add some documentation dirs to the driver-api book docs: driver-model: move it to the driver-api book docs: lp855x-driver.rst: add it to the driver-api book ...
| * docs: admin-guide: add a series of orphaned documentsMauro Carvalho Chehab2019-07-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | There are lots of documents that belong to the admin-guide but are on random places (most under Documentation root dir). Move them to the admin guide. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
| * docs: cgroup-v1: add it to the admin-guide bookMauro Carvalho Chehab2019-07-151-1/+1
| | | | | | | | | | | | Those files belong to the admin guide, so add them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
| * docs: block: convert to ReSTMauro Carvalho Chehab2019-07-157-7/+7
| | | | | | | | | | | | | | | | | | | | | | Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
* | block: Limit zone array allocation sizeDamien Le Moal2019-07-121-16/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Limit the size of the struct blk_zone array used in blk_revalidate_disk_zones() to avoid memory allocation failures leading to disk revalidation failure. Also further reduce the likelyhood of such failures by using kvcalloc() (that is vmalloc()) instead of allocating contiguous pages with alloc_pages(). Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: Kill gfp_t argument of blkdev_report_zones()Damien Le Moal2019-07-121-12/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In preparation of using vmalloc() for large report buffer and zone array allocations used by this function, remove its "gfp_t gfp_mask" argument and rely on the caller context to use memalloc_noio_save/restore() where necessary (block layer zone revalidation and dm-zoned I/O error path). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: Allow mapping of vmalloc-ed buffersDamien Le Moal2019-07-121-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To allow the SCSI subsystem scsi_execute_req() function to issue requests using large buffers that are better allocated with vmalloc() rather than kmalloc(), modify bio_map_kern() to allow passing a buffer allocated with vmalloc(). To do so, detect vmalloc-ed buffers using is_vmalloc_addr(). For vmalloc-ed buffers, flush the buffer using flush_kernel_vmap_range(), use vmalloc_to_page() instead of virt_to_page() to obtain the pages of the buffer, and invalidate the buffer addresses with invalidate_kernel_vmap_range() on completion of read BIOs. This last point is executed using the function bio_invalidate_vmalloc_pages() which is defined only if the architecture defines ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the architecture actually needs the invalidation done. Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block/bio-integrity: fix a memory leak bugWenwen Wang2019-07-121-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to hold integrity metadata. Later on, the buffer will be attached to the bio structure through bio_integrity_add_page(), which returns the number of bytes of integrity metadata attached. Due to unexpected situations, bio_integrity_add_page() may return 0. As a result, bio_integrity_prep() needs to be terminated with 'false' returned to indicate this error. However, the allocated kernel buffer is not freed on this execution path, leading to a memory leak. To fix this issue, free the allocated buffer before returning from bio_integrity_prep(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: Disable write plugging for zoned block devicesDamien Le Moal2019-07-103-2/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simultaneously writing to a sequential zone of a zoned block device from multiple contexts requires mutual exclusion for BIO issuing to ensure that writes happen sequentially. However, even for a well behaved user correctly implementing such synchronization, BIO plugging may interfere and result in BIOs from the different contextx to be reordered if plugging is done outside of the mutual exclusion section, e.g. the plug was started by a function higher in the call chain than the function issuing BIOs. Context A Context B | blk_start_plug() | ... | seq_write_zone() | mutex_lock(zone) | bio-0->bi_iter.bi_sector = zone->wp | zone->wp += bio_sectors(bio-0) | submit_bio(bio-0) | bio-1->bi_iter.bi_sector = zone->wp | zone->wp += bio_sectors(bio-1) | submit_bio(bio-1) | mutex_unlock(zone) | return | -----------------------> | seq_write_zone() | mutex_lock(zone) | bio-2->bi_iter.bi_sector = zone->wp | zone->wp += bio_sectors(bio-2) | submit_bio(bio-2) | mutex_unlock(zone) | <------------------------- | | blk_finish_plug() In the above example, despite the mutex synchronization ensuring the correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being issued after BIO 2 of context B, when the plug is released with blk_finish_plug(). While this problem can be addressed using the blk_flush_plug_list() function (in the above example, the call must be inserted before the zone mutex lock is released), a simple generic solution in the block layer avoid this additional code in all zoned block device user code. The simple generic solution implemented with this patch is to introduce the internal helper function blk_mq_plug() to access the current context plug on BIO submission. This helper returns the current plug only if the target device is not a zoned block device or if the BIO to be plugged is not a write operation. Otherwise, the caller context plug is ignored and NULL returned, resulting is all writes to zoned block device to never be plugged. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-throttle: fix zero wait time for iops throttled groupKonstantin Khlebnikov2019-07-101-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced") wait time could be zero even if group is throttled and cannot issue requests right now. As a result throtl_select_dispatch() turns into busy-loop under irq-safe queue spinlock. Fix is simple: always round up target time to the next throttle slice. Fixes: 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: Fix potential overflow in blk_report_zones()Damien Le Moal2019-07-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For large values of the number of zones reported and/or large zone sizes, the sector increment calculated with blk_queue_zone_sectors(q) * n in blk_report_zones() loop can overflow the unsigned int type used for the calculation as both "n" and blk_queue_zone_sectors() value are unsigned int. E.g. for a device with 256 MB zones (524288 sectors), overflow happens with 8192 or more zones reported. Changing the return type of blk_queue_zone_sectors() to sector_t, fixes this problem and avoids overflow problem for all other callers of this helper too. The same change is also applied to the bdev_zone_sectors() helper. Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blkcg: implement REQ_CGROUP_PUNTTejun Heo2019-07-102-0/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a shared kthread needs to issue a bio for a cgroup, doing so synchronously can lead to priority inversions as the kthread can be trapped waiting for that cgroup. This patch implements REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing to a dedicated per-blkcg work item to avoid such priority inversions. This will be used to fix priority inversions in btrfs compression and should be generally useful as we grow filesystem support for comprehensive IO control. Cc: Chris Mason <clm@fb.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usagesTejun Heo2019-07-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | btrfs is going to use css_put() and wbc helpers to improve cgroup writeback support. Add dummy css_get() definition and export wbc helpers to prepare for module and !CONFIG_CGROUP builds. Reported-by: kbuild test robot <lkp@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-cgroup: turn on psi memstall stuffJosef Bacik2019-07-101-5/+7
| | | | | | | | | | | | | | | | | | With the psi stuff in place we can use the memstall flag to indicate pressure that happens from throttling. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: init flush rq ref count to 1Josef Bacik2019-07-101-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | We discovered a problem in newer kernels where a disconnect of a NBD device while the flush request was pending would result in a hang. This is because the blk mq timeout handler does if (!refcount_inc_not_zero(&rq->ref)) return true; to determine if it's ok to run the timeout handler for the request. Flush_rq's don't have a ref count set, so we'd skip running the timeout handler for this request and it would just sit there in limbo forever. Fix this by always setting the refcount of any request going through blk_init_rq() to 1. I tested this with a nbd-server that dropped flush requests to verify that it hung, and then tested with this patch to verify I got the timeout as expected and the error handling kicked in. Thanks, Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-blockLinus Torvalds2019-07-0922-818/+1342
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "This is the main block updates for 5.3. Nothing earth shattering or major in here, just fixes, additions, and improvements all over the map. This contains: - Series of documentation fixes (Bart) - Optimization of the blk-mq ctx get/put (Bart) - null_blk removal race condition fix (Bob) - req/bio_op() cleanups (Chaitanya) - Series cleaning up the segment accounting, and request/bio mapping (Christoph) - Series cleaning up the page getting/putting for bios (Christoph) - block cgroup cleanups and moving it to where it is used (Christoph) - block cgroup fixes (Tejun) - Series of fixes and improvements to bcache, most notably a write deadlock fix (Coly) - blk-iolatency STS_AGAIN and accounting fixes (Dennis) - Series of improvements and fixes to BFQ (Douglas, Paolo) - debugfs_create() return value check removal for drbd (Greg) - Use struct_size(), where appropriate (Gustavo) - Two lighnvm fixes (Heiner, Geert) - MD fixes, including a read balance and corruption fix (Guoqing, Marcos, Xiao, Yufen) - block opal shadow mbr additions (Jonas, Revanth) - sbitmap compare-and-exhange improvemnts (Pavel) - Fix for potential bio->bi_size overflow (Ming) - NVMe pull requests: - improved PCIe suspent support (Keith Busch) - error injection support for the admin queue (Akinobu Mita) - Fibre Channel discovery improvements (James Smart) - tracing improvements including nvmetc tracing support (Minwoo Im) - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya Kulkarni)" - Various little fixes and improvements to drivers and core" * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits) blk-iolatency: fix STS_AGAIN handling block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES blk-mq: simplify blk_mq_make_request() blk-mq: remove blk_mq_put_ctx() sbitmap: Replace cmpxchg with xchg block: fix .bi_size overflow block: sed-opal: check size of shadow mbr block: sed-opal: ioctl for writing to shadow mbr block: sed-opal: add ioctl for done-mark of shadow mbr block: never take page references for ITER_BVEC direct-io: use bio_release_pages in dio_bio_complete block_dev: use bio_release_pages in bio_unmap_user block_dev: use bio_release_pages in blkdev_bio_end_io iomap: use bio_release_pages in iomap_dio_bio_end_io block: use bio_release_pages in bio_map_user_iov block: use bio_release_pages in bio_unmap_user block: optionally mark pages dirty in bio_release_pages block: move the BIO_NO_PAGE_REF check into bio_release_pages block: skd_main.c: Remove call to memset after dma_alloc_coherent block: mtip32xx: Remove call to memset after dma_alloc_coherent ...
| * blk-iolatency: fix STS_AGAIN handlingDennis Zhou2019-07-051-37/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The iolatency controller is based on rq_qos. It increments on rq_qos_throttle() and decrements on either rq_qos_cleanup() or rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where blk_mq_make_request() may call both rq_qos_cleanup() and rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the double decrement. The above works upstream as the only way we can get STS_AGAIN is from blk_mq_get_request() failing. The STS_AGAIN handling isn't a real problem as bio_endio() skipping only happens on reserved tag allocation failures which can only be caused by driver bugs and already triggers WARN. However, the fix creates a not so great dependency on how STS_AGAIN can be propagated. Internally, we (Facebook) carry a patch that kills read ahead if a cgroup is io congested or a fatal signal is pending. This combined with chained bios progagate their bi_status to the parent is not already set can can cause the parent bio to not clean up properly even though it was successful. This consequently leaks the inflight counter and can hang all IOs under that blkg. To nip the adverse interaction early, this removes the rq_qos_cleanup() callback in iolatency in favor of cleaning up always on the rq_qos_done_bio() path. Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios") Debugged-by: Tejun Heo <tj@kernel.org> Debugged-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROESChristoph Hellwig2019-07-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | Fix a regression introduced when removing bi_phys_segments for Write Zeroes requests, which need to have a segment count of zero, as they don't have a payload. Fixes: 14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: simplify blk_mq_make_request()Bart Van Assche2019-07-031-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | Move the blk_mq_bio_to_request() call in front of the if-statement. Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove blk_mq_put_ctx()Bart Van Assche2019-07-035-32/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx(). Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix .bi_size overflowMing Lei2019-07-011-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1 bytes. Before 07173c3ec276 ("block: enable multipage bvecs"), one bio can include very limited pages, and usually at most 256, so the fs bio size won't be bigger than 1M bytes most of times. Since we support multi-page bvec, in theory one fs bio really can be added > 1M pages, especially in case of hugepage, or big writeback with too many dirty pages. Then there is chance in which .bi_size is overflowed. Fixes this issue by using bio_full() to check if the added segment may overflow .bi_size. Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 07173c3ec276 ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Merge tag 'v5.2-rc6' into for-5.3/blockJens Axboe2019-07-011-17/+21
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak fix. Otherwise we end up having conflicts with future patches between for-5.3/block and master that touch this area. In particular, it makes the bio_full() fix hard to backport to stable. * tag 'v5.2-rc6': (482 commits) Linux 5.2-rc6 Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock" Bluetooth: Fix regression with minimum encryption key size alignment tcp: refine memory limit test in tcp_fragment() x86/vdso: Prevent segfaults due to hoisted vclock reads SUNRPC: Fix a credential refcount leak Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE" net :sunrpc :clnt :Fix xps refcount imbalance on the error path NFS4: Only set creation opendata if O_CREAT ARM: 8867/1: vdso: pass --be8 to linker if necessary KVM: nVMX: reorganize initial steps of vmx_set_nested_state KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries habanalabs: use u64_to_user_ptr() for reading user pointers nfsd: replace Jeff by Chuck as nfsd co-maintainer inet: clear num_timeout reqsk_alloc() PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier ...
| * | block: sed-opal: check size of shadow mbrJonas Rabenstein2019-06-292-0/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check whether the shadow mbr does fit in the provided space on the target. Also a proper firmware should handle this case and return an error we may prevent problems or even damage with crappy firmwares. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: sed-opal: ioctl for writing to shadow mbrJonas Rabenstein2019-06-291-2/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow modification of the shadow mbr. If the shadow mbr is not marked as done, this data will be presented read only as the device content. Only after marking the shadow mbr as done and unlocking a locking range the actual content is accessible. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: sed-opal: add ioctl for done-mark of shadow mbrJonas Rabenstein2019-06-291-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable users to mark the shadow mbr as done without completely deactivating the shadow mbr feature. This may be useful on reboots, when the power to the disk is not disconnected in between and the shadow mbr stores the required boot files. Of course, this saves also the (few) commands required to enable the feature if it is already enabled and one only wants to mark the shadow mbr as done. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: never take page references for ITER_BVECChristoph Hellwig2019-06-291-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: use bio_release_pages in bio_map_user_iovChristoph Hellwig2019-06-291-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use bio_release_pages instead of open coding it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: use bio_release_pages in bio_unmap_userChristoph Hellwig2019-06-291-19/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use bio_release_pages instead of open coding it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: optionally mark pages dirty in bio_release_pagesChristoph Hellwig2019-06-291-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A lot of callers of bio_release_pages also want to mark the released pages as dirty. Add a mark_dirty parameter to avoid a second relatively expensive bio_for_each_segment_all loop. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: move the BIO_NO_PAGE_REF check into bio_release_pagesChristoph Hellwig2019-06-291-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the BIO_NO_PAGE_REF check into bio_release_pages instead of duplicating it in both callers. Also make the function available outside of bio.c so that we can reuse it in other direct I/O implementations. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: sed-opal: "Never True" conditionsRevanth Rajashekar2019-06-291-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | 'who' an unsigned variable in stucture opal_session_info can never be lesser than zero. Hence, the condition "who < OPAL_ADMIN1" can never be true. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: sed-opal: PSID reverttper capabilityRevanth Rajashekar2019-06-291-4/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PSID is a 32 character password printed on the drive label, to prove its physical access. This PSID reverttper function is very useful to regain the control over the drive when it is locked and the user can no longer access it because of some failures. However, *all the data on the drive is completely erased*. This method is advisable only when the user is exhausted of all other recovery methods. PSID capabilities are described in: https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_PSID_v1.00_r1.00.pdf Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: NULL out the bic when it's no longer validDouglas Anderson2019-06-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: Remove unused codeDamien Le Moal2019-06-271-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | bio_flush_dcache_pages() is unused. Remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: Init saved_wr_start_at_switch_to_srt in unlikely caseDouglas Anderson2019-06-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some debug code suggested by Paolo was tripping when I did reboot stress tests. Specifically in bfq_bfqq_resume_state() "bic->saved_wr_start_at_switch_to_srt" was later than the current value of "jiffies". A bit of debugging showed that "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more debugging showed that was because we had run through the "unlikely" case in the bfq_bfqq_save_state() function. Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to something sane. NOTE: this fixes no known real-world errors. Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: re-schedule empty queues if they deserve I/O pluggingPaolo Valente2019-06-251-184/+203
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consider, on one side, a bfq_queue Q that remains empty while in service, and, on the other side, the pending I/O of bfq_queues that, according to their timestamps, have to be served after Q. If an uncontrolled amount of I/O from the latter bfq_queues were dispatched while Q is waiting for its new I/O to arrive, then Q's bandwidth guarantees would be violated. To prevent this, I/O dispatch is plugged until Q receives new I/O (except for a properly controlled amount of injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging, for the following reason. Preemption is performed in two steps. First, Q is expired and re-scheduled. Second, the new bfq_queue to serve is chosen. The first step is needed by the second, as the second can be performed only after Q's timestamps have been properly updated (done in the expiration step), and Q has been re-queued for service. This dependency is a consequence of the way how BFQ's scheduling algorithm is currently implemented. But Q is not re-scheduled at all in the first step, because Q is empty. As a consequence, an uncontrolled amount of I/O may be dispatched until Q becomes non empty again. This breaks Q's service guarantees. This commit addresses this issue by re-scheduling Q even if it is empty. This in turn breaks the assumption that all scheduled queues are non empty. Then a few extra checks are now needed. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: preempt lower-weight or lower-priority queuesPaolo Valente2019-06-251-20/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BFQ enqueues the I/O coming from each process into a separate bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be served for at most timeout_sync milliseconds (default: 125 ms). This service scheme is prone to the following inaccuracy. While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may receive I/O, and, according to BFQ's scheduling policy, may become the right bfq_queue to serve, in place of the currently in-service bfq_queue. In this respect, postponing the service of Q2 to after the service of Q1 finishes may delay the completion of Q2's I/O, compared with an ideal service in which all non-empty bfq_queues are served in parallel, and every non-empty bfq_queue is served at a rate proportional to the bfq_queue's weight. This additional delay is equal at most to the time Q1 may unjustly remain in service before switching to Q2. If Q1 and Q2 have the same weight, then this time is most likely negligible compared with the completion time to be guaranteed to Q2's I/O. In addition, first, one of the reasons why BFQ may want to serve Q1 for a while is that this boosts throughput and, second, serving Q1 longer reduces BFQ's overhead. As a conclusion, it is usually better not to preempt Q1 if both Q1 and Q2 have the same weight. In contrast, as Q2's weight or priority becomes higher and higher compared with that of Q1, the above delay becomes larger and larger, compared with the I/O completion times that have to be guaranteed to Q2 according to Q2's weight. So reducing this delay may be more important than avoiding the costs of preempting Q1. Accordingly, this commit preempts Q1 if Q2 has a higher weight or a higher priority than Q1. Preemption causes Q1 to be re-scheduled, and triggers a new choice of the next bfq_queue to serve. If Q2 really is the next bfq_queue to serve, then Q2 will be set in service immediately. This change reduces the component of the I/O latency caused by the above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5 SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for a process doing sporadic random reads while another process is doing continuous sequential reads. Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: detect wakers and unconditionally inject their I/OPaolo Valente2019-06-252-34/+261
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bfq_queue Q may happen to be synchronized with another bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to receive new I/O. We call Q2 "waker queue". If I/O plugging is being performed for Q, and Q is not receiving any more I/O because of the above synchronization, then, thanks to BFQ's injection mechanism, the waker queue is likely to get served before the I/O-plugging timeout fires. Unfortunately, this fact may not be sufficient to guarantee a high throughput during the I/O plugging, because the inject limit for Q may be too low to guarantee a lot of injected I/O. In addition, the duration of the plugging, i.e., the time before Q finally receives new I/O, may not be minimized, because the waker queue may happen to be served only after other queues. To address these issues, this commit introduces the explicit detection of the waker queue, and the unconditional injection of a pending I/O request of the waker queue on each invocation of bfq_dispatch_request(). One may be concerned that this systematic injection of I/O from the waker queue delays the service of Q's I/O. Fortunately, it doesn't. On the contrary, next Q's I/O is brought forward dramatically, for it is not blocked for milliseconds. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>