summaryrefslogtreecommitdiffstats
path: root/block/blk-mq-sched.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handlingJohn Garry2021-07-281-13/+4
| | | | | | | | | | | | | | | | If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we call blk_mq_sched_free_tags() -> blk_mq_free_rqs(). It is incorrect to do so, as any rqs would have already been freed in the blk_mq_alloc_rqs() call. Fix by calling blk_mq_free_rq_map() only directly. Fixes: 6917ff0b5bd41 ("blk-mq-sched: refactor scheduler initialization") Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk: Fix lock inversion between ioc lock and bfqd lockJan Kara2021-06-251-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that upto the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove unnecessary elevator operation checksDamien Le Moal2021-06-181-7/+6
| | | | | | | | | | | | | | | | | | | | | | The insert_requests and dispatch_request elevator operations are mandatory for the correct execution of an elevator, and all implemented elevators (bfq, kyber and mq-deadline) implement them. As a result, there is no need to check for these operations before calling them when a queue has an elevator set. This simplifies the code in __blk_mq_sched_dispatch_requests() and blk_mq_sched_insert_request(). To avoid out-of-tree elevators to crash the kernel in case of bad implementation, add a check in elv_register() to verify that these operations are implemented. A small, probably not significant, IOPS improvement of 0.1% is observed with this patch applied (4.117 MIOPS to 4.123 MIOPS, average of 20 fio runs doing 4K random direct reads with psync and 32 jobs). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210618015922.713999-1-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix use-after-free in blk_mq_exit_schedMing Lei2021-06-181-1/+3
| | | | | | | | | | | | | | | | | | | tagset can't be used after blk_cleanup_queue() is returned because freeing tagset usually follows blk_clenup_queue(). Commit d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") adds check on q->tag_set->flags in blk_mq_exit_sched(), and causes use-after-free. Fixes it by using hctx->flags. Reported-by: syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com Fixes: d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") Cc: John Garry <john.garry@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: John Garry <john.garry@huawei.com> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210609063046.122843-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Do not pull requests from the scheduler when we cannot dispatch themJan Kara2021-06-031-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provided the device driver does not implement dispatch budget accounting (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls requests from the IO scheduler as long as it is willing to give out any. That defeats scheduling heuristics inside the scheduler by creating false impression that the device can take more IO when it in fact cannot. For example with BFQ IO scheduler on top of virtio-blk device setting blkio cgroup weight has barely any impact on observed throughput of async IO because __blk_mq_do_dispatch_sched() always sucks out all the IO queued in BFQ. BFQ first submits IO from higher weight cgroups but when that is all dispatched, it will give out IO of lower weight cgroups as well. And then we have to wait for all this IO to be dispatched to the disk (which means lot of it actually has to complete) before the IO scheduler is queried again for dispatching more requests. This completely destroys any service differentiation. So grab request tag for a request pulled out of the IO scheduler already in __blk_mq_do_dispatch_sched() and do not pull any more requests if we cannot get it because we are unlikely to be able to dispatch it. That way only single request is going to wait in the dispatch list for some tag to free. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Use request queue-wide tags for tagset-wide sbitmapJohn Garry2021-05-241-13/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tags used for an IO scheduler are currently per hctx. As such, when q->nr_hw_queues grows, so does the request queue total IO scheduler tag depth. This may cause problems for SCSI MQ HBAs whose total driver depth is fixed. Ming and Yanhui report higher CPU usage and lower throughput in scenarios where the fixed total driver tag depth is appreciably lower than the total scheduler tag depth: https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b In that scenario, since the scheduler tag is got first, much contention is introduced since a driver tag may not be available after we have got the sched tag. Improve this scenario by introducing request queue-wide tags for when a tagset-wide sbitmap is used. The static sched requests are still allocated per hctx, as requests are initialised per hctx, as in blk_mq_init_request(..., hctx_idx, ...) -> set->ops->init_request(.., hctx_idx, ...). For simplicity of resizing the request queue sbitmap when updating the request queue depth, just init at the max possible size, so we don't need to deal with the possibly with swapping out a new sbitmap for old if we need to grow. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kyber: fix out of bounds access when preemptedOmar Sandoval2021-05-111-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx). This manifested as this UBSAN array index out of bounds error reported by Jakub: UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9 index 13106 is out of range for type 'long unsigned int [128]' Call Trace: dump_stack+0xa4/0xe5 ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34 queued_spin_lock_slowpath+0x476/0x480 do_raw_spin_lock+0x1c2/0x1d0 kyber_bio_merge+0x112/0x180 blk_mq_submit_bio+0x1f5/0x1100 submit_bio_noacct+0x7b0/0x870 submit_bio+0xc2/0x3a0 btrfs_map_bio+0x4f0/0x9d0 btrfs_submit_data_bio+0x24e/0x310 submit_one_bio+0x7f/0xb0 submit_extent_page+0xc4/0x440 __extent_writepage_io+0x2b8/0x5e0 __extent_writepage+0x28d/0x6e0 extent_write_cache_pages+0x4d7/0x7a0 extent_writepages+0xa2/0x110 do_writepages+0x8f/0x180 __writeback_single_inode+0x99/0x7f0 writeback_sb_inodes+0x34e/0x790 __writeback_inodes_wb+0x9e/0x120 wb_writeback+0x4d2/0x660 wb_workfn+0x64d/0xa10 process_one_work+0x53a/0xa80 worker_thread+0x69/0x5b0 kthread+0x20b/0x240 ret_from_fork+0x1f/0x30 Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch. Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2021-04-291-4/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull SCSI updates from James Bottomley: "This consists of the usual driver updates (ufs, target, tcmu, smartpqi, lpfc, zfcp, qla2xxx, mpt3sas, pm80xx). The major core change is using a sbitmap instead of an atomic for queue tracking" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (412 commits) scsi: target: tcm_fc: Fix a kernel-doc header scsi: target: Shorten ALUA error messages scsi: target: Fix two format specifiers scsi: target: Compare explicitly with SAM_STAT_GOOD scsi: sd: Introduce a new local variable in sd_check_events() scsi: dc395x: Open-code status_byte(u8) calls scsi: 53c700: Open-code status_byte(u8) calls scsi: smartpqi: Remove unused functions scsi: qla4xxx: Remove an unused function scsi: myrs: Remove unused functions scsi: myrb: Remove unused functions scsi: mpt3sas: Fix two kernel-doc headers scsi: fcoe: Suppress a compiler warning scsi: libfc: Fix a format specifier scsi: aacraid: Remove an unused function scsi: core: Introduce enum scsi_disposition scsi: core: Modify the scsi_send_eh_cmnd() return value for the SDEV_BLOCK case scsi: core: Rename scsi_softirq_done() into scsi_complete() scsi: core: Remove an incorrect comment scsi: core: Make the scsi_alloc_sgtables() documentation more accurate ...
| * scsi: blk-mq: Return budget token from .get_budget callbackMing Lei2021-03-041-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SCSI uses a global atomic variable to track queue depth for each LUN/request queue. This doesn't scale well when there are lots of CPU cores and the disk is very fast. It has been observed that IOPS is affected a lot by tracking queue depth via sdev->device_busy in the I/O path. Return budget token from .get_budget callback. The budget token can be passed to driver so that we can replace the atomic variable with sbitmap_queue and alleviate the scaling problems that way. Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* | treewide: Change list_sort to use const pointersSami Tolvanen2021-04-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
* | block: Drop leftover references to RQF_SORTEDJean Delvare2021-03-011-5/+1
|/ | | | | | | | | | | | | | Commit a1ce35fa49852db60fc6e268038530be533c5b15 ("block: remove dead elevator code") removed all users of RQF_SORTED. However it is still defined, and there is one reference left to it (which in effect is dead code). Clear it all up. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: get rid of the trace rq insert wrapperChaitanya Kulkarni2021-02-221-6/+0
| | | | | | | | | | | Get rid of the wrapper for trace_block_rq_insert() and call the function directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the request_queue to argument request based tracepointsChristoph Hellwig2020-12-041-1/+1
| | | | | | | | | | | The request_queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: get rid of the dead flush handle code pathYufen Yu2020-10-091-6/+0
| | | | | | | | | | | | | | After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq"), blk_mq_submit_bio() will call blk_insert_flush() directly to handle flush request rather than blk_mq_sched_insert_request() in the case of elevator. Then, all flush request either have set RQF_FLUSH_SEQ flag when call blk_mq_sched_insert_request(), or have inserted into hctx->dispatch. So, remove the dead code path. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move blk_mq_sched_try_merge to blk-merge.cChristoph Hellwig2020-10-061-32/+0
| | | | | | | | Move blk_mq_sched_try_merge to blk-merge.c, which allows to mark a lot of the merge infrastructure static there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove unused blk_mq_sched_free_hctx_data()Baolin Wang2020-09-081-15/+0
| | | | | | | | | Now we usually free the hctx->sched_data by e->type->ops.exit_hctx(), and no users will use blk_mq_sched_free_hctx_data() function. Remove it. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Facilitate a shared sbitmap per tagsetJohn Garry2020-09-031-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support multiple reply queues with single hostwide tags. In addition, these drivers want to use interrupt assignment in pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0], CPU hotplug may cause in-flight IO completion to not be serviced when an interrupt is shutdown. That problem is solved in commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline"). However, to take advantage of that blk-mq feature, the HBA HW queuess are required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW queues need to be exposed to the upper layer. In making that transition, the per-SCSI command request tags are no longer unique per Scsi host - they are just unique per hctx. As such, the HBA LLDD would have to generate this tag internally, which has a certain performance overhead. However another problem is that blk-mq assumes the host may accept (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy counter was removed, which would stop the LLDD being sent more than .can_queue commands; however, it should still be ensured that the block layer does not issue more than .can_queue commands to the Scsi host. To solve this problem, introduce a shared sbitmap per blk_mq_tag_set, which may be requested at init time. New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the tagset to indicate whether the shared sbitmap should be used. Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests are still allocated per hctx; the reason for this is that if tags and requests were only allocated for a single hctx - like hctx0 - it may break block drivers which expect a request be associated with a specific hctx, i.e. not always hctx0. This will introduce extra memory usage. This change is based on work originally from Ming Lei in [1] and from Bart's suggestion in [2]. [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/ [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Pass flags for tag init/freeJohn Garry2020-09-031-3/+8
| | | | | | | | | | | | Pass hctx/tagset flags argument down to blk_mq_init_tags() and blk_mq_free_tags() for selective init/free. For now, make it include the alloc policy flag, which can be evaluated when needed (in blk_mq_init_tags()). Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: use BLK_MQ_NO_TAG for no tagXianting Tian2020-09-021-1/+1
| | | | | | | Replace various magic -1 constants for tags with BLK_MQ_NO_TAG. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove blk_mq_attempt_merge() functionBaolin Wang2020-09-021-28/+16
| | | | | | | | | The small blk_mq_attempt_merge() function is only called by __blk_mq_sched_bio_merge(), just open code it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Add a new helper to attempt to merge a bioBaolin Wang2020-09-021-3/+3
| | | | | | | | | | | There are lots of duplicated code when trying to merge a bio from plug list and sw queue, we can introduce a new helper to attempt to merge a bio, which can simplify the blk_bio_list_merge() and blk_attempt_plug_merge(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Move blk_mq_bio_list_merge() into blk-merge.cBaolin Wang2020-09-021-45/+1
| | | | | | | | | Move the blk_mq_bio_list_merge() into blk-merge.c and rename it as a generic name. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTARTMing Lei2020-08-171-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SCHED_RESTART code path is relied to re-run queue for dispatch requests in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding requests to hctx->dispatch. memory barriers have to be used for ordering the following two pair of OPs: 1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list() 2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch in blk_mq_sched_restart(). Without the added memory barrier, either: 1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in dispatch side or 2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue in dispatch side, meantime checking if there is request in hctx->dispatch from blk_mq_sched_restart() is missed. IO hang in ltp/fs_fill test is reported by kernel test robot: https://lkml.org/lkml/2020/7/26/77 Turns out it is caused by the above out-of-order OPs. And the IO hang can't be observed any more after applying this patch. Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: blk-mq-sched: delete duplicated wordRandy Dunlap2020-08-011-1/+1
| | | | | | | | | Drop the repeated word "to". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Remove unnecessary local variableBaolin Wang2020-07-101-3/+1
| | | | | | | | Remove unnecessary local variable 'ret' in blk_mq_dispatch_hctx_list(). Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: support batching dispatch in case of ioMing Lei2020-06-301-7/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | More and more drivers want to get batching requests queued from block layer, such as mmc, and tcp based storage drivers. Also current in-tree users have virtio-scsi, virtio-blk and nvme. For none, we already support batching dispatch. But for io scheduler, every time we just take one request from scheduler and pass the single request to blk_mq_dispatch_rq_list(). This way makes batching dispatch not possible when io scheduler is applied. One reason is that we don't want to hurt sequential IO performance, becasue IO merge chance is reduced if more requests are dequeued from scheduler queue. Try to support batching dispatch for io scheduler by starting with the following simple approach: 1) still make sure we can get budget before dequeueing request 2) use hctx->dispatch_busy to evaluate if queue is busy, if it is busy we fackback to non-batching dispatch, otherwise dequeue as many as possible requests from scheduler, and pass them to blk_mq_dispatch_rq_list(). Wrt. 2), we use similar policy for none, and turns out that SCSI SSD performance got improved much. In future, maybe we can develop more intelligent algorithem for batching dispatch. Baolin has tested this patch and found that MMC performance is improved[3]. [1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r [2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/ [3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: pass obtained budget count to blk_mq_dispatch_rq_listMing Lei2020-06-301-4/+4
| | | | | | | | | | | | | | | | | | Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare for supporting fully batching submission. With the obtained budget count, it is easier to put extra budgets in case of .queue_rq failure. Meantime remove the old 'got_budget' parameter. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: pass hctx to blk_mq_dispatch_rq_listMing Lei2020-06-301-8/+6
| | | | | | | | | | | | | | | | All requests in the 'list' of blk_mq_dispatch_rq_list belong to same hctx, so it is better to pass hctx instead of request queue, because blk-mq's dispatch target is hctx instead of request queue. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: pass request queue into get/put budget callbackMing Lei2020-06-301-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | blk-mq budget is abstract from scsi's device queue depth, and it is always per-request-queue instead of hctx. It can be quite absurd to get a budget from one hctx, then dequeue a request from scheduler queue, and this request may not belong to this hctx, at least for bfq and deadline. So fix the mess and always pass request queue to get/put budget callback. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: make function '__blk_mq_sched_dispatch_requests' staticZheng Bin2020-04-291-1/+1
| | | | | | | | | | | Fix sparse warnings: block/blk-mq-sched.c:209:5: warning: symbol '__blk_mq_sched_dispatch_requests' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Limit number of items taken from the I/O scheduler in one goSalman Qazi2020-04-241-13/+51
| | | | | | | | | | | | | | | | | | | | | | | | Flushes bypass the I/O scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert. This can happen while a kworker is running hctx->run_work work item and is past the point in blk_mq_sched_dispatch_requests where hctx->dispatch is checked. The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time, because the I/O scheduler can feed an arbitrary number of commands. Since we have only one hctx->run_work, the commands waiting in hctx->dispatch will wait an arbitrary length of time for run_work to be rerun. A similar phenomenon exists with dispatches from the software queue. The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and blk_mq_do_dispatch_ctx and return from the run_work handler and let it rerun. Signed-off-by: Salman Qazi <sqazi@google.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Rerun dispatching in the case of budget contentionDouglas Anderson2020-04-201-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If ever a thread running blk-mq code tries to get budget and fails it immediately stops doing work and assumes that whenever budget is freed up that queues will be kicked and whatever work the thread was trying to do will be tried again. One path where budget is freed and queues are kicked in the normal case can be seen in scsi_finish_command(). Specifically: - scsi_finish_command() - scsi_device_unbusy() - # Decrement "device_busy", AKA release budget - scsi_io_completion() - scsi_end_request() - blk_mq_run_hw_queues() The above is all well and good. The problem comes up when a thread claims the budget but then releases it without actually dispatching any work. Since we didn't schedule any work we'll never run the path of finishing work / kicking the queues. This isn't often actually a problem which is why this issue has existed for a while and nobody noticed. Specifically we only get into this situation when we unexpectedly found that we weren't going to do any work. Code that later receives new work kicks the queues. All good, right? The problem shows up, however, if timing is just wrong and we hit a race. To see this race let's think about the case where we only have a budget of 1 (only one thread can hold budget). Now imagine that a thread got budget and then decided not to dispatch work. It's about to call put_budget() but then the thread gets context switched out for a long, long time. While in this state, any and all kicks of the queue (like the when we received new work) will be no-ops because nobody can get budget. Finally the thread holding budget gets to run again and returns. All the normal kicks will have been no-ops and we have an I/O stall. As you can see from the above, you need just the right timing to see the race. To start with, the only case it happens if we thought we had work, actually managed to get the budget, but then actually didn't have work. That's pretty rare to start with. Even then, there's usually a very small amount of time between realizing that there's no work and putting the budget. During this small amount of time new work has to come in and the queue kick has to make it all the way to trying to get the budget and fail. It's pretty unlikely. One case where this could have failed is illustrated by an example of threads running blk_mq_do_dispatch_sched(): * Threads A and B both run has_work() at the same time with the same "hctx". Imagine has_work() is exact. There's no lock, so it's OK if Thread A and B both get back true. * Thread B gets interrupted for a long time right after it decides that there is work. Maybe its CPU gets an interrupt and the interrupt handler is slow. * Thread A runs, get budget, dispatches work. * Thread A's work finishes and budget is released. * Thread B finally runs again and gets budget. * Since Thread A already took care of the work and no new work has come in, Thread B will get NULL from dispatch_request(). I believe this is specifically why dispatch_request() is allowed to return NULL in the first place if has_work() must be exact. * Thread B will now be holding the budget and is about to call put_budget(), but hasn't called it yet. * Thread B gets interrupted for a long time (again). Dang interrupts. * Now Thread C (maybe with a different "hctx" but the same queue) comes along and runs blk_mq_do_dispatch_sched(). * Thread C won't do anything because it can't get budget. * Finally Thread B will run again and put the budget without kicking any queues. Even though the example above is with blk_mq_do_dispatch_sched() I believe the race is possible any time someone is holding budget but doesn't do work. Unfortunately, the unlikely has become more likely if you happen to be using the BFQ I/O scheduler. BFQ, by design, sometimes returns "true" for has_work() but then NULL for dispatch_request() and stays in this state for a while (currently up to 9 ms). Suddenly you only need one race to hit, not two races in a row. With my current setup this is easy to reproduce in reboot tests and traces have actually shown that we hit a race similar to the one described above. Note that we only need to fix blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx() and not the other places that put budget. In other cases we know that we have work to do on at least one "hctx" and code already exists to kick that "hctx"'s queue. When that work finally finishes all the queues will be kicked using the normal flow. One last note is that (at least in the SCSI case) budget is shared by all "hctx"s that have the same queue. Thus we need to make sure to kick the whole queue, not just re-run dispatching on a single "hctx". Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: insert flush request to the front of dispatch queueMing Lei2020-03-121-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly") may change to add flush request to the tail of dispatch by applying the 'add_head' parameter of blk_mq_sched_insert_request. Turns out this way causes performance regression on NCQ controller because flush is non-NCQ command, which can't be queued when there is any in-flight NCQ command. When adding flush rq to the front of hctx->dispatch, it is easier to introduce extra time to flush rq's latency compared with adding to the tail of dispatch queue because of S_SCHED_RESTART, then chance of flush merge is increased, and less flush requests may be issued to controller. So always insert flush request to the front of dispatch queue just like before applying commit 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly"). Cc: Damien Le Moal <Damien.LeMoal@wdc.com> Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: insert passthrough request into hctx->dispatch directlyMing Lei2020-02-251-7/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | For some reason, device may be in one situation which can't handle FS request, so STS_RESOURCE is always returned and the FS request will be added to hctx->dispatch. However passthrough request may be required at that time for fixing the problem. If passthrough request is added to scheduler queue, there isn't any chance for blk-mq to dispatch it given we prioritize requests in hctx->dispatch. Then the FS IO request may never be completed, and IO hang is caused. So passthrough request has to be added to hctx->dispatch directly for fixing the IO hang. Fix this issue by inserting passthrough request into hctx->dispatch directly together withing adding FS request to the tail of hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS request to tail of hctx->dispatch at default, see blk_mq_request_bypass_insert(). Then it becomes consistent with original legacy IO request path, in which passthrough request is always added to q->queue_head. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: move lockdep_assert_held() into elevator_exitMing Lei2019-09-261-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq") removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with lockdep_assert_held() called in blk_mq_sched_free_requests() which is run in failure path of elevator_init_mq(). blk_mq_sched_free_requests() is called in the following 3 functions: elevator_init_mq() elevator_exit() blk_cleanup_queue() In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly by 'mutex_lock(&q->sysfs_lock)'. So moving the lockdep_assert_held() from blk_mq_sched_free_requests() into elevator_exit() for fixing the report by syzbot. Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com Fixed: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: remove blk_mq_put_ctx()Bart Van Assche2019-07-031-4/+1
| | | | | | | | | | | | | | | No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx(). Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the bi_phys_segments field in struct bioChristoph Hellwig2019-06-201-11/+15
| | | | | | | | | | | | | | | | | We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requestsMing Lei2019-06-131-1/+0
| | | | | | | | | | | | | | | | | blk_mq_sched_free_requests() may be called in failure path in which q->elevator may not be setup yet, so remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests for avoiding the false positive. This function is actually safe to call in case of !q->elevator because hctx->sched_tags is checked. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Yi Zhang <yi.zhang@redhat.com> Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue") Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: free sched's request pool in blk_cleanup_queueMing Lei2019-06-071-3/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In theory, IO scheduler belongs to request queue, and the request pool of sched tags belongs to the request queue too. However, the current tags allocation interfaces are re-used for both driver tags and sched tags, and driver tags is definitely host wide, and doesn't belong to any request queue, same with its request pool. So we need tagset instance for freeing request of sched tags. Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched tags to be freed before calling blk_mq_free_tag_set(). Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue") moves blk_exit_queue into __blk_release_queue for simplying the fast path in generic_make_request(), then causes oops during freeing requests of sched tags in __blk_release_queue(). Fix the above issue by move freeing request pool of sched tags into blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any in-queue requests at that time. Freeing sched tags has to be kept in queue's release handler becasue there might be un-completed dispatch activity which might refer to sched tags. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue") Tested-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: grab .q_usage_counter when queuing request from plug code pathMing Lei2019-05-041-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like aio/io_uring, we need to grab 2 refcount for queuing one request, one is for submission, another is for completion. If the request isn't queued from plug code path, the refcount grabbed in generic_make_request() serves for submission. In theroy, this refcount should have been released after the sumission(async run queue) is done. blk_freeze_queue() works with blk_sync_queue() together for avoiding race between cleanup queue and IO submission, given async run queue activities are canceled because hctx->run_work is scheduled with the refcount held, so it is fine to not hold the refcount when running the run queue work function for dispatch IO. However, if request is staggered into plug list, and finally queued from plug code path, the refcount in submission side is actually missed. And we may start to run queue after queue is removed because the queue's kobject refcount isn't guaranteed to be grabbed in flushing plug list context, then kernel oops is triggered, see the following race: blk_mq_flush_plug_list(): blk_mq_sched_insert_requests() insert requests to sw queue or scheduler queue blk_mq_run_hw_queue Because of concurrent run queue, all requests inserted above may be completed before calling the above blk_mq_run_hw_queue. Then queue can be freed during the above blk_mq_run_hw_queue(). Fixes the issue by grab .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is safe because the queue is absolutely alive before inserting request. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add SPDX tags to block layer files missing licensing informationChristoph Hellwig2019-05-011-0/+1
| | | | | | | | | Various block layer files do not have any licensing information at all. Add SPDX tags for the default kernel GPLv2 license to those. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Revert v5.0 blk_mq_request_issue_directly() changesBart Van Assche2019-04-051-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that have been queued. If that happens when blk_mq_try_issue_directly() is called by the dm-mpath driver then dm-mpath will try to resubmit a request that is already queued and a kernel crash follows. Since it is nontrivial to fix blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly() changes that went into kernel v5.0. This patch reverts the following commits: * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0. * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0. * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: James Smart <james.smart@broadcom.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: <stable@vger.kernel.org> Reported-by: Laurence Oberman <loberman@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: save queue mapping result into ctx directlyJianchao Wang2019-02-011-1/+1
| | | | | | | | | | | | | | | Currently, the queue mapping result is saved in a two-dimensional array. In the hot path, to get a hctx, we need do following: q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]] This isn't very efficient. We could save the queue mapping result into ctx directly with different hctx type, like, ctx->hctxs[type] Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix dispatch from sw queueMing Lei2018-12-171-3/+8
| | | | | | | | | | | | | | | | | | | | | | When a request is added to rq list of sw queue(ctx), the rq may be from a different type of hctx, especially after multi queue mapping is introduced. So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), one request belonging to other queue type of hctx can be dispatched to current hctx in case that read queue or poll queue is enabled. This patch fixes this issue by introducing per-queue-type list. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Changed by me to not use separately cacheline aligned lists, just place them all in the same cacheline where we had just the one list and lock before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: mq-deadline: Fix write completion handlingDamien Le Moal2018-12-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a zoned block device using mq-deadline, if a write request for a zone is received while another write was already dispatched for the same zone, dd_dispatch_request() will return NULL and the newly inserted write request is kept in the scheduler queue waiting for the ongoing zone write to complete. With this behavior, when no other request has been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to __blk_mq_free_request() call of blk_mq_sched_restart() to not run the queue when the already dispatched write request completes. The newly dispatched request stays stuck in the scheduler queue until eventually another request is submitted. This problem does not affect SCSI disk as the SCSI stack handles queue restart on request completion. However, this problem is can be triggered the nullblk driver with zoned mode enabled. Fix this by always requesting a queue restart in dd_dispatch_request() if no request was dispatched while WRITE requests are queued. Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Add missing export of blk_mq_sched_restart() Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requestsJianchao Wang2018-12-161-5/+3
| | | | | | | | | | | | It is not necessary to issue request directly with bypass 'true' in blk_mq_sched_insert_requests and handle the non-issued requests itself. Just set bypass to 'false' and let blk_mq_try_issue_directly handle them totally. Remove the blk_rq_can_direct_dispatch check, because blk_mq_try_issue_directly can handle it well.If request is direct-issued unsuccessfully, insert the reset. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: fix attempt to assign NULL io_contextJens Axboe2018-11-211-1/+8
| | | | | | | | | | | If the first request allocated and issued by a process is a passhthrough request, we don't set up an IO context for it. Ensure that blk_mq_sched_assign_ioc() ignores a NULL io_context. Fixes: e2b3fa5af70c ("block: Remove bio->bi_ioc") Reported-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove bio->bi_iocDamien Le Moal2018-11-201-2/+2
| | | | | | | | | | | | | | | | bio->bi_ioc is never set so always NULL. Remove references to it in bio_disassociate_task() and in rq_ioc() and delete this field from struct bio. With this change, rq_ioc() always returns current->io_context without the need for a bio argument. Further simplify the code and make it more readable by also removing this helper, which also allows to simplify blk_mq_sched_assign_ioc() by removing its bio argument. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the queue_lock indirectionChristoph Hellwig2018-11-151-2/+2
| | | | | | | | | | | | With the legacy request path gone there is no good reason to keep queue_lock as a pointer, we can always use the embedded lock now. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Fixed floppy and blk-cgroup missing conversions and half done edits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: cleanup and improve list insertionJens Axboe2018-11-071-7/+1
| | | | | | | | | | | | It's somewhat strange to have a list insertion function that relies on the fact that the caller has mapped things correctly. Pass in the hardware queue directly for insertion, which makes for a much cleaner interface and implementation. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>