summaryrefslogtreecommitdiffstats
path: root/block/blk-mq.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* blk-mq: insert rq with DONTPREP to hctx dispatch list when requeueJianchao Wang2019-02-121-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When requeue, if RQF_DONTPREP, rq has contained some driver specific data, so insert it to hctx dispatch list to avoid any merge. Take scsi as example, here is the trace event log (no io scheduler, because RQF_STARTED would prevent merging), kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H] scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test] scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test] kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H] scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0] scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0] kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0] (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP. Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP, the sdb only contained the part of (32768 + 8), then only that part was completed. The lucky thing was that scsi_io_completion detected it and requeued the remaining part. So we didn't get corrupted data. However, the requeue of (32776 + 8) is not expected. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: don't lose track of REQ_INTEGRITY flagMing Lei2019-01-161-1/+2
| | | | | | | | | | | | We need to pass bio->bi_opf after bio intergrity preparing, otherwise the flag of REQ_INTEGRITY may not be set on the allocated request, then breaks block integrity. Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping") Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: make request_to_qc_t publicSagi Grimberg2018-12-181-8/+0
| | | | | | | | | | block consumers will need it for polling requests that are sent with blk_execute_rq_nowait. Also, get rid of blk_tag_to_qc_t and open-code it instead. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* blk-mq: enable IO poll if .nr_queues of type poll > 0Ming Lei2018-12-181-1/+2
| | | | | | | | | | | | | The queue mapping of type poll only exists when set->map[HCTX_TYPE_POLL].nr_queues is bigger than zero, so enhance the constraint by checking .nr_queues of type poll before enabling IO poll. Otherwise IO race & timeout can be observed when running block/007. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()Jens Axboe2018-12-181-8/+8
| | | | | | | | | | | | There's a single user of this function, dm, and dm just wants to check if IO is inflight, not that it's just allocated. This fixes a hang with srp/002 in blktests with dm, where it tries to suspend but waits for inflight IO to finish first. As it checks for just allocated requests, this fails. Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: skip zero-queue maps in blk_mq_map_swqueueMing Lei2018-12-171-0/+3
| | | | | | | | | | | | | | | From 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping table won't be initialized actually if map->nr_queues is zero, so we can't use blk_mq_map_queue_type() to retrieve hctx any more. This way still may cause broken mapping, fix it by skipping zero-queues maps in blk_mq_map_swqueue(). Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix dispatch from sw queueMing Lei2018-12-171-10/+19
| | | | | | | | | | | | | | | | | | | | | | When a request is added to rq list of sw queue(ctx), the rq may be from a different type of hctx, especially after multi queue mapping is introduced. So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), one request belonging to other queue type of hctx can be dispatched to current hctx in case that read queue or poll queue is enabled. This patch fixes this issue by introducing per-queue-type list. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Changed by me to not use separately cacheline aligned lists, just place them all in the same cacheline where we had just the one list and lock before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix allocation for queue mapping tableMing Lei2018-12-171-1/+1
| | | | | | | | | | | | Type of each element in queue mapping table is 'unsigned int, intead of 'struct blk_mq_queue_map)', so fix it. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: replace and kill blk_mq_request_issue_directlyJianchao Wang2018-12-161-8/+1
| | | | | | | | Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly in blk_insert_cloned_request and kill it as nobody uses it any more. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requestsJianchao Wang2018-12-161-11/+9
| | | | | | | | | | | | It is not necessary to issue request directly with bypass 'true' in blk_mq_sched_insert_requests and handle the non-issued requests itself. Just set bypass to 'false' and let blk_mq_try_issue_directly handle them totally. Remove the blk_rq_can_direct_dispatch check, because blk_mq_try_issue_directly can handle it well.If request is direct-issued unsuccessfully, insert the reset. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: refactor the code of issue request directlyJianchao Wang2018-12-161-49/+54
| | | | | | | | | | | | | | | | | | | Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly into one interface to unify the interfaces to issue requests directly. The merged interface takes over the requests totally, it could insert, end or do nothing based on the return value of .queue_rq and 'bypass' parameter. Then caller needn't any other handling any more and then code could be cleaned up. And also the commit c616cbee ( blk-mq: punt failed direct issue to dispatch list ) always inserts requests to hctx dispatch list whenever get a BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, this is overkill and will harm the merging. We just need to do that for the requests that has been through .queue_rq. This patch also could fix this. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: return just one value from part_in_flightMikulas Patocka2018-12-101-7/+5
| | | | | | | | | | | | | | The previous patches deleted all the code that needed the second value returned from part_in_flight - now the kernel only uses the first value. Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that it only returns one value. This patch just refactors the code, there's no functional change. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'v4.20-rc6' into for-4.21/blockJens Axboe2018-12-101-3/+4
|\ | | | | | | | | | | | | | | | | Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the two corruption fixes. We're going to be overhauling the direct dispatch path, and we need to do that on top of the changes we made for that in mainline. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: punt failed direct issue to dispatch listJens Axboe2018-12-071-28/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the direct dispatch corruption fix, we permanently disallow direct dispatch of non read/write requests. This works fine off the normal IO path, as they will be retried like any other failed direct dispatch request. But for the blk_insert_cloned_request() that only DM uses to bypass the bottom level scheduler, we always first attempt direct dispatch. For some types of requests, that's now a permanent failure, and no amount of retrying will make that succeed. This results in a livelock. Instead of making special cases for what we can direct issue, and now having to deal with DM solving the livelock while still retaining a BUSY condition feedback loop, always just add a request that has been through ->queue_rq() to the hardware queue dispatch list. These are safe to use as no merging can take place there. Additionally, if requests do have prepped data from drivers, we aren't dependent on them not sharing space in the request structure to safely add them to the IO scheduler lists. This basically reverts ffe81d45322c and is based on a patch from Ming, but with the list insert case covered as well. Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") Cc: stable@vger.kernel.org Suggested-by: Ming Lei <ming.lei@redhat.com> Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Ming Lei <ming.lei@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: fix corruption with direct issueJens Axboe2018-12-051-1/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we attempt a direct issue to a SCSI device, and it returns BUSY, then we queue the request up normally. However, the SCSI layer may have already setup SG tables etc for this particular command. If we later merge with this request, then the old tables are no longer valid. Once we issue the IO, we only read/write the original part of the request, not the new state of it. This causes data corruption, and is most often noticed with the file system complaining about the just read data being invalid: [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) because most of it is garbage... This doesn't happen from the normal issue path, as we will simply defer the request to the hardware queue dispatch list if we fail. Once it's on the dispatch list, we never merge with it. Fix this from the direct issue path by flagging the request as REQ_NOMERGE so we don't change the size of it before issue. See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: re-build queue map in case of kdump kernelMing Lei2018-12-081-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now almost all .map_queues() implementation based on managed irq affinity doesn't update queue mapping and it just retrieves the old built mapping, so if nr_hw_queues is changed, the mapping talbe includes stale mapping. And only blk_mq_map_queues() may rebuild the mapping talbe. One case is that we limit .nr_hw_queues as 1 in case of kdump kernel. However, drivers often builds queue mapping before allocating tagset via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set as 1 in case of kdump kernel, so wrong queue mapping is used, and kernel panic[1] is observed during booting. This patch fixes the kernel panic triggerd on nvme by rebulding the mapping table via blk_mq_map_queues(). [1] kernel panic log [ 4.438371] nvme nvme0: 16/0/0 default/read/poll queues [ 4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 4.444681] PGD 0 P4D 0 [ 4.445367] Oops: 0000 [#1] SMP NOPTI [ 4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459 [ 4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222 [ 4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6 [ 4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286 [ 4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001 [ 4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001 [ 4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002 [ 4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538 [ 4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001 [ 4.469220] FS: 0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000 [ 4.471554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0 [ 4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4.477061] PKRU: 55555554 [ 4.477464] Call Trace: [ 4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad [ 4.479595] blk_mq_init_queue+0x32/0x4e [ 4.480178] nvme_validate_ns+0x98/0x623 [nvme_core] [ 4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core] [ 4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core] [ 4.482601] nvme_scan_work+0x23a/0x29b [nvme_core] [ 4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38 [ 4.483930] ? try_to_wake_up+0x38d/0x3b3 [ 4.484478] ? process_one_work+0x179/0x2fc [ 4.485118] process_one_work+0x1d3/0x2fc [ 4.485655] ? rescuer_thread+0x2ae/0x2ae [ 4.486196] worker_thread+0x1e9/0x2be [ 4.486841] kthread+0x115/0x11d [ 4.487294] ? kthread_park+0x76/0x76 [ 4.487784] ret_from_fork+0x3a/0x50 [ 4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables [ 4.489428] Dumping ftrace buffer: [ 4.489939] (ftrace buffer empty) [ 4.490492] CR2: 0000000000000098 [ 4.491052] ---[ end trace 03cd268ad5a86ff7 ]--- Cc: Christoph Hellwig <hch@lst.de> Cc: linux-nvme@lists.infradead.org Cc: David Milburn <dmilburn@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: enable polling by default if a poll map is initalizedChristoph Hellwig2018-12-041-0/+2
| | | | | | | | | | | | | | | | | | If the user did setup polling in the driver we should not require another know in the block layer to enable it. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: remove ->poll_fnChristoph Hellwig2018-12-041-5/+19
| | | | | | | | | | | | | | | | | | This was intended to support users like nvme multipath, but is just getting in the way and adding another indirect call. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: don't call ktime_get_ns() if we don't need itJens Axboe2018-12-031-2/+17
| | | | | | | | | | | | | | | | | | We only need the request fields and the end_io time if we have stats enabled, or if we have a scheduler attached as those may use it for completion time stats. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: use plug for devices that implement ->commits_rqs()Jens Axboe2018-11-291-1/+5
| | | | | | | | | | | | | | | | | | | | | | If we have that hook, we know the driver handles bd->last == true in a smart fashion. If it does, even for multiple hardware queues, it's a good idea to flush batches of requests to the device, if we have batches of requests from the submitter. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: use bd->last == true for list insertsJens Axboe2018-11-291-8/+8
| | | | | | | | | | | | | | | | | | | | | | If we are issuing a list of requests, we know if we're at the last one. If we fail issuing, ensure that we call ->commits_rqs() to flush any potential previous requests. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: add mq_ops->commit_rqs()Jens Axboe2018-11-291-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq passes information to the hardware about any given request being the last that we will issue in this sequence. The point is that hardware can defer costly doorbell type writes to the last request. But if we run into errors issuing a sequence of requests, we may never send the request with bd->last == true set. For that case, we need a hook that tells the hardware that nothing else is coming right now. For failures returned by the drivers ->queue_rq() hook, the driver is responsible for flushing pending requests, if it uses bd->last to optimize that part. This works like before, no changes there. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: improve logic around when to sort a plug listJens Axboe2018-11-291-5/+18
| | | | | | | | | | | | | | | | Only do it if we have requests for multiple queues in the same plug. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: Add a NULL check in blk_mq_free_map_and_requests()Dan Carpenter2018-11-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | I recently found some code which called blk_mq_free_map_and_requests() with a NULL set->tags pointer. I fixed the caller, but it seems like a good idea to add a NULL check here as well. Now we can call: blk_mq_free_tag_set(set); blk_mq_free_tag_set(set); twice in a row and it's harmless. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: fix failure to decrement plug count on single rq removalJens Axboe2018-11-281-1/+3
| | | | | | | | | | | | | | | | If we yank a 'same_queue_rq' request off the plug list, we should also decrement the cached request count. Fixes: 5f0ed774ed29 ("block: sum requests in the plug structure") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: sum requests in the plug structureJens Axboe2018-11-261-11/+5
| | | | | | | | | | | | | | | | | | | | This isn't exactly the same as the previous count, as it includes requests for all devices. But that really doesn't matter, if we have more than the threshold (16) queued up, flush it. It's not worth it to have an expensive list loop for this. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: Simplify request completion stateKeith Busch2018-11-261-3/+1
| | | | | | | | | | | | | | | | | | | | There are no more users relying on blk-mq request states to prevent double completions, so replace the relatively expensive cmpxchg operation with WRITE_ONCE. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: Return true if request was completedKeith Busch2018-11-261-2/+3
| | | | | | | | | | | | | | | | | | | | A driver may have internal state to cleanup if we're pretending a request didn't complete. Return 'false' if the command wasn't actually completed due to the timeout error injection, and true otherwise. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: never redirect polled IO completionsJens Axboe2018-11-261-1/+6
| | | | | | | | | | | | | | | | It's pointless to do so, we are by definition on the CPU we want/need to be, as that's the one waiting for a completion event. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: ensure mq_ops ->poll() is entered at least onceJens Axboe2018-11-261-2/+2
| | | | | | | | | | | | | | | | | | | | Right now we immediately bail if need_resched() is true, but we need to do at least one loop in case we have entries waiting. So just invert the need_resched() check, putting it at the bottom of the loop. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: make blk_poll() take a parameter on whether to spin or notJens Axboe2018-11-261-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | blk_poll() has always kept spinning until it found an IO. This is fine for SYNC polling, since we need to find one request we have pending, but in preparation for ASYNC polling it can be beneficial to just check if we have any entries available or not. Existing callers are converted to pass in 'spin == true', to retain the old behavior. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: remove 'tag' parameter from mq_ops->poll()Jens Axboe2018-11-261-1/+1
| | | | | | | | | | | | | | | | We always pass in -1 now and none of the callers use the tag value, remove the parameter. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: when polling for IO, look for any completionJens Axboe2018-11-261-35/+36
| | | | | | | | | | | | | | | | | | If we want to support async IO polling, then we have to allow finding completions that aren't just for the one we are looking for. Always pass in -1 to the mq_ops->poll() helper, and have that return how many events were found in this poll loop. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: not embed .mq_kobj and ctx->kobj into queue instanceMing Lei2018-11-211-7/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime from block layer's view, actually they don't because userspace may grab one kobject anytime via sysfs. This patch fixes the issue by the following approach: 1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing all ctxs 2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release handler of .mq_kobj 3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that .mq_kobj is always released after all ctxs are freed. This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled. Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: Remove bio->bi_iocDamien Le Moal2018-11-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio->bi_ioc is never set so always NULL. Remove references to it in bio_disassociate_task() and in rq_ioc() and delete this field from struct bio. With this change, rq_ioc() always returns current->io_context without the need for a bio argument. Further simplify the code and make it more readable by also removing this helper, which also allows to simplify blk_mq_sched_assign_ioc() by removing its bio argument. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: have ->poll_fn() return number of entries polledJens Axboe2018-11-191-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently only really support sync poll, ie poll with 1 IO in flight. This prepares us for supporting async poll. Note that the returned value isn't necessarily 100% accurate. If poll races with IRQ completion, we assume that the fact that the task is now runnable means we found at least one entry. In reality it could be more than 1, or not even 1. This is fine, the caller will just need to take this into account. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: avoid ordered task state change for polled IOJens Axboe2018-11-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | For the core poll helper, the task state setting don't need to imply any atomics, as it's the current task itself that is being modified and we're not going to sleep. For IRQ driven, the wakeup path have the necessary barriers to not need us using the heavy handed version of the task state setting. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: add queue_is_mq() helperJens Axboe2018-11-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Various spots check for q->mq_ops being non-NULL, but provide a helper to do this instead. Where the ->mq_ops != NULL check is redundant, remove it. Since mq == rq-based now that legacy is gone, get rid of the queue_is_rq_based() and just use queue_is_mq() everywhere. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: remove the lock argument to blk_alloc_queue_nodeChristoph Hellwig2018-11-151-1/+1
| | | | | | | | | | | | | | | | | | With the legacy request path gone there is no real need to override the queue_lock. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: remove the unused lock argument to rq_qos_throttleChristoph Hellwig2018-11-151-1/+1
| | | | | | | | | | | | | | | | Unused now that the legacy request path is gone. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: use atomic bitops for ->queue_flagsChristoph Hellwig2018-11-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | ->queue_flags is generally not set or cleared in the fast path, and also generally set or cleared one flag at a time. Make use of the normal atomic bitops for it so that we don't need to take the queue_lock, which is otherwise mostly unused in the core block layer now. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: remove deadline __deadline manipulation helpersChristoph Hellwig2018-11-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | No users left since the removal of the legacy request interface, we can remove all the magic bit stealing now and make it a normal field. But use WRITE_ONCE/READ_ONCE on the new deadline field, given that we don't seem to have any mechanism to guarantee a new value actually gets seen by other threads. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: remove req->timeout_listChristoph Hellwig2018-11-091-1/+0
| | | | | | | | | | | | | | Unused now that the legacy request path is gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: provide a helper to check if a queue is busyJens Axboe2018-11-081-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | Returns true if the queue currently has requests pending, false if not. DM can use this to replace the atomic_inc/dec they do per device to see if a device is busy. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq-tag: change busy_iter_fn to return whether to continue or notJens Axboe2018-11-081-5/+11
| | | | | | | | | | | | | | | | | | | | We have this functionality in sbitmap, but we don't export it in blk-mq for users of the tags busy iteration. This can be useful for stopping the iteration, if the caller doesn't need to find more requests. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: improve plug list sortingJens Axboe2018-11-071-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we only look at the software queue, but with support for multiple maps, we should also look at the hardware queue. This is important since we'll flush out the request list if either the software queue or hardware queue don't match. This sorts by software queue first, then hardware queue if that differs. Finally we sort by request location like before. This minimizes the flush points per plug list. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: cleanup and improve list insertionJens Axboe2018-11-071-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | It's somewhat strange to have a list insertion function that relies on the fact that the caller has mapped things correctly. Pass in the hardware queue directly for insertion, which makes for a much cleaner interface and implementation. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: cache request hardware queue mappingJens Axboe2018-11-071-13/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | We call blk_mq_map_queue() a lot, at least two times for each request per IO, sometimes more. Since we now have an indirect call as well in that function. cache the mapping so we don't have to re-call blk_mq_map_queue() for the same request multiple times. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: separate number of hardware queues from nr_cpu_idsJens Axboe2018-11-071-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | With multiple maps, nr_cpu_ids is no longer the maximum number of hardware queues we support on a given devices. The initializer of the tag_set can have set ->nr_hw_queues larger than the available number of CPUs, since we can exceed that with multiple queue maps. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: support multiple hctx mapsJens Axboe2018-11-071-31/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for the tag set carrying multiple queue maps, and for the driver to inform blk-mq how many it wishes to support through setting set->nr_maps. This adds an mq_ops helper for drivers that support more than 1 map, mq_ops->rq_flags_to_type(). The function takes request/bio flags and CPU, and returns a queue map index for that. We then use the type information in blk_mq_map_queue() to index the map set. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>