path: root/block/blk-mq-sched.c
Commit message  [Author, Age, Files, Lines]
* blk-mq: only run the hardware queue if IO is pending  [Jens Axboe, 2017-11-11, 1 file, -6/+1]
    Currently we are inconsistent in when we decide to run the queue. Using blk_mq_run_hw_queues() we check if the hctx has pending IO before running it, but we don't do that from the individual queue run function, blk_mq_run_hw_queue(). This results in a lot of extra and pointless queue runs, potentially, on flush requests and (much worse) on tag starvation situations. This is observable just looking at top output, with lots of kworkers active. For the !async runs, it just adds to the CPU overhead of blk-mq.

    Move the has-pending check into the run function instead of having callers do it.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
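The change described in this commit can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct toy_hctx`, its fields, and the `toy_*` functions are hypothetical names invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a hardware queue context; not the kernel struct. */
struct toy_hctx {
    int nr_pending;   /* requests waiting to be dispatched */
    int nr_runs;      /* how many times the queue was actually run */
};

static bool toy_hctx_has_pending(const struct toy_hctx *hctx)
{
    return hctx->nr_pending > 0;
}

/*
 * The pending check lives inside the run function, so every caller gets it
 * without having to repeat it. Returns true if the queue was run.
 */
static bool toy_run_hw_queue(struct toy_hctx *hctx)
{
    if (!toy_hctx_has_pending(hctx))
        return false;        /* nothing to do: skip the pointless run */

    hctx->nr_runs++;
    hctx->nr_pending = 0;    /* pretend everything was dispatched */
    return true;
}
```

Calling `toy_run_hw_queue()` on an idle queue returns without counting a run, which is the behaviour the commit message asks for from the single-queue run path.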
* Revert "blk-mq: don't handle TAG_SHARED in restart"  [Jens Axboe, 2017-11-11, 1 file, -4/+74]
    This reverts commit 358a3a6bccb74da9d63a26b2dd5f09f1e9970e0b.

    We have cases that aren't covered 100% in the drivers, so for now we have to retain the shared tag restart loops.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: don't allocate driver tag upfront for flush rq  [Ming Lei, 2017-11-04, 1 file, -37/+5]
    The idea behind it is simple:

    1) for none scheduler, driver tag has to be borrowed for flush rq, otherwise we may run out of tag, and that causes an IO hang. And get/put driver tag is actually noop for none, so reordering tags isn't necessary at all.

    2) for a real I/O scheduler, we need not allocate a driver tag upfront for flush rq. It works just fine to follow the same approach as normal requests: allocate driver tag for each rq just before calling ->queue_rq().

    One driver visible change is that the driver tag isn't shared in the flush request sequence. That won't be a problem, since we always do that in legacy path.

    Then flush rq need not be treated specially wrt. get/put driver tag. This cleans up the code - for instance, reorder_tags_to_front() can be removed, and we needn't worry about request ordering in dispatch list for avoiding I/O deadlock.

    Also we have to put the driver tag before requeueing.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ  [Ming Lei, 2017-11-04, 1 file, -13/+16]
    In case of IO scheduler we always pre-allocate one driver tag before calling blk_insert_flush(), and flush request will be marked as RQF_FLUSH_SEQ once it is in flush machinery.

    So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle the request, otherwise the flush request is dispatched to ->dispatch list directly.

    This is a preparation patch for not preallocating a driver tag for flush requests, and for not treating flush requests as a special case. This is similar to what the legacy path does.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: don't handle failure in .get_budget  [Ming Lei, 2017-11-04, 1 file, -12/+2]
    It is enough to just check if we can get the budget via .get_budget(). And we don't need to deal with device state change in .get_budget().

    For SCSI, one issue to be fixed is that we have to call scsi_mq_uninit_cmd() to free allocated resources if the SCSI device fails to handle the request. And it isn't enough to simply call blk_mq_end_request() to do that if this request is marked as RQF_DONTPREP.

    Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq)
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE  [Ming Lei, 2017-11-01, 1 file, -25/+20]
    SCSI restarts its queue in scsi_end_request() automatically, so we don't need to handle this case in blk-mq.

    In particular, no request is dequeued in this case, so we needn't worry about an IO hang caused by restart vs. dispatch.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: don't handle TAG_SHARED in restart  [Ming Lei, 2017-11-01, 1 file, -74/+4]
    Now restart is used in the following cases, and TAG_SHARED is for SCSI only.

    1) .get_budget() returns BLK_STS_RESOURCE
    - if resource in target/host level isn't satisfied, this SCSI device will be added in shost->starved_list, and the whole queue will be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any request initiated from this host/target is completed. Note also that host level resource can't be an issue for blk-mq at all.
    - the same is true if resource in the queue level isn't satisfied.
    - if there isn't an outstanding request on this queue, then SCSI's RESTART can't work (blk-mq's can't work either), and the queue will be run after SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's RESTART when this request is finished

    2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
    - if there isn't an in-progress request on this queue, the queue will be run after SCSI_QUEUE_DELAY
    - otherwise, SCSI's RESTART covers the rerun.

    3) blk_mq_get_driver_tag() failed
    - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver allocation.

    In short, SCSI's built-in RESTART is enough to cover the queue rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.

    In my test on scsi_debug (8 luns), this patch improves IOPS by 20% ~ 30% when running I/O on these 8 luns concurrently.

    Also Roman Pen reported the current RESTART is very expensive, especially when there are lots of LUNs attached to one host; in his test, RESTART cuts IOPS in half.

    Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-sched: improve dispatching from sw queue  [Ming Lei, 2017-11-01, 1 file, -3/+71]
    SCSI devices use host-wide tagset, and the shared driver tag space is often quite big. However, there is also a queue depth for each lun (.cmd_per_lun), which is often small, for example, on both lpfc and qla2xxx, .cmd_per_lun is just 3.

    So lots of requests may stay in sw queue, and we always flush all belonging to same hw queue and dispatch them all to driver. Unfortunately it is easy to cause queue busy because of the small .cmd_per_lun. Once these requests are flushed out, they have to stay in hctx->dispatch, and no bio merge can happen on these requests, and sequential IO performance is harmed.

    This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request from a sw queue, so that we can dispatch them in scheduler's way. We can then avoid dequeueing too many requests from sw queue, since we don't flush ->dispatch completely.

    This patch improves dispatching from sw queue by using the .get_budget and .put_budget callbacks.

    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: introduce .get_budget and .put_budget in blk_mq_ops  [Ming Lei, 2017-11-01, 1 file, -10/+45]
    For SCSI devices, there is often a per-request-queue depth, which needs to be respected before queuing one request.

    Currently blk-mq always dequeues the request first, then calls .queue_rq() to dispatch the request to lld. One obvious issue with this approach is that I/O merging may not be successful, because when the per-request-queue depth can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch list. This means it never gets a chance to be merged with other IO.

    This patch introduces .get_budget and .put_budget callback in blk_mq_ops, then we can try to get reserved budget first before dequeuing request. If the budget for queueing I/O can't be satisfied, we don't need to dequeue request at all. Hence the request can be left in the IO scheduler queue, for more merging opportunities.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
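The "reserve budget first, dequeue second" ordering from this commit message can be sketched as a user-space toy model. Everything here is an assumption made for illustration: `struct toy_dev`, `struct toy_sched`, and the `toy_*` functions are hypothetical and do not mirror the real `blk_mq_ops` callback signatures.

```c
#include <stdbool.h>

/* Hypothetical device with a small per-queue depth. */
struct toy_dev {
    int depth;       /* maximum requests in flight */
    int in_flight;   /* budget currently held */
};

/* Hypothetical scheduler queue; only the number of queued requests matters. */
struct toy_sched {
    int queued;
};

static bool toy_get_budget(struct toy_dev *dev)
{
    if (dev->in_flight >= dev->depth)
        return false;
    dev->in_flight++;
    return true;
}

static void toy_put_budget(struct toy_dev *dev)
{
    if (dev->in_flight > 0)
        dev->in_flight--;
}

/* Dispatch as many requests as the budget allows; returns the count. */
static int toy_dispatch(struct toy_dev *dev, struct toy_sched *s)
{
    int dispatched = 0;

    for (;;) {
        if (!toy_get_budget(dev))
            break;                 /* device full: requests stay in the scheduler */
        if (s->queued == 0) {
            toy_put_budget(dev);   /* nothing to send: give the reservation back */
            break;
        }
        s->queued--;               /* dequeue only after budget is held */
        dispatched++;
    }
    return dispatched;
}
```

With a depth of 3 and five queued requests, three are dispatched and two remain in the toy scheduler queue, where (in the real block layer) they would still be candidates for merging.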
* blk-mq-sched: move actual dispatching into one helper  [Ming Lei, 2017-11-01, 1 file, -19/+24]
    So that it becomes easy to support dispatching from the sw queue in the following patch.

    No functional change.

    Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch  [Ming Lei, 2017-11-01, 1 file, -6/+6]
    When the hw queue is busy, we shouldn't take requests from the scheduler queue any more, otherwise it is difficult to do IO merge.

    This patch fixes the awful IO performance on some SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if hw queue is busy.

    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-sched: fix performance regression of mq-deadline  [Ming Lei, 2017-07-04, 1 file, -3/+5]
    When mq-deadline is used, the IOPS of sequential read and sequential write is observed to drop by more than 20% on sata (scsi-mq) devices, compared with using the 'none' scheduler.

    The reason is that the default nr_requests for the scheduler is too big for small queue depth devices, and latency increases considerably.

    Since the principle of taking 256 requests for mq scheduler is based on 128 queue depth, this patch changes it to double the size of min(hw queue_depth, 128).

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
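The sizing rule stated in this commit message, twice min(hw queue_depth, 128), works out as in the sketch below. The function name `toy_sched_nr_requests` and the macro `TOY_MAX_DEPTH` are hypothetical; only the formula itself comes from the message.

```c
/* Reference depth named in the commit message; the macro name is made up. */
#define TOY_MAX_DEPTH 128u

/* Default scheduler request count: 2 * min(hw_queue_depth, 128). */
static unsigned int toy_sched_nr_requests(unsigned int hw_queue_depth)
{
    unsigned int depth = hw_queue_depth < TOY_MAX_DEPTH
                             ? hw_queue_depth
                             : TOY_MAX_DEPTH;

    return 2 * depth;
}
```

For a device with a queue depth of 32 this gives 64 scheduler requests instead of a fixed 256, while devices with a depth of 128 or more still get 256.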
* Merge commit '8e8320c9315c' into for-4.13/block  [Jens Axboe, 2017-06-23, 1 file, -12/+46]
|\
    Pull in the fix for shared tags, as it conflicts with the pending changes in for-4.13/block. We already pulled in v4.12-rc5 to solve other conflicts or get fixes that went into 4.12, so not a lot of changes in this merge.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: fix performance regression with shared tags  [Jens Axboe, 2017-06-21, 1 file, -12/+46]
    If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices.

    Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to.

    Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy <maxg@mellanox.com>
    Tested-by: Max Gurtovoy <maxg@mellanox.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
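The counter trick described in this commit message can be sketched with C11 atomics in user space. This is an illustrative model only: `struct toy_queue`, the field `nr_restart_marked`, and the `toy_*` functions are hypothetical names, and the real patch operates on kernel data structures instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NR_HCTX 4

/* Hypothetical queue with per-hctx restart marks and a summary counter. */
struct toy_queue {
    atomic_int nr_restart_marked;      /* how many hctxs are marked */
    atomic_bool restart[TOY_NR_HCTX];  /* per-hctx "needs restart" mark */
    int scans;                         /* full scans performed (for the demo) */
};

static void toy_mark_restart(struct toy_queue *q, int i)
{
    /* Count only the false -> true transition, so marks are not double counted. */
    if (!atomic_exchange(&q->restart[i], true))
        atomic_fetch_add(&q->nr_restart_marked, 1);
}

/* Called on IO completion; returns how many hardware queues were restarted. */
static int toy_restart_on_completion(struct toy_queue *q)
{
    int restarted = 0;

    if (atomic_load(&q->nr_restart_marked) == 0)
        return 0;                      /* fast path: no scan at all */

    q->scans++;
    for (int i = 0; i < TOY_NR_HCTX; i++) {
        if (atomic_exchange(&q->restart[i], false)) {
            atomic_fetch_sub(&q->nr_restart_marked, 1);
            restarted++;
        }
    }
    return restarted;
}
```

The point of the design is the early return: when no hardware queue is marked, a completion costs one atomic load instead of a walk over every queue.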
* | blk-mq: Document locking assumptions  [Bart Van Assche, 2017-06-21, 1 file, -0/+2]
    Document the locking assumptions in functions that modify blk_mq_ctx.rq_list to make it easier for humans to verify this code.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Hannes Reinecke <hare@suse.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue  [Ming Lei, 2017-06-18, 1 file, -1/+2]
    It is required that no dispatch can happen any more once blk_mq_quiesce_queue() returns, and we don't have such a requirement on the APIs for stopping a queue.

    But blk_mq_quiesce_queue() still may not block/drain dispatch in the case of BLK_MQ_S_START_ON_RUN, so use the newly introduced flag QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical sections to fix this issue.

    Also blk_mq_quiesce_queue() is implemented via stopping the queue, which limits its uses and easily causes races, because any queue restart in other paths may break blk_mq_quiesce_queue(). With the introduced flag QUEUE_FLAG_QUIESCED, we don't need to depend on stopping the queue for quiescing any more.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: refactor blk_mq_sched_assign_ioc  [Christoph Hellwig, 2017-06-18, 1 file, -24/+4]
    blk_mq_sched_assign_ioc now only handles the assignment of the ioc if the scheduler needs it (bfq only at the moment). The call to the per-request initializer is moved out so that it can be merged with a similar call for the kyber I/O scheduler.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: remove blk_mq_sched_{get,put}_rq_priv  [Christoph Hellwig, 2017-06-18, 1 file, -4/+6]
    Having these as separate helpers in a header really does not help readability, or my chances to refactor this code sanely.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: move blk_mq_sched_{get,put}_request to blk-mq.c  [Christoph Hellwig, 2017-06-18, 1 file, -67/+2]
    Having them out of line in blk-mq-sched.c just makes the code flow unnecessarily complicated.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: make per-sw-queue bio merge as default .bio_merge  [Ming Lei, 2017-05-26, 1 file, -5/+57]
|/
    Because what the per-sw-queue bio merge does is basically the same as the scheduler's .bio_merge(), this patch makes per-sw-queue bio merge the default .bio_merge if no scheduler is used or the io scheduler doesn't provide .bio_merge().

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-debugfs: allow schedulers to register debugfs attributes  [Omar Sandoval, 2017-05-04, 1 file, -8/+16]
    This provides the infrastructure for schedulers to expose their internal state through debugfs. We add a list of queue attributes and a list of hctx attributes to struct elevator_type and wire them up when switching schedulers.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>

    Add missing seq_file.h header in blk-mq-debugfs.h

    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: remove hack that bypasses scheduler for reserved requests  [Jens Axboe, 2017-05-02, 1 file, -5/+1]
    We have updated the troublesome driver (mtip32xx) to deal with this appropriately. So kill the hack that bypassed scheduler allocation and insertion for reserved requests.

    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: alloate reserved tags out of normal pool  [Jens Axboe, 2017-04-27, 1 file, -1/+5]
    At least one driver, mtip32xx, has a hard coded dependency on the value of the reserved tag used for internal commands. While that should really be fixed up, for now let's ensure that we just bypass the scheduler tags for an allocation marked as reserved. They are used for housekeeping or error handling, so we can safely ignore them in the scheduler.

    Tested-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: Remove blk_mq_sched_move_to_dispatch()  [Bart Van Assche, 2017-04-21, 1 file, -16/+0]
    commit c13660a08c8b ("blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()") removed the last user of this function. Hence also remove the function itself.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Hannes Reinecke <hare@suse.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: provide hooks for initializing hardware queue data  [Omar Sandoval, 2017-04-07, 1 file, -38/+43]
    Schedulers need to be informed when a hardware queue is added or removed at runtime so they can allocate/free per-hardware queue data. So, replace the blk_mq_sched_init_hctx_data() helper, which only makes sense at init time, with .init_hctx() and .exit_hctx() hooks.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: Restart a single queue if tag sets are shared  [Bart Van Assche, 2017-04-07, 1 file, -10/+53]
    To improve scalability, if hardware queues are shared, restart a single hardware queue in round-robin fashion. Rename blk_mq_sched_restart_queues() to reflect the new semantics. Remove blk_mq_sched_mark_restart_queue() because this function has no callers. Remove flag QUEUE_FLAG_RESTART because this patch removes the code that uses this flag.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Hannes Reinecke <hare@suse.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
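A generic round-robin "restart at most one marked queue" loop might look like the sketch below. It is a simplified user-space illustration of the idea named in the commit message, not the patch itself: `struct toy_set`, `toy_restart_one()`, and the fixed queue count are assumptions.

```c
#include <stdbool.h>

#define TOY_NR_HCTX 4

/* Hypothetical set of hardware queues sharing one tag set. */
struct toy_set {
    bool needs_restart[TOY_NR_HCTX];
    int next;   /* index where the next search starts */
};

/*
 * Restart at most one marked hardware queue, starting the search just after
 * the queue restarted last time. Returns the restarted index, or -1 if none.
 */
static int toy_restart_one(struct toy_set *set)
{
    for (int n = 0; n < TOY_NR_HCTX; n++) {
        int i = (set->next + n) % TOY_NR_HCTX;

        if (set->needs_restart[i]) {
            set->needs_restart[i] = false;
            set->next = (i + 1) % TOY_NR_HCTX;
            return i;
        }
    }
    return -1;
}
```

Because the search resumes after the last restarted queue, a queue that is marked repeatedly cannot starve the others of restarts.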
* blk-mq-sched: fix crash in switch error path  [Omar Sandoval, 2017-04-07, 1 file, -2/+11]
    In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall back to the original scheduler. However, at this point, we've already torn down the original scheduler's tags, so this causes a crash. Doing the fallback like the legacy elevator path is much harder for mq, so fix it by just falling back to none, instead.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: set up scheduler tags when bringing up new queues  [Omar Sandoval, 2017-04-07, 1 file, -0/+22]
    If a new hardware queue is added at runtime, we don't allocate scheduler tags for it, leading to a crash. This hooks up the scheduler framework to blk_mq_{init,exit}_hctx() to make sure everything gets properly initialized/freed.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: refactor scheduler initialization  [Omar Sandoval, 2017-04-07, 1 file, -38/+44]
    Preparation cleanup for the next couple of fixes, push blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: use the right hctx when getting a driver tag fails  [Omar Sandoval, 2017-04-07, 1 file, -4/+5]
    While dispatching requests, if we fail to get a driver tag, we mark the hardware queue as waiting for a tag and put the requests on a hctx->dispatch list to be run later when a driver tag is freed. However, blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware queues if using a single-queue scheduler with a multiqueue device. If blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we are processing. This means we end up using the hardware queue of the previous request, which may or may not be the same as that of the current request. If it isn't, the wrong hardware queue will end up waiting for a tag, and the requests will be on the wrong dispatch list, leading to a hang.

    The fix is twofold:

    1. Make sure we save which hardware queue we were trying to get a request for in blk_mq_get_driver_tag() regardless of whether it succeeds or not.
    2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a blk_mq_hw_queue to make it clear that it must handle multiple hardware queues, since I've already messed this up on a couple of occasions.

    This didn't appear in testing with nvme and mq-deadline because nvme has more driver tags than the default number of scheduler tags. However, with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: move update of tags->rqs to __blk_mq_alloc_request()  [Omar Sandoval, 2017-03-02, 1 file, -2/+0]
    No functional difference, it just makes a little more sense to update the tag map where we actually allocate the tag.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Tested-by: Sagi Grimberg <sagi@grimberg.me>
* blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request  [Omar Sandoval, 2017-03-02, 1 file, -6/+5]
    blk_mq_alloc_request_hctx() allocates a driver request directly, unlike its blk_mq_alloc_request() counterpart. It also crashes because it doesn't update the tags->rqs map.

    Fix it by making it allocate a scheduler request.

    Reported-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Tested-by: Sagi Grimberg <sagi@grimberg.me>
* blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset  [Sagi Grimberg, 2017-03-02, 1 file, -1/+2]
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

    Modified by me to also check at driver tag allocation time if the original request was reserved, so we can be sure to allocate a properly reserved tag at that point in time, too.

    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: separate mark hctx and queue restart operations  [Omar Sandoval, 2017-02-23, 1 file, -12/+8]
    In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart() after we dispatch requests left over on our hardware queue dispatch list. This is so we'll go back and dispatch requests from the scheduler. In this case, it's only necessary to restart the hardware queue that we are running; there's no reason to run other hardware queues just because we are using shared tags.

    So, split out blk_mq_sched_mark_restart() into two operations, one for just the hardware queue and one for the whole request queue. The core code only needs the hctx variant, but I/O schedulers will want to use both.

    This also requires adjusting blk_mq_sched_restart_queues() to always check the queue restart flag, not just when using shared tags.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* block: get rid of blk-mq default scheduler choice Kconfig entries  [Jens Axboe, 2017-02-22, 1 file, -9/+0]
    The wording in the entries was poor and not understandable by even deities. Kill the selection for default block scheduler, and impose a policy with sane defaults.

    Architected-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers  [Jens Axboe, 2017-02-17, 1 file, -3/+12]
    Usually we don't ask the scheduler for work, if we already have leftovers on the dispatch list. This is done to leave work on the scheduler side for as long as possible, for proper merging. But if we do have work leftover but didn't dispatch anything, then we should ask the scheduler since we could potentially issue requests from that.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq-sched: don't add flushes to the head of requeue queue  [Jens Axboe, 2017-02-17, 1 file, -1/+1]
    If we are currently out of driver tags, we don't want to add a new flush (without a tag) to the head of the requeue list. We want to add it to the back, behind the others that are potentially also waiting for a tag.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq: pass bio to blk_mq_sched_get_rq_priv  [Paolo Valente, 2017-02-10, 1 file, -3/+5]
    bio is used in bfq-mq's get_rq_priv, to get the request group. We could pass directly the group here, but I thought that passing the bio was more general, giving the possibility to get other pieces of information if needed.

    Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* block: enumify ELEVATOR_*_MERGE  [Christoph Hellwig, 2017-02-08, 1 file, -18/+17]
    Switch these constants to an enum, and let the compiler ensure that all callers of blk_try_merge and elv_merge handle all potential values.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* block: free merged request in the caller  [Jens Axboe, 2017-02-03, 1 file, -3/+6]
    If we end up doing a request-to-request merge when we have completed a bio-to-request merge, we free the request from deep down in that path. For blk-mq-sched, the merge path has to hold the appropriate lock, but we don't need it for freeing the request. And in fact holding the lock is problematic, since we are now calling the mq sched put_rq_private() hook with the lock held. Other call paths do not hold this lock.

    Fix this inconsistency by ensuring that the caller frees a merged request. Then we can do it outside of the lock, making it both more efficient and fixing the blk-mq-sched problem of invoking parts of the scheduler with an unknown lock state.

    Reported-by: Paolo Valente <paolo.valente@linaro.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq-sched: bypass the scheduler for flushes entirely  [Omar Sandoval, 2017-02-03, 1 file, -2/+23]
    There's a weird inconsistency that flushes are mostly hidden from the scheduler, but it needs to be aware of them in ->insert_requests(). Instead of having every scheduler call blk_mq_sched_bypass_insert(), let's do it in the common framework.

    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* block: cleanup remaining manual checks for PREFLUSH|FUA  [Jens Axboe, 2017-01-27, 1 file, -1/+1]
    Use op_is_flush() where applicable.

    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()  [Jens Axboe, 2017-01-27, 1 file, -0/+58]
    Instead of letting the caller check this and handle the details of inserting a flush request, put the logic in the scheduler insertion function. This fixes direct flush insertion outside of the usual make_request_fn calls, like from dm via blk_insert_cloned_request().

    Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add a op_is_flush helper  [Christoph Hellwig, 2017-01-27, 1 file, -3/+2]
    This centralizes the checks for bios that need to go into the flush state machine.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()  [Jens Axboe, 2017-01-27, 1 file, -8/+15]
    When we invoke dispatch_requests(), the scheduler empties everything into the passed in list. This isn't always a good thing, since it means that we remove items that we could have potentially merged with.

    Change the function to dispatch a single request at a time. If we do that, we can back off exactly at the point where the device can't consume more IO, and leave the rest with the scheduler for better merging and future dispatch decision making.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Tested-by: Hannes Reinecke <hare@suse.com>
* blk-mq-sched: fix starvation for multiple hardware queues and shared tags  [Jens Axboe, 2017-01-27, 1 file, -0/+28]
    If we have both multiple hardware queues and shared tag map between devices, we need to ensure that we propagate the hardware queue restart bit higher up. This is because we can get into a situation where we don't have any IO pending on a hardware queue, yet we fail getting a tag to start new IO. If that happens, it's not enough to mark the hardware queue as needing a restart, we need to bubble that up to the higher level queue as well.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Tested-by: Hannes Reinecke <hare@suse.com>
* blk-mq-sched: check for successful allocation before assigning tag  [Jens Axboe, 2017-01-26, 1 file, -1/+2]
    We don't trigger this from the normal IO path, since we always use blocking allocations from there. But Bart saw it testing multipath dm, since that is a heavy user of atomic request allocations in the map and clone path.

    Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: don't lose flags passed in to blk_mq_alloc_request()  [Jens Axboe, 2017-01-26, 1 file, -1/+1]
    If we come in from blk_mq_alloc_request() with NOWAIT set in flags, we must ensure that we don't later overwrite that in blk_mq_sched_get_request(). Initialize alloc_data->flags before passing it in.

    Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: allow setting of default IO scheduler  [Jens Axboe, 2017-01-17, 1 file, -0/+20]
    Add Kconfig entries to manage what devices get assigned an MQ scheduler, and add a blk-mq flag for drivers to opt out of scheduling. The latter is useful for admin type queues that still allocate a blk-mq queue and tag set, but aren't used for normal IO.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq-sched: add framework for MQ capable IO schedulers  [Jens Axboe, 2017-01-17, 1 file, -0/+368]
    This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interface, and with a separate set of ops hooks for MQ devices.

    We split driver and scheduler tags, so we can run the scheduling independently of device queue depth.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Omar Sandoval <osandov@fb.com>