diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2019-09-18 01:57:47 +0200 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2019-09-18 01:57:47 +0200 |
commit | 7ad67ca5534ee7c958559c4ad610f05c4578e361 (patch) | |
tree | dc6b6a8a6b70b5f25b07bcdc06d8e77e705f6822 | |
parent | Merge tag 'for-5.4/libata-2019-09-15' of git://git.kernel.dk/linux-block (diff) | |
parent | null_blk: format pr_* logs with pr_fmt (diff) | |
download | linux-7ad67ca5534ee7c958559c4ad610f05c4578e361.tar.xz linux-7ad67ca5534ee7c958559c4ad610f05c4578e361.zip |
Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Two NVMe pull requests:
- ana log parse fix from Anton
- nvme quirks support for Apple devices from Ben
- fix missing bio completion tracing for multipath stack devices
from Hannes and Mikhail
- IP TOS settings for nvme rdma and tcp transports from Israel
- rq_dma_dir cleanups from Israel
- tracing for Get LBA Status command from Minwoo
- Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
- Some consolidation between the fabrics transports for handling
the CAP register
- reset race with ns scanning fix for fabrics (move fabrics
commands to a dedicated request queue with a different lifetime
from the admin request queue)."
- controller reset and namespace scan races fixes
- nvme discovery log change uevent support
- naming improvements from Keith
- multiple discovery controllers reject fix from James
- some regular cleanups from various people
- Series fixing (and re-fixing) null_blk debug printing and nr_devices
checks (André)
- A few pull requests from Song, with fixes from Andy, Guoqing,
Guilherme, Neil, Nigel, and Yufen.
- REQ_OP_ZONE_RESET_ALL support (Chaitanya)
- Bio merge handling unification (Christoph)
- Pick default elevator correctly for devices with special needs
(Damien)
- Block stats fixes (Hou)
- Timeout and support devices nbd fixes (Mike)
- Series fixing races around elevator switching and device add/remove
(Ming)
- sed-opal cleanups (Revanth)
- Per device weight support for BFQ (Fam)
- Support for blk-iocost, a new model that can properly account cost of
IO workloads. (Tejun)
- blk-cgroup writeback fixes (Tejun)
- paride queue init fixes (zhengbin)
- blk_set_runtime_active() cleanup (Stanley)
- Block segment mapping optimizations (Bart)
- lightnvm fixes (Hans/Minwoo/YueHaibing)
- Various little fixes and cleanups
* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
null_blk: format pr_* logs with pr_fmt
null_blk: match the type of parameter nr_devices
null_blk: do not fail the module load with zero devices
block: also check RQF_STATS in blk_mq_need_time_stamp()
block: make rq sector size accessible for block stats
bfq: Fix bfq linkage error
raid5: use bio_end_sector in r5_next_bio
raid5: remove STRIPE_OPS_REQ_PENDING
md: add feature flag MD_FEATURE_RAID0_LAYOUT
md/raid0: avoid RAID0 data corruption due to layout confusion.
raid5: don't set STRIPE_HANDLE to stripe which is in batch list
raid5: don't increment read_errors on EILSEQ return
nvmet: fix a wrong error status returned in error log page
nvme: send discovery log page change events to userspace
nvme: add uevent variables for controller devices
nvme: enable aen regardless of the presence of I/O queues
nvme-fabrics: allow discovery subsystems accept a kato
nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
nvme: Remove redundant assignment of cq vector
nvme: Assign subsys instance from first ctrl
...
107 files changed, 5895 insertions, 1283 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5f1c266131b0..0fa8c0e615c2 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1469,6 +1469,103 @@ IO Interface Files 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 + io.cost.qos + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the Quality of Service of the IO cost + model based controller (CONFIG_BLK_CGROUP_IOCOST) which + currently implements "io.weight" proportional control. Lines + are keyed by $MAJ:$MIN device numbers and not ordered. The + line for a given device is populated on the first write for + the device on "io.cost.qos" or "io.cost.model". The following + nested keys are defined. + + ====== ===================================== + enable Weight-based control enable + ctrl "auto" or "user" + rpct Read latency percentile [0, 100] + rlat Read latency threshold + wpct Write latency percentile [0, 100] + wlat Write latency threshold + min Minimum scaling percentage [1, 10000] + max Maximum scaling percentage [1, 10000] + ====== ===================================== + + The controller is disabled by default and can be enabled by + setting "enable" to 1. "rpct" and "wpct" parameters default + to zero and the controller uses internal device saturation + state to adjust the overall IO rate between "min" and "max". + + When a better control quality is needed, latency QoS + parameters can be configured. For example:: + + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 + + shows that on sdb, the controller is enabled, will consider + the device saturated if the 95th percentile of read completion + latencies is above 75ms or write 150ms, and adjust the overall + IO issue rate between 50% and 150% accordingly. + + The lower the saturation point, the better the latency QoS at + the cost of aggregate bandwidth. The narrower the allowed + adjustment range between "min" and "max", the more conformant + to the cost model the IO behavior. Note that the IO issue + base rate may be far off from 100% and setting "min" and "max" + blindly can lead to a significant loss of device capacity or + control quality. "min" and "max" are useful for regulating + devices which show wide temporary behavior changes - e.g. a + ssd which accepts writes at the line speed for a while and + then completely stalls for multiple seconds. + + When "ctrl" is "auto", the parameters are controlled by the + kernel and may change automatically. Setting "ctrl" to "user" + or setting any of the percentile and latency parameters puts + it into "user" mode and disables the automatic changes. The + automatic mode can be restored by setting "ctrl" to "auto". + + io.cost.model + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the cost model of the IO cost model based + controller (CONFIG_BLK_CGROUP_IOCOST) which currently + implements "io.weight" proportional control. Lines are keyed + by $MAJ:$MIN device numbers and not ordered. The line for a + given device is populated on the first write for the device on + "io.cost.qos" or "io.cost.model". The following nested keys + are defined. + + ===== ================================ + ctrl "auto" or "user" + model The cost model in use - "linear" + ===== ================================ + + When "ctrl" is "auto", the kernel may change all parameters + dynamically. When "ctrl" is set to "user" or any other + parameters are written to, "ctrl" become "user" and the + automatic changes are disabled. + + When "model" is "linear", the following model parameters are + defined. + + ============= ======================================== + [r|w]bps The maximum sequential IO throughput + [r|w]seqiops The maximum 4k sequential IOs per second + [r|w]randiops The maximum 4k random IOs per second + ============= ======================================== + + From the above, the builtin linear model determines the base + costs of a sequential and random IO and the cost coefficient + for the IO size. While simple, this model can cover most + common device classes acceptably. + + The IO cost model isn't expected to be accurate in absolute + sense and is scaled to the device behavior dynamically. + + If needed, tools/cgroup/iocost_coef_gen.py can be used to + generate device-specific coefficients. + io.weight A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100". diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3e2d22eb5e59..6ef205fd7c97 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1201,12 +1201,6 @@ See comment before function elanfreq_setup() in arch/x86/kernel/cpu/cpufreq/elanfreq.c. - elevator= [IOSCHED] - Format: { "mq-deadline" | "kyber" | "bfq" } - See Documentation/block/deadline-iosched.rst, - Documentation/block/kyber-iosched.rst and - Documentation/block/bfq-iosched.rst for details. - elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] Specifies physical address of start of kernel core image elf header and optionally the size. Generally diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index 4f18456dd3b1..baeeba8762ae 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following: (based on an earlier one from Gilad Ben-Yossef) that reduces or even eliminates vmstat overhead for some workloads at https://lkml.org/lkml/2013/9/4/379. - e. Boot with "elevator=noop" to avoid workqueue use by - the block layer. - f. If running on high-end powerpc servers, build with + e. If running on high-end powerpc servers, build with CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS daemon from running on each CPU every second or so. (This will require editing Kconfig files and will defeat @@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following: due to the rtas_event_scan() function. WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - g. If running on Cell Processor, build your kernel with + f. If running on Cell Processor, build your kernel with CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from spu_gov_work(). WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - h. If running on PowerMAC, build your kernel with + g. If running on PowerMAC, build your kernel with CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst index 31451d80783c..edbbab2f12f8 100644 --- a/Documentation/block/null_blk.rst +++ b/Documentation/block/null_blk.rst @@ -1,19 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0 + ======================== Null block device driver ======================== -1. Overview -=========== +Overview +======== -The null block device (/dev/nullb*) is used for benchmarking the various +The null block device (``/dev/nullb*``) is used for benchmarking the various block-layer implementations. It emulates a block device of X gigabytes in size. -The following instances are possible: - - Single-queue block-layer - - - Request-based. - - Single submission queue per device. - - Implements IO scheduling algorithms (CFQ, Deadline, noop). +It does not execute any read/write operation, just mark them as complete in +the request queue. The following instances are possible: Multi-queue block-layer @@ -27,15 +24,15 @@ The following instances are possible: All of them have a completion queue for each core in the system. -2. Module parameters applicable for all instances -================================================= +Module parameters +================= queue_mode=[0-2]: Default: 2-Multi-queue Selects which block-layer the module should instantiate with. = ============ 0 Bio-based - 1 Single-queue + 1 Single-queue (deprecated) 2 Multi-queue = ============ @@ -67,7 +64,7 @@ irqmode=[0-2]: Default: 1-Soft-irq completion_nsec=[ns]: Default: 10,000ns Combined with irqmode=2 (timer). The time each completion event must wait. -submit_queues=[1..nr_cpus]: +submit_queues=[1..nr_cpus]: Default: 1 The number of submission queues attached to the device driver. If unset, it defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module parameter is 1. @@ -75,9 +72,11 @@ submit_queues=[1..nr_cpus]: hw_queue_depth=[0..qdepth]: Default: 64 The hardware queue depth of the device. -III: Multi-queue specific parameters +Multi-queue specific parameters +------------------------------- use_per_node_hctx=[0/1]: Default: 0 + Number of hardware context queues. = ===================================================================== 0 The number of submit queues are set to the value of the submit_queues @@ -87,6 +86,7 @@ use_per_node_hctx=[0/1]: Default: 0 = ===================================================================== no_sched=[0/1]: Default: 0 + Enable/disable the io scheduler. = ====================================== 0 nullb* use default blk-mq io scheduler @@ -94,6 +94,7 @@ no_sched=[0/1]: Default: 0 = ====================================== blocking=[0/1]: Default: 0 + Blocking behavior of the request queue. = =============================================================== 0 Register as a non-blocking blk-mq driver device. @@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0 = =============================================================== shared_tags=[0/1]: Default: 0 + Sharing tags between devices. = ================================================================ 0 Tag set is not shared. @@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0 = ================================================================ zoned=[0/1]: Default: 0 + Device is a random-access or a zoned block device. = ====================================================================== 0 Block device is exposed as a random-access block device. diff --git a/Documentation/block/switching-sched.rst b/Documentation/block/switching-sched.rst index 42042417380e..520f6b857544 100644 --- a/Documentation/block/switching-sched.rst +++ b/Documentation/block/switching-sched.rst @@ -2,10 +2,6 @@ Switching Scheduler =================== -To choose IO schedulers at boot time, use the argument 'elevator=deadline'. -'noop' and 'cfq' (the default) are also available. IO schedulers are assigned -globally at boot time only presently. - Each io queue has a set of io scheduler tunables associated with it. These tunables control how the io scheduler works. You can find these entries in:: diff --git a/block/Kconfig b/block/Kconfig index 8b5f8e560eb4..41c0917ce622 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -26,6 +26,9 @@ menuconfig BLOCK if BLOCK +config BLK_RQ_ALLOC_TIME + bool + config BLK_SCSI_REQUEST bool @@ -132,6 +135,16 @@ config BLK_CGROUP_IOLATENCY Note, this is an experimental interface and could be changed someday. +config BLK_CGROUP_IOCOST + bool "Enable support for cost model based cgroup IO controller" + depends on BLK_CGROUP=y + select BLK_RQ_ALLOC_TIME + ---help--- + Enabling this option enables the .weight interface for cost + model based proportional IO control. The IO controller + distributes IO capacity between different groups based on + their share of the overall weight distribution. + config BLK_WBT_MQ bool "Multiqueue writeback throttling" default y diff --git a/block/Makefile b/block/Makefile index eee1b4ceecf9..9ef57ace90d4 100644 --- a/block/Makefile +++ b/block/Makefile @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o +obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 0f6cd688924f..86a607cf19a1 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -501,11 +501,12 @@ static void bfq_cpd_free(struct blkcg_policy_data *cpd) kfree(cpd_to_bfqgd(cpd)); } -static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) +static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, struct request_queue *q, + struct blkcg *blkcg) { struct bfq_group *bfqg; - bfqg = kzalloc_node(sizeof(*bfqg), gfp, node); + bfqg = kzalloc_node(sizeof(*bfqg), gfp, q->node); if (!bfqg) return NULL; @@ -904,7 +905,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd) bfq_end_wr_async_queues(bfqd, bfqd->root_group); } -static int bfq_io_show_weight(struct seq_file *sf, void *v) +static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v) { struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); @@ -918,6 +919,60 @@ static int bfq_io_show_weight(struct seq_file *sf, void *v) return 0; } +static u64 bfqg_prfill_weight_device(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + + if (!bfqg->entity.dev_weight) + return 0; + return __blkg_prfill_u64(sf, pd, bfqg->entity.dev_weight); +} + +static int bfq_io_show_weight(struct seq_file *sf, void *v) +{ + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); + + seq_printf(sf, "default %u\n", bfqgd->weight); + blkcg_print_blkgs(sf, blkcg, bfqg_prfill_weight_device, + &blkcg_policy_bfq, 0, false); + return 0; +} + +static void bfq_group_set_weight(struct bfq_group *bfqg, u64 weight, u64 dev_weight) +{ + weight = dev_weight ?: weight; + + bfqg->entity.dev_weight = dev_weight; + /* + * Setting the prio_changed flag of the entity + * to 1 with new_weight == weight would re-set + * the value of the weight to its ioprio mapping. + * Set the flag only if necessary. + */ + if ((unsigned short)weight != bfqg->entity.new_weight) { + bfqg->entity.new_weight = (unsigned short)weight; + /* + * Make sure that the above new value has been + * stored in bfqg->entity.new_weight before + * setting the prio_changed flag. In fact, + * this flag may be read asynchronously (in + * critical sections protected by a different + * lock than that held here), and finding this + * flag set may cause the execution of the code + * for updating parameters whose value may + * depend also on bfqg->entity.new_weight (in + * __bfq_entity_update_weight_prio). + * This barrier makes sure that the new value + * of bfqg->entity.new_weight is correctly + * seen in that code. + */ + smp_wmb(); + bfqg->entity.prio_changed = 1; + } +} + static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css, struct cftype *cftype, u64 val) @@ -936,55 +991,72 @@ static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css, hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) { struct bfq_group *bfqg = blkg_to_bfqg(blkg); - if (!bfqg) - continue; - /* - * Setting the prio_changed flag of the entity - * to 1 with new_weight == weight would re-set - * the value of the weight to its ioprio mapping. - * Set the flag only if necessary. - */ - if ((unsigned short)val != bfqg->entity.new_weight) { - bfqg->entity.new_weight = (unsigned short)val; - /* - * Make sure that the above new value has been - * stored in bfqg->entity.new_weight before - * setting the prio_changed flag. In fact, - * this flag may be read asynchronously (in - * critical sections protected by a different - * lock than that held here), and finding this - * flag set may cause the execution of the code - * for updating parameters whose value may - * depend also on bfqg->entity.new_weight (in - * __bfq_entity_update_weight_prio). - * This barrier makes sure that the new value - * of bfqg->entity.new_weight is correctly - * seen in that code. - */ - smp_wmb(); - bfqg->entity.prio_changed = 1; - } + if (bfqg) + bfq_group_set_weight(bfqg, val, 0); } spin_unlock_irq(&blkcg->lock); return ret; } -static ssize_t bfq_io_set_weight(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) +static ssize_t bfq_io_set_device_weight(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) { - u64 weight; - /* First unsigned long found in the file is used */ - int ret = kstrtoull(strim(buf), 0, &weight); + int ret; + struct blkg_conf_ctx ctx; + struct blkcg *blkcg = css_to_blkcg(of_css(of)); + struct bfq_group *bfqg; + u64 v; + ret = blkg_conf_prep(blkcg, &blkcg_policy_bfq, buf, &ctx); if (ret) return ret; - ret = bfq_io_set_weight_legacy(of_css(of), NULL, weight); + if (sscanf(ctx.body, "%llu", &v) == 1) { + /* require "default" on dfl */ + ret = -ERANGE; + if (!v) + goto out; + } else if (!strcmp(strim(ctx.body), "default")) { + v = 0; + } else { + ret = -EINVAL; + goto out; + } + + bfqg = blkg_to_bfqg(ctx.blkg); + + ret = -ERANGE; + if (!v || (v >= BFQ_MIN_WEIGHT && v <= BFQ_MAX_WEIGHT)) { + bfq_group_set_weight(bfqg, bfqg->entity.weight, v); + ret = 0; + } +out: + blkg_conf_finish(&ctx); return ret ?: nbytes; } +static ssize_t bfq_io_set_weight(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + char *endp; + int ret; + u64 v; + + buf = strim(buf); + + /* "WEIGHT" or "default WEIGHT" sets the default weight */ + v = simple_strtoull(buf, &endp, 0); + if (*endp == '\0' || sscanf(buf, "default %llu", &v) == 1) { + ret = bfq_io_set_weight_legacy(of_css(of), NULL, v); + return ret ?: nbytes; + } + + return bfq_io_set_device_weight(of, buf, nbytes, off); +} + #ifdef CONFIG_BFQ_CGROUP_DEBUG static int bfqg_print_stat(struct seq_file *sf, void *v) { @@ -1141,9 +1213,15 @@ struct cftype bfq_blkcg_legacy_files[] = { { .name = "bfq.weight", .flags = CFTYPE_NOT_ON_ROOT, - .seq_show = bfq_io_show_weight, + .seq_show = bfq_io_show_weight_legacy, .write_u64 = bfq_io_set_weight_legacy, }, + { + .name = "bfq.weight_device", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = bfq_io_show_weight, + .write = bfq_io_set_weight, + }, /* statistics, covers only the tasks in the bfqg */ { diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index e80adf822bbe..5d1a519640f6 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -168,6 +168,9 @@ struct bfq_entity { /* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */ int budget; + /* device weight, if non-zero, it overrides the default weight of + * bfq_group_data */ + int dev_weight; /* weight of the queue */ int weight; /* next weight if a change is in progress */ diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index c9ba225081ce..05f0bf4a1144 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -744,6 +744,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st, } #endif + /* Matches the smp_wmb() in bfq_group_set_weight. */ + smp_rmb(); old_st->wsum -= entity->weight; if (entity->new_weight != entity->orig_weight) { diff --git a/block/bio.c b/block/bio.c index 299a0e7651ec..8f0ed6228fc5 100644 --- a/block/bio.c +++ b/block/bio.c @@ -646,25 +646,20 @@ static inline bool page_is_mergeable(const struct bio_vec *bv, return true; } -/* - * Check if the @page can be added to the current segment(@bv), and make - * sure to call it only if page_is_mergeable(@bv, @page) is true - */ -static bool can_add_page_to_seg(struct request_queue *q, - struct bio_vec *bv, struct page *page, unsigned len, - unsigned offset) +static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio, + struct page *page, unsigned len, unsigned offset, + bool *same_page) { + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; unsigned long mask = queue_segment_boundary(q); phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset; phys_addr_t addr2 = page_to_phys(page) + offset + len - 1; if ((addr1 | mask) != (addr2 | mask)) return false; - if (bv->bv_len + len > queue_max_segment_size(q)) return false; - - return true; + return __bio_try_merge_page(bio, page, len, offset, same_page); } /** @@ -674,7 +669,7 @@ static bool can_add_page_to_seg(struct request_queue *q, * @page: page to add * @len: vec entry length * @offset: vec entry offset - * @put_same_page: put the page if it is same with last added page + * @same_page: return if the merge happen inside the same page * * Attempt to add a page to the bio_vec maplist. This can fail for a * number of reasons, such as the bio being full or target block device @@ -685,10 +680,9 @@ static bool can_add_page_to_seg(struct request_queue *q, */ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset, - bool put_same_page) + bool *same_page) { struct bio_vec *bvec; - bool same_page = false; /* * cloned bio must not modify vec list @@ -700,28 +694,16 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, return 0; if (bio->bi_vcnt > 0) { - bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; - - if (page == bvec->bv_page && - offset == bvec->bv_offset + bvec->bv_len) { - if (put_same_page) - put_page(page); - bvec->bv_len += len; - goto done; - } + if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page)) + return len; /* - * If the queue doesn't support SG gaps and adding this - * offset would create a gap, disallow it. + * If the queue doesn't support SG gaps and adding this segment + * would create a gap, disallow it. */ + bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; if (bvec_gap_to_prev(q, bvec, offset)) return 0; - - if (page_is_mergeable(bvec, page, len, offset, &same_page) && - can_add_page_to_seg(q, bvec, page, len, offset)) { - bvec->bv_len += len; - goto done; - } } if (bio_full(bio, len)) @@ -735,7 +717,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, bvec->bv_len = len; bvec->bv_offset = offset; bio->bi_vcnt++; - done: bio->bi_iter.bi_size += len; return len; } @@ -743,7 +724,8 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset) { - return __bio_add_pc_page(q, bio, page, len, offset, false); + bool same_page = false; + return __bio_add_pc_page(q, bio, page, len, offset, &same_page); } EXPORT_SYMBOL(bio_add_pc_page); @@ -806,6 +788,9 @@ void __bio_add_page(struct bio *bio, struct page *page, bio->bi_iter.bi_size += len; bio->bi_vcnt++; + + if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page))) + bio_set_flag(bio, BIO_WORKINGSET); } EXPORT_SYMBOL_GPL(__bio_add_page); @@ -1384,13 +1369,17 @@ struct bio *bio_map_user_iov(struct request_queue *q, for (j = 0; j < npages; j++) { struct page *page = pages[j]; unsigned int n = PAGE_SIZE - offs; + bool same_page = false; if (n > bytes) n = bytes; if (!__bio_add_pc_page(q, bio, page, n, offs, - true)) + &same_page)) { + if (same_page) + put_page(page); break; + } added += n; bytes -= n; @@ -1521,7 +1510,6 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len, bio->bi_end_io = bio_map_kern_endio; return bio; } -EXPORT_SYMBOL(bio_map_kern); static void bio_copy_kern_endio(struct bio *bio) { @@ -1842,8 +1830,8 @@ EXPORT_SYMBOL(bio_endio); * @bio, and updates @bio to represent the remaining sectors. * * Unless this is a discard request the newly allocated bio will point - * to @bio's bi_io_vec; it is the caller's responsibility to ensure that - * @bio is not freed before the split. + * to @bio's bi_io_vec. It is the caller's responsibility to ensure that + * neither @bio nor @bs are freed before the split bio. */ struct bio *bio_split(struct bio *bio, int sectors, gfp_t gfp, struct bio_set *bs) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 55a7dc227dfb..b6f20be0fc78 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -175,7 +175,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q, continue; /* alloc per-policy data and attach it to blkg */ - pd = pol->pd_alloc_fn(gfp_mask, q->node); + pd = pol->pd_alloc_fn(gfp_mask, q, blkcg); if (!pd) goto err_free; @@ -755,6 +755,44 @@ static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg, /** * blkg_conf_prep - parse and prepare for per-blkg config update + * @inputp: input string pointer + * + * Parse the device node prefix part, MAJ:MIN, of per-blkg config update + * from @input and get and return the matching gendisk. *@inputp is + * updated to point past the device node prefix. Returns an ERR_PTR() + * value on error. + * + * Use this function iff blkg_conf_prep() can't be used for some reason. + */ +struct gendisk *blkcg_conf_get_disk(char **inputp) +{ + char *input = *inputp; + unsigned int major, minor; + struct gendisk *disk; + int key_len, part; + + if (sscanf(input, "%u:%u%n", &major, &minor, &key_len) != 2) + return ERR_PTR(-EINVAL); + + input += key_len; + if (!isspace(*input)) + return ERR_PTR(-EINVAL); + input = skip_spaces(input); + + disk = get_gendisk(MKDEV(major, minor), &part); + if (!disk) + return ERR_PTR(-ENODEV); + if (part) { + put_disk_and_module(disk); + return ERR_PTR(-ENODEV); + } + + *inputp = input; + return disk; +} + +/** + * blkg_conf_prep - parse and prepare for per-blkg config update * @blkcg: target block cgroup * @pol: target policy * @input: input string @@ -772,25 +810,11 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, struct gendisk *disk; struct request_queue *q; struct blkcg_gq *blkg; - unsigned int major, minor; - int key_len, part, ret; - char *body; - - if (sscanf(input, "%u:%u%n", &major, &minor, &key_len) != 2) - return -EINVAL; - - body = input + key_len; - if (!isspace(*body)) - return -EINVAL; - body = skip_spaces(body); + int ret; - disk = get_gendisk(MKDEV(major, minor), &part); - if (!disk) - return -ENODEV; - if (part) { - ret = -ENODEV; - goto fail; - } + disk = blkcg_conf_get_disk(&input); + if (IS_ERR(disk)) + return PTR_ERR(disk); q = disk->queue; @@ -856,7 +880,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, success: ctx->disk = disk; ctx->blkg = blkg; - ctx->body = body; + ctx->body = input; return 0; fail_unlock: @@ -876,6 +900,7 @@ fail: } return ret; } +EXPORT_SYMBOL_GPL(blkg_conf_prep); /** * blkg_conf_finish - finish up per-blkg config update @@ -891,6 +916,7 @@ void blkg_conf_finish(struct blkg_conf_ctx *ctx) rcu_read_unlock(); put_disk_and_module(ctx->disk); } +EXPORT_SYMBOL_GPL(blkg_conf_finish); static int blkcg_print_stat(struct seq_file *sf, void *v) { @@ -1346,7 +1372,7 @@ int blkcg_activate_policy(struct request_queue *q, blk_mq_freeze_queue(q); pd_prealloc: if (!pd_prealloc) { - pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node); + pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q, &blkcg_root); if (!pd_prealloc) { ret = -ENOMEM; goto out_bypass_end; @@ -1362,7 +1388,7 @@ pd_prealloc: if (blkg->pd[pol->plid]) continue; - pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node); + pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q, &blkcg_root); if (!pd) swap(pd, pd_prealloc); if (!pd) { @@ -1475,7 +1501,8 @@ int blkcg_policy_register(struct blkcg_policy *pol) blkcg->cpd[pol->plid] = cpd; cpd->blkcg = blkcg; cpd->plid = pol->plid; - pol->cpd_init_fn(cpd); + if (pol->cpd_init_fn) + pol->cpd_init_fn(cpd); } } diff --git a/block/blk-core.c b/block/blk-core.c index d0cc6e14d2f0..875e8d105067 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -36,6 +36,7 @@ #include <linux/blk-cgroup.h> #include <linux/debugfs.h> #include <linux/bpf.h> +#include <linux/psi.h> #define CREATE_TRACE_POINTS #include <trace/events/block.h> @@ -129,6 +130,7 @@ static const char *const blk_op_name[] = { REQ_OP_NAME(DISCARD), REQ_OP_NAME(SECURE_ERASE), REQ_OP_NAME(ZONE_RESET), + REQ_OP_NAME(ZONE_RESET_ALL), REQ_OP_NAME(WRITE_SAME), REQ_OP_NAME(WRITE_ZEROES), REQ_OP_NAME(SCSI_IN), @@ -344,7 +346,8 @@ void blk_cleanup_queue(struct request_queue *q) /* * Drain all requests queued before DYING marking. Set DEAD flag to - * prevent that q->request_fn() gets invoked after draining finished. + * prevent that blk_mq_run_hw_queues() accesses the hardware queues + * after draining finished. */ blk_freeze_queue(q); @@ -479,7 +482,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) if (!q) return NULL; - INIT_LIST_HEAD(&q->queue_head); q->last_merge = NULL; q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask); @@ -518,6 +520,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) mutex_init(&q->blk_trace_mutex); #endif mutex_init(&q->sysfs_lock); + mutex_init(&q->sysfs_dir_lock); spin_lock_init(&q->queue_lock); init_waitqueue_head(&q->mq_freeze_wq); @@ -601,6 +604,7 @@ bool bio_attempt_back_merge(struct request *req, struct bio *bio, return false; trace_block_bio_backmerge(req->q, req, bio); + rq_qos_merge(req->q, req, bio); if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) blk_rq_set_mixed_merge(req); @@ -622,6 +626,7 @@ bool bio_attempt_front_merge(struct request *req, struct bio *bio, return false; trace_block_bio_frontmerge(req->q, req, bio); + rq_qos_merge(req->q, req, bio); if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) blk_rq_set_mixed_merge(req); @@ -647,6 +652,8 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req, blk_rq_get_max_sectors(req, blk_rq_pos(req))) goto no_merge; + rq_qos_merge(q, req, bio); + req->biotail->bi_next = bio; req->biotail = bio; req->__data_len += bio->bi_iter.bi_size; @@ -931,6 +938,10 @@ generic_make_request_checks(struct bio *bio) if (!blk_queue_is_zoned(q)) goto not_supported; break; + case REQ_OP_ZONE_RESET_ALL: + if (!blk_queue_is_zoned(q) || !blk_queue_zone_resetall(q)) + goto not_supported; + break; case REQ_OP_WRITE_ZEROES: if (!q->limits.max_write_zeroes_sectors) goto not_supported; @@ -1128,6 +1139,10 @@ EXPORT_SYMBOL_GPL(direct_make_request); */ blk_qc_t submit_bio(struct bio *bio) { + bool workingset_read = false; + unsigned long pflags; + blk_qc_t ret; + if (blkcg_punt_bio_submit(bio)) return BLK_QC_T_NONE; @@ -1146,6 +1161,8 @@ blk_qc_t submit_bio(struct bio *bio) if (op_is_write(bio_op(bio))) { count_vm_events(PGPGOUT, count); } else { + if (bio_flagged(bio, BIO_WORKINGSET)) + workingset_read = true; task_io_account_read(bio->bi_iter.bi_size); count_vm_events(PGPGIN, count); } @@ -1160,7 +1177,21 @@ blk_qc_t submit_bio(struct bio *bio) } } - return generic_make_request(bio); + /* + * If we're reading data that is part of the userspace + * workingset, count submission time as memory stall. When the + * device is congested, or the submitting cgroup IO-throttled, + * submission can be a significant part of overall IO time. + */ + if (workingset_read) + psi_memstall_enter(&pflags); + + ret = generic_make_request(bio); + + if (workingset_read) + psi_memstall_leave(&pflags); + + return ret; } EXPORT_SYMBOL(submit_bio); diff --git a/block/blk-iocost.c b/block/blk-iocost.c new file mode 100644 index 000000000000..3b39deb8b9f8 --- /dev/null +++ b/block/blk-iocost.c @@ -0,0 +1,2457 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * IO cost model based controller. + * + * Copyright (C) 2019 Tejun Heo <tj@kernel.org> + * Copyright (C) 2019 Andy Newell <newella@fb.com> + * Copyright (C) 2019 Facebook + * + * One challenge of controlling IO resources is the lack of trivially + * observable cost metric. This is distinguished from CPU and memory where + * wallclock time and the number of bytes can serve as accurate enough + * approximations. + * + * Bandwidth and iops are the most commonly used metrics for IO devices but + * depending on the type and specifics of the device, different IO patterns + * easily lead to multiple orders of magnitude variations rendering them + * useless for the purpose of IO capacity distribution. While on-device + * time, with a lot of clutches, could serve as a useful approximation for + * non-queued rotational devices, this is no longer viable with modern + * devices, even the rotational ones. + * + * While there is no cost metric we can trivially observe, it isn't a + * complete mystery. For example, on a rotational device, seek cost + * dominates while a contiguous transfer contributes a smaller amount + * proportional to the size. If we can characterize at least the relative + * costs of these different types of IOs, it should be possible to + * implement a reasonable work-conserving proportional IO resource + * distribution. + * + * 1. IO Cost Model + * + * IO cost model estimates the cost of an IO given its basic parameters and + * history (e.g. the end sector of the last IO). The cost is measured in + * device time. If a given IO is estimated to cost 10ms, the device should + * be able to process ~100 of those IOs in a second. + * + * Currently, there's only one builtin cost model - linear. Each IO is + * classified as sequential or random and given a base cost accordingly. + * On top of that, a size cost proportional to the length of the IO is + * added. While simple, this model captures the operational + * characteristics of a wide varienty of devices well enough. Default + * paramters for several different classes of devices are provided and the + * parameters can be configured from userspace via + * /sys/fs/cgroup/io.cost.model. + * + * If needed, tools/cgroup/iocost_coef_gen.py can be used to generate + * device-specific coefficients. + * + * If needed, tools/cgroup/iocost_coef_gen.py can be used to generate + * device-specific coefficients. + * + * 2. Control Strategy + * + * The device virtual time (vtime) is used as the primary control metric. + * The control strategy is composed of the following three parts. + * + * 2-1. Vtime Distribution + * + * When a cgroup becomes active in terms of IOs, its hierarchical share is + * calculated. Please consider the following hierarchy where the numbers + * inside parentheses denote the configured weights. + * + * root + * / \ + * A (w:100) B (w:300) + * / \ + * A0 (w:100) A1 (w:100) + * + * If B is idle and only A0 and A1 are actively issuing IOs, as the two are + * of equal weight, each gets 50% share. If then B starts issuing IOs, B + * gets 300/(100+300) or 75% share, and A0 and A1 equally splits the rest, + * 12.5% each. The distribution mechanism only cares about these flattened + * shares. They're called hweights (hierarchical weights) and always add + * upto 1 (HWEIGHT_WHOLE). + * + * A given cgroup's vtime runs slower in inverse proportion to its hweight. + * For example, with 12.5% weight, A0's time runs 8 times slower (100/12.5) + * against the device vtime - an IO which takes 10ms on the underlying + * device is considered to take 80ms on A0. + * + * This constitutes the basis of IO capacity distribution. Each cgroup's + * vtime is running at a rate determined by its hweight. A cgroup tracks + * the vtime consumed by past IOs and can issue a new IO iff doing so + * wouldn't outrun the current device vtime. Otherwise, the IO is + * suspended until the vtime has progressed enough to cover it. + * + * 2-2. Vrate Adjustment + * + * It's unrealistic to expect the cost model to be perfect. There are too + * many devices and even on the same device the overall performance + * fluctuates depending on numerous factors such as IO mixture and device + * internal garbage collection. The controller needs to adapt dynamically. + * + * This is achieved by adjusting the overall IO rate according to how busy + * the device is. If the device becomes overloaded, we're sending down too + * many IOs and should generally slow down. If there are waiting issuers + * but the device isn't saturated, we're issuing too few and should + * generally speed up. + * + * To slow down, we lower the vrate - the rate at which the device vtime + * passes compared to the wall clock. For example, if the vtime is running + * at the vrate of 75%, all cgroups added up would only be able to issue + * 750ms worth of IOs per second, and vice-versa for speeding up. + * + * Device business is determined using two criteria - rq wait and + * completion latencies. + * + * When a device gets saturated, the on-device and then the request queues + * fill up and a bio which is ready to be issued has to wait for a request + * to become available. When this delay becomes noticeable, it's a clear + * indication that the device is saturated and we lower the vrate. This + * saturation signal is fairly conservative as it only triggers when both + * hardware and software queues are filled up, and is used as the default + * busy signal. + * + * As devices can have deep queues and be unfair in how the queued commands + * are executed, soley depending on rq wait may not result in satisfactory + * control quality. For a better control quality, completion latency QoS + * parameters can be configured so that the device is considered saturated + * if N'th percentile completion latency rises above the set point. + * + * The completion latency requirements are a function of both the + * underlying device characteristics and the desired IO latency quality of + * service. There is an inherent trade-off - the tighter the latency QoS, + * the higher the bandwidth lossage. Latency QoS is disabled by default + * and can be set through /sys/fs/cgroup/io.cost.qos. + * + * 2-3. Work Conservation + * + * Imagine two cgroups A and B with equal weights. A is issuing a small IO + * periodically while B is sending out enough parallel IOs to saturate the + * device on its own. Let's say A's usage amounts to 100ms worth of IO + * cost per second, i.e., 10% of the device capacity. The naive + * distribution of half and half would lead to 60% utilization of the + * device, a significant reduction in the total amount of work done + * compared to free-for-all competition. This is too high a cost to pay + * for IO control. + * + * To conserve the total amount of work done, we keep track of how much + * each active cgroup is actually using and yield part of its weight if + * there are other cgroups which can make use of it. In the above case, + * A's weight will be lowered so that it hovers above the actual usage and + * B would be able to use the rest. + * + * As we don't want to penalize a cgroup for donating its weight, the + * surplus weight adjustment factors in a margin and has an immediate + * snapback mechanism in case the cgroup needs more IO vtime for itself. + * + * Note that adjusting down surplus weights has the same effects as + * accelerating vtime for other cgroups and work conservation can also be + * implemented by adjusting vrate dynamically. However, squaring who can + * donate and should take back how much requires hweight propagations + * anyway making it easier to implement and understand as a separate + * mechanism. + * + * 3. Monitoring + * + * Instead of debugfs or other clumsy monitoring mechanisms, this + * controller uses a drgn based monitoring script - + * tools/cgroup/iocost_monitor.py. For details on drgn, please see + * https://github.com/osandov/drgn. The ouput looks like the following. + * + * sdb RUN per=300ms cur_per=234.218:v203.695 busy= +1 vrate= 62.12% + * active weight hweight% inflt% dbt delay usages% + * test/a * 50/ 50 33.33/ 33.33 27.65 2 0*041 033:033:033 + * test/b * 100/ 100 66.67/ 66.67 17.56 0 0*000 066:079:077 + * + * - per : Timer period + * - cur_per : Internal wall and device vtime clock + * - vrate : Device virtual time rate against wall clock + * - weight : Surplus-adjusted and configured weights + * - hweight : Surplus-adjusted and configured hierarchical weights + * - inflt : The percentage of in-flight IO cost at the end of last period + * - del_ms : Deferred issuer delay induction level and duration + * - usages : Usage history + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/timer.h> +#include <linux/time64.h> +#include <linux/parser.h> +#include <linux/sched/signal.h> +#include <linux/blk-cgroup.h> +#include "blk-rq-qos.h" +#include "blk-stat.h" +#include "blk-wbt.h" + +#ifdef CONFIG_TRACEPOINTS + +/* copied from TRACE_CGROUP_PATH, see cgroup-internal.h */ +#define TRACE_IOCG_PATH_LEN 1024 +static DEFINE_SPINLOCK(trace_iocg_path_lock); +static char trace_iocg_path[TRACE_IOCG_PATH_LEN]; + +#define TRACE_IOCG_PATH(type, iocg, ...) \ + do { \ + unsigned long flags; \ + if (trace_iocost_##type##_enabled()) { \ + spin_lock_irqsave(&trace_iocg_path_lock, flags); \ + cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup, \ + trace_iocg_path, TRACE_IOCG_PATH_LEN); \ + trace_iocost_##type(iocg, trace_iocg_path, \ + ##__VA_ARGS__); \ + spin_unlock_irqrestore(&trace_iocg_path_lock, flags); \ + } \ + } while (0) + +#else /* CONFIG_TRACE_POINTS */ +#define TRACE_IOCG_PATH(type, iocg, ...) do { } while (0) +#endif /* CONFIG_TRACE_POINTS */ + +enum { + MILLION = 1000000, + + /* timer period is calculated from latency requirements, bound it */ + MIN_PERIOD = USEC_PER_MSEC, + MAX_PERIOD = USEC_PER_SEC, + + /* + * A cgroup's vtime can run 50% behind the device vtime, which + * serves as its IO credit buffer. Surplus weight adjustment is + * immediately canceled if the vtime margin runs below 10%. + */ + MARGIN_PCT = 50, + INUSE_MARGIN_PCT = 10, + + /* Have some play in waitq timer operations */ + WAITQ_TIMER_MARGIN_PCT = 5, + + /* + * vtime can wrap well within a reasonable uptime when vrate is + * consistently raised. Don't trust recorded cgroup vtime if the + * period counter indicates that it's older than 5mins. + */ + VTIME_VALID_DUR = 300 * USEC_PER_SEC, + + /* + * Remember the past three non-zero usages and use the max for + * surplus calculation. Three slots guarantee that we remember one + * full period usage from the last active stretch even after + * partial deactivation and re-activation periods. Don't start + * giving away weight before collecting two data points to prevent + * hweight adjustments based on one partial activation period. + */ + NR_USAGE_SLOTS = 3, + MIN_VALID_USAGES = 2, + + /* 1/64k is granular enough and can easily be handled w/ u32 */ + HWEIGHT_WHOLE = 1 << 16, + + /* + * As vtime is used to calculate the cost of each IO, it needs to + * be fairly high precision. For example, it should be able to + * represent the cost of a single page worth of discard with + * suffificient accuracy. At the same time, it should be able to + * represent reasonably long enough durations to be useful and + * convenient during operation. + * + * 1s worth of vtime is 2^37. This gives us both sub-nanosecond + * granularity and days of wrap-around time even at extreme vrates. + */ + VTIME_PER_SEC_SHIFT = 37, + VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT, + VTIME_PER_USEC = VTIME_PER_SEC / USEC_PER_SEC, + + /* bound vrate adjustments within two orders of magnitude */ + VRATE_MIN_PPM = 10000, /* 1% */ + VRATE_MAX_PPM = 100000000, /* 10000% */ + + VRATE_MIN = VTIME_PER_USEC * VRATE_MIN_PPM / MILLION, + VRATE_CLAMP_ADJ_PCT = 4, + + /* if IOs end up waiting for requests, issue less */ + RQ_WAIT_BUSY_PCT = 5, + + /* unbusy hysterisis */ + UNBUSY_THR_PCT = 75, + + /* don't let cmds which take a very long time pin lagging for too long */ + MAX_LAGGING_PERIODS = 10, + + /* + * If usage% * 1.25 + 2% is lower than hweight% by more than 3%, + * donate the surplus. + */ + SURPLUS_SCALE_PCT = 125, /* * 125% */ + SURPLUS_SCALE_ABS = HWEIGHT_WHOLE / 50, /* + 2% */ + SURPLUS_MIN_ADJ_DELTA = HWEIGHT_WHOLE / 33, /* 3% */ + + /* switch iff the conditions are met for longer than this */ + AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC, + + /* + * Count IO size in 4k pages. The 12bit shift helps keeping + * size-proportional components of cost calculation in closer + * numbers of digits to per-IO cost components. + */ + IOC_PAGE_SHIFT = 12, + IOC_PAGE_SIZE = 1 << IOC_PAGE_SHIFT, + IOC_SECT_TO_PAGE_SHIFT = IOC_PAGE_SHIFT - SECTOR_SHIFT, + + /* if apart further than 16M, consider randio for linear model */ + LCOEF_RANDIO_PAGES = 4096, +}; + +enum ioc_running { + IOC_IDLE, + IOC_RUNNING, + IOC_STOP, +}; + +/* io.cost.qos controls including per-dev enable of the whole controller */ +enum { + QOS_ENABLE, + QOS_CTRL, + NR_QOS_CTRL_PARAMS, +}; + +/* io.cost.qos params */ +enum { + QOS_RPPM, + QOS_RLAT, + QOS_WPPM, + QOS_WLAT, + QOS_MIN, + QOS_MAX, + NR_QOS_PARAMS, +}; + +/* io.cost.model controls */ +enum { + COST_CTRL, + COST_MODEL, + NR_COST_CTRL_PARAMS, +}; + +/* builtin linear cost model coefficients */ +enum { + I_LCOEF_RBPS, + I_LCOEF_RSEQIOPS, + I_LCOEF_RRANDIOPS, + I_LCOEF_WBPS, + I_LCOEF_WSEQIOPS, + I_LCOEF_WRANDIOPS, + NR_I_LCOEFS, +}; + +enum { + LCOEF_RPAGE, + LCOEF_RSEQIO, + LCOEF_RRANDIO, + LCOEF_WPAGE, + LCOEF_WSEQIO, + LCOEF_WRANDIO, + NR_LCOEFS, +}; + +enum { + AUTOP_INVALID, + AUTOP_HDD, + AUTOP_SSD_QD1, + AUTOP_SSD_DFL, + AUTOP_SSD_FAST, +}; + +struct ioc_gq; + +struct ioc_params { + u32 qos[NR_QOS_PARAMS]; + u64 i_lcoefs[NR_I_LCOEFS]; + u64 lcoefs[NR_LCOEFS]; + u32 too_fast_vrate_pct; + u32 too_slow_vrate_pct; +}; + +struct ioc_missed { + u32 nr_met; + u32 nr_missed; + u32 last_met; + u32 last_missed; +}; + +struct ioc_pcpu_stat { + struct ioc_missed missed[2]; + + u64 rq_wait_ns; + u64 last_rq_wait_ns; +}; + +/* per device */ +struct ioc { + struct rq_qos rqos; + + bool enabled; + + struct ioc_params params; + u32 period_us; + u32 margin_us; + u64 vrate_min; + u64 vrate_max; + + spinlock_t lock; + struct timer_list timer; + struct list_head active_iocgs; /* active cgroups */ + struct ioc_pcpu_stat __percpu *pcpu_stat; + + enum ioc_running running; + atomic64_t vtime_rate; + + seqcount_t period_seqcount; + u32 period_at; /* wallclock starttime */ + u64 period_at_vtime; /* vtime starttime */ + + atomic64_t cur_period; /* inc'd each period */ + int busy_level; /* saturation history */ + + u64 inuse_margin_vtime; + bool weights_updated; + atomic_t hweight_gen; /* for lazy hweights */ + + u64 autop_too_fast_at; + u64 autop_too_slow_at; + int autop_idx; + bool user_qos_params:1; + bool user_cost_model:1; +}; + +/* per device-cgroup pair */ +struct ioc_gq { + struct blkg_policy_data pd; + struct ioc *ioc; + + /* + * A iocg can get its weight from two sources - an explicit + * per-device-cgroup configuration or the default weight of the + * cgroup. `cfg_weight` is the explicit per-device-cgroup + * configuration. `weight` is the effective considering both + * sources. + * + * When an idle cgroup becomes active its `active` goes from 0 to + * `weight`. `inuse` is the surplus adjusted active weight. + * `active` and `inuse` are used to calculate `hweight_active` and + * `hweight_inuse`. + * + * `last_inuse` remembers `inuse` while an iocg is idle to persist + * surplus adjustments. + */ + u32 cfg_weight; + u32 weight; + u32 active; + u32 inuse; + u32 last_inuse; + + sector_t cursor; /* to detect randio */ + + /* + * `vtime` is this iocg's vtime cursor which progresses as IOs are + * issued. If lagging behind device vtime, the delta represents + * the currently available IO budget. If runnning ahead, the + * overage. + * + * `vtime_done` is the same but progressed on completion rather + * than issue. The delta behind `vtime` represents the cost of + * currently in-flight IOs. + * + * `last_vtime` is used to remember `vtime` at the end of the last + * period to calculate utilization. + */ + atomic64_t vtime; + atomic64_t done_vtime; + atomic64_t abs_vdebt; + u64 last_vtime; + + /* + * The period this iocg was last active in. Used for deactivation + * and invalidating `vtime`. + */ + atomic64_t active_period; + struct list_head active_list; + + /* see __propagate_active_weight() and current_hweight() for details */ + u64 child_active_sum; + u64 child_inuse_sum; + int hweight_gen; + u32 hweight_active; + u32 hweight_inuse; + bool has_surplus; + + struct wait_queue_head waitq; + struct hrtimer waitq_timer; + struct hrtimer delay_timer; + + /* usage is recorded as fractions of HWEIGHT_WHOLE */ + int usage_idx; + u32 usages[NR_USAGE_SLOTS]; + + /* this iocg's depth in the hierarchy and ancestors including self */ + int level; + struct ioc_gq *ancestors[]; +}; + +/* per cgroup */ +struct ioc_cgrp { + struct blkcg_policy_data cpd; + unsigned int dfl_weight; +}; + +struct ioc_now { + u64 now_ns; + u32 now; + u64 vnow; + u64 vrate; +}; + +struct iocg_wait { + struct wait_queue_entry wait; + struct bio *bio; + u64 abs_cost; + bool committed; +}; + +struct iocg_wake_ctx { + struct ioc_gq *iocg; + u32 hw_inuse; + s64 vbudget; +}; + +static const struct ioc_params autop[] = { + [AUTOP_HDD] = { + .qos = { + [QOS_RLAT] = 50000, /* 50ms */ + [QOS_WLAT] = 50000, + [QOS_MIN] = VRATE_MIN_PPM, + [QOS_MAX] = VRATE_MAX_PPM, + }, + .i_lcoefs = { + [I_LCOEF_RBPS] = 174019176, + [I_LCOEF_RSEQIOPS] = 41708, + [I_LCOEF_RRANDIOPS] = 370, + [I_LCOEF_WBPS] = 178075866, + [I_LCOEF_WSEQIOPS] = 42705, + [I_LCOEF_WRANDIOPS] = 378, + }, + }, + [AUTOP_SSD_QD1] = { + .qos = { + [QOS_RLAT] = 25000, /* 25ms */ + [QOS_WLAT] = 25000, + [QOS_MIN] = VRATE_MIN_PPM, + [QOS_MAX] = VRATE_MAX_PPM, + }, + .i_lcoefs = { + [I_LCOEF_RBPS] = 245855193, + [I_LCOEF_RSEQIOPS] = 61575, + [I_LCOEF_RRANDIOPS] = 6946, + [I_LCOEF_WBPS] = 141365009, + [I_LCOEF_WSEQIOPS] = 33716, + [I_LCOEF_WRANDIOPS] = 26796, + }, + }, + [AUTOP_SSD_DFL] = { + .qos = { + [QOS_RLAT] = 25000, /* 25ms */ + [QOS_WLAT] = 25000, + [QOS_MIN] = VRATE_MIN_PPM, + [QOS_MAX] = VRATE_MAX_PPM, + }, + .i_lcoefs = { + [I_LCOEF_RBPS] = 488636629, + [I_LCOEF_RSEQIOPS] = 8932, + [I_LCOEF_RRANDIOPS] = 8518, + [I_LCOEF_WBPS] = 427891549, + [I_LCOEF_WSEQIOPS] = 28755, + [I_LCOEF_WRANDIOPS] = 21940, + }, + .too_fast_vrate_pct = 500, + }, + [AUTOP_SSD_FAST] = { + .qos = { + [QOS_RLAT] = 5000, /* 5ms */ + [QOS_WLAT] = 5000, + [QOS_MIN] = VRATE_MIN_PPM, + [QOS_MAX] = VRATE_MAX_PPM, + }, + .i_lcoefs = { + [I_LCOEF_RBPS] = 3102524156LLU, + [I_LCOEF_RSEQIOPS] = 724816, + [I_LCOEF_RRANDIOPS] = 778122, + [I_LCOEF_WBPS] = 1742780862LLU, + [I_LCOEF_WSEQIOPS] = 425702, + [I_LCOEF_WRANDIOPS] = 443193, + }, + .too_slow_vrate_pct = 10, + }, +}; + +/* + * vrate adjust percentages indexed by ioc->busy_level. We adjust up on + * vtime credit shortage and down on device saturation. + */ +static u32 vrate_adj_pct[] = + { 0, 0, 0, 0, + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, + 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, + 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 16 }; + +static struct blkcg_policy blkcg_policy_iocost; + +/* accessors and helpers */ +static struct ioc *rqos_to_ioc(struct rq_qos *rqos) +{ + return container_of(rqos, struct ioc, rqos); +} + +static struct ioc *q_to_ioc(struct request_queue *q) +{ + return rqos_to_ioc(rq_qos_id(q, RQ_QOS_COST)); +} + +static const char *q_name(struct request_queue *q) +{ + if (test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + return kobject_name(q->kobj.parent); + else + return "<unknown>"; +} + +static const char __maybe_unused *ioc_name(struct ioc *ioc) +{ + return q_name(ioc->rqos.q); +} + +static struct ioc_gq *pd_to_iocg(struct blkg_policy_data *pd) +{ + return pd ? container_of(pd, struct ioc_gq, pd) : NULL; +} + +static struct ioc_gq *blkg_to_iocg(struct blkcg_gq *blkg) +{ + return pd_to_iocg(blkg_to_pd(blkg, &blkcg_policy_iocost)); +} + +static struct blkcg_gq *iocg_to_blkg(struct ioc_gq *iocg) +{ + return pd_to_blkg(&iocg->pd); +} + +static struct ioc_cgrp *blkcg_to_iocc(struct blkcg *blkcg) +{ + return container_of(blkcg_to_cpd(blkcg, &blkcg_policy_iocost), + struct ioc_cgrp, cpd); +} + +/* + * Scale @abs_cost to the inverse of @hw_inuse. The lower the hierarchical + * weight, the more expensive each IO. Must round up. + */ +static u64 abs_cost_to_cost(u64 abs_cost, u32 hw_inuse) +{ + return DIV64_U64_ROUND_UP(abs_cost * HWEIGHT_WHOLE, hw_inuse); +} + +/* + * The inverse of abs_cost_to_cost(). Must round up. + */ +static u64 cost_to_abs_cost(u64 cost, u32 hw_inuse) +{ + return DIV64_U64_ROUND_UP(cost * hw_inuse, HWEIGHT_WHOLE); +} + +static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost) +{ + bio->bi_iocost_cost = cost; + atomic64_add(cost, &iocg->vtime); +} + +#define CREATE_TRACE_POINTS +#include <trace/events/iocost.h> + +/* latency Qos params changed, update period_us and all the dependent params */ +static void ioc_refresh_period_us(struct ioc *ioc) +{ + u32 ppm, lat, multi, period_us; + + lockdep_assert_held(&ioc->lock); + + /* pick the higher latency target */ + if (ioc->params.qos[QOS_RLAT] >= ioc->params.qos[QOS_WLAT]) { + ppm = ioc->params.qos[QOS_RPPM]; + lat = ioc->params.qos[QOS_RLAT]; + } else { + ppm = ioc->params.qos[QOS_WPPM]; + lat = ioc->params.qos[QOS_WLAT]; + } + + /* + * We want the period to be long enough to contain a healthy number + * of IOs while short enough for granular control. Define it as a + * multiple of the latency target. Ideally, the multiplier should + * be scaled according to the percentile so that it would nominally + * contain a certain number of requests. Let's be simpler and + * scale it linearly so that it's 2x >= pct(90) and 10x at pct(50). + */ + if (ppm) + multi = max_t(u32, (MILLION - ppm) / 50000, 2); + else + multi = 2; + period_us = multi * lat; + period_us = clamp_t(u32, period_us, MIN_PERIOD, MAX_PERIOD); + + /* calculate dependent params */ + ioc->period_us = period_us; + ioc->margin_us = period_us * MARGIN_PCT / 100; + ioc->inuse_margin_vtime = DIV64_U64_ROUND_UP( + period_us * VTIME_PER_USEC * INUSE_MARGIN_PCT, 100); +} + +static int ioc_autop_idx(struct ioc *ioc) +{ + int idx = ioc->autop_idx; + const struct ioc_params *p = &autop[idx]; + u32 vrate_pct; + u64 now_ns; + + /* rotational? */ + if (!blk_queue_nonrot(ioc->rqos.q)) + return AUTOP_HDD; + + /* handle SATA SSDs w/ broken NCQ */ + if (blk_queue_depth(ioc->rqos.q) == 1) + return AUTOP_SSD_QD1; + + /* use one of the normal ssd sets */ + if (idx < AUTOP_SSD_DFL) + return AUTOP_SSD_DFL; + + /* if user is overriding anything, maintain what was there */ + if (ioc->user_qos_params || ioc->user_cost_model) + return idx; + + /* step up/down based on the vrate */ + vrate_pct = div64_u64(atomic64_read(&ioc->vtime_rate) * 100, + VTIME_PER_USEC); + now_ns = ktime_get_ns(); + + if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) { + if (!ioc->autop_too_fast_at) + ioc->autop_too_fast_at = now_ns; + if (now_ns - ioc->autop_too_fast_at >= AUTOP_CYCLE_NSEC) + return idx + 1; + } else { + ioc->autop_too_fast_at = 0; + } + + if (p->too_slow_vrate_pct && p->too_slow_vrate_pct >= vrate_pct) { + if (!ioc->autop_too_slow_at) + ioc->autop_too_slow_at = now_ns; + if (now_ns - ioc->autop_too_slow_at >= AUTOP_CYCLE_NSEC) + return idx - 1; + } else { + ioc->autop_too_slow_at = 0; + } + + return idx; +} + +/* + * Take the followings as input + * + * @bps maximum sequential throughput + * @seqiops maximum sequential 4k iops + * @randiops maximum random 4k iops + * + * and calculate the linear model cost coefficients. + * + * *@page per-page cost 1s / (@bps / 4096) + * *@seqio base cost of a seq IO max((1s / @seqiops) - *@page, 0) + * @randiops base cost of a rand IO max((1s / @randiops) - *@page, 0) + */ +static void calc_lcoefs(u64 bps, u64 seqiops, u64 randiops, + u64 *page, u64 *seqio, u64 *randio) +{ + u64 v; + + *page = *seqio = *randio = 0; + + if (bps) + *page = DIV64_U64_ROUND_UP(VTIME_PER_SEC, + DIV_ROUND_UP_ULL(bps, IOC_PAGE_SIZE)); + + if (seqiops) { + v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, seqiops); + if (v > *page) + *seqio = v - *page; + } + + if (randiops) { + v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, randiops); + if (v > *page) + *randio = v - *page; + } +} + +static void ioc_refresh_lcoefs(struct ioc *ioc) +{ + u64 *u = ioc->params.i_lcoefs; + u64 *c = ioc->params.lcoefs; + + calc_lcoefs(u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS], + &c[LCOEF_RPAGE], &c[LCOEF_RSEQIO], &c[LCOEF_RRANDIO]); + calc_lcoefs(u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS], + &c[LCOEF_WPAGE], &c[LCOEF_WSEQIO], &c[LCOEF_WRANDIO]); +} + +static bool ioc_refresh_params(struct ioc *ioc, bool force) +{ + const struct ioc_params *p; + int idx; + + lockdep_assert_held(&ioc->lock); + + idx = ioc_autop_idx(ioc); + p = &autop[idx]; + + if (idx == ioc->autop_idx && !force) + return false; + + if (idx != ioc->autop_idx) + atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC); + + ioc->autop_idx = idx; + ioc->autop_too_fast_at = 0; + ioc->autop_too_slow_at = 0; + + if (!ioc->user_qos_params) + memcpy(ioc->params.qos, p->qos, sizeof(p->qos)); + if (!ioc->user_cost_model) + memcpy(ioc->params.i_lcoefs, p->i_lcoefs, sizeof(p->i_lcoefs)); + + ioc_refresh_period_us(ioc); + ioc_refresh_lcoefs(ioc); + + ioc->vrate_min = DIV64_U64_ROUND_UP((u64)ioc->params.qos[QOS_MIN] * + VTIME_PER_USEC, MILLION); + ioc->vrate_max = div64_u64((u64)ioc->params.qos[QOS_MAX] * + VTIME_PER_USEC, MILLION); + + return true; +} + +/* take a snapshot of the current [v]time and vrate */ +static void ioc_now(struct ioc *ioc, struct ioc_now *now) +{ + unsigned seq; + + now->now_ns = ktime_get(); + now->now = ktime_to_us(now->now_ns); + now->vrate = atomic64_read(&ioc->vtime_rate); + + /* + * The current vtime is + * + * vtime at period start + (wallclock time since the start) * vrate + * + * As a consistent snapshot of `period_at_vtime` and `period_at` is + * needed, they're seqcount protected. + */ + do { + seq = read_seqcount_begin(&ioc->period_seqcount); + now->vnow = ioc->period_at_vtime + + (now->now - ioc->period_at) * now->vrate; + } while (read_seqcount_retry(&ioc->period_seqcount, seq)); +} + +static void ioc_start_period(struct ioc *ioc, struct ioc_now *now) +{ + lockdep_assert_held(&ioc->lock); + WARN_ON_ONCE(ioc->running != IOC_RUNNING); + + write_seqcount_begin(&ioc->period_seqcount); + ioc->period_at = now->now; + ioc->period_at_vtime = now->vnow; + write_seqcount_end(&ioc->period_seqcount); + + ioc->timer.expires = jiffies + usecs_to_jiffies(ioc->period_us); + add_timer(&ioc->timer); +} + +/* + * Update @iocg's `active` and `inuse` to @active and @inuse, update level + * weight sums and propagate upwards accordingly. + */ +static void __propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse) +{ + struct ioc *ioc = iocg->ioc; + int lvl; + + lockdep_assert_held(&ioc->lock); + + inuse = min(active, inuse); + + for (lvl = iocg->level - 1; lvl >= 0; lvl--) { + struct ioc_gq *parent = iocg->ancestors[lvl]; + struct ioc_gq *child = iocg->ancestors[lvl + 1]; + u32 parent_active = 0, parent_inuse = 0; + + /* update the level sums */ + parent->child_active_sum += (s32)(active - child->active); + parent->child_inuse_sum += (s32)(inuse - child->inuse); + /* apply the udpates */ + child->active = active; + child->inuse = inuse; + + /* + * The delta between inuse and active sums indicates that + * that much of weight is being given away. Parent's inuse + * and active should reflect the ratio. + */ + if (parent->child_active_sum) { + parent_active = parent->weight; + parent_inuse = DIV64_U64_ROUND_UP( + parent_active * parent->child_inuse_sum, + parent->child_active_sum); + } + + /* do we need to keep walking up? */ + if (parent_active == parent->active && + parent_inuse == parent->inuse) + break; + + active = parent_active; + inuse = parent_inuse; + } + + ioc->weights_updated = true; +} + +static void commit_active_weights(struct ioc *ioc) +{ + lockdep_assert_held(&ioc->lock); + + if (ioc->weights_updated) { + /* paired with rmb in current_hweight(), see there */ + smp_wmb(); + atomic_inc(&ioc->hweight_gen); + ioc->weights_updated = false; + } +} + +static void propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse) +{ + __propagate_active_weight(iocg, active, inuse); + commit_active_weights(iocg->ioc); +} + +static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep) +{ + struct ioc *ioc = iocg->ioc; + int lvl; + u32 hwa, hwi; + int ioc_gen; + + /* hot path - if uptodate, use cached */ + ioc_gen = atomic_read(&ioc->hweight_gen); + if (ioc_gen == iocg->hweight_gen) + goto out; + + /* + * Paired with wmb in commit_active_weights(). If we saw the + * updated hweight_gen, all the weight updates from + * __propagate_active_weight() are visible too. + * + * We can race with weight updates during calculation and get it + * wrong. However, hweight_gen would have changed and a future + * reader will recalculate and we're guaranteed to discard the + * wrong result soon. + */ + smp_rmb(); + + hwa = hwi = HWEIGHT_WHOLE; + for (lvl = 0; lvl <= iocg->level - 1; lvl++) { + struct ioc_gq *parent = iocg->ancestors[lvl]; + struct ioc_gq *child = iocg->ancestors[lvl + 1]; + u32 active_sum = READ_ONCE(parent->child_active_sum); + u32 inuse_sum = READ_ONCE(parent->child_inuse_sum); + u32 active = READ_ONCE(child->active); + u32 inuse = READ_ONCE(child->inuse); + + /* we can race with deactivations and either may read as zero */ + if (!active_sum || !inuse_sum) + continue; + + active_sum = max(active, active_sum); + hwa = hwa * active / active_sum; /* max 16bits * 10000 */ + + inuse_sum = max(inuse, inuse_sum); + hwi = hwi * inuse / inuse_sum; /* max 16bits * 10000 */ + } + + iocg->hweight_active = max_t(u32, hwa, 1); + iocg->hweight_inuse = max_t(u32, hwi, 1); + iocg->hweight_gen = ioc_gen; +out: + if (hw_activep) + *hw_activep = iocg->hweight_active; + if (hw_inusep) + *hw_inusep = iocg->hweight_inuse; +} + +static void weight_updated(struct ioc_gq *iocg) +{ + struct ioc *ioc = iocg->ioc; + struct blkcg_gq *blkg = iocg_to_blkg(iocg); + struct ioc_cgrp *iocc = blkcg_to_iocc(blkg->blkcg); + u32 weight; + + lockdep_assert_held(&ioc->lock); + + weight = iocg->cfg_weight ?: iocc->dfl_weight; + if (weight != iocg->weight && iocg->active) + propagate_active_weight(iocg, weight, + DIV64_U64_ROUND_UP(iocg->inuse * weight, iocg->weight)); + iocg->weight = weight; +} + +static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now) +{ + struct ioc *ioc = iocg->ioc; + u64 last_period, cur_period, max_period_delta; + u64 vtime, vmargin, vmin; + int i; + + /* + * If seem to be already active, just update the stamp to tell the + * timer that we're still active. We don't mind occassional races. + */ + if (!list_empty(&iocg->active_list)) { + ioc_now(ioc, now); + cur_period = atomic64_read(&ioc->cur_period); + if (atomic64_read(&iocg->active_period) != cur_period) + atomic64_set(&iocg->active_period, cur_period); + return true; + } + + /* racy check on internal node IOs, treat as root level IOs */ + if (iocg->child_active_sum) + return false; + + spin_lock_irq(&ioc->lock); + + ioc_now(ioc, now); + + /* update period */ + cur_period = atomic64_read(&ioc->cur_period); + last_period = atomic64_read(&iocg->active_period); + atomic64_set(&iocg->active_period, cur_period); + + /* already activated or breaking leaf-only constraint? */ + for (i = iocg->level; i > 0; i--) + if (!list_empty(&iocg->active_list)) + goto fail_unlock; + if (iocg->child_active_sum) + goto fail_unlock; + + /* + * vtime may wrap when vrate is raised substantially due to + * underestimated IO costs. Look at the period and ignore its + * vtime if the iocg has been idle for too long. Also, cap the + * budget it can start with to the margin. + */ + max_period_delta = DIV64_U64_ROUND_UP(VTIME_VALID_DUR, ioc->period_us); + vtime = atomic64_read(&iocg->vtime); + vmargin = ioc->margin_us * now->vrate; + vmin = now->vnow - vmargin; + + if (last_period + max_period_delta < cur_period || + time_before64(vtime, vmin)) { + atomic64_add(vmin - vtime, &iocg->vtime); + atomic64_add(vmin - vtime, &iocg->done_vtime); + vtime = vmin; + } + + /* + * Activate, propagate weight and start period timer if not + * running. Reset hweight_gen to avoid accidental match from + * wrapping. + */ + iocg->hweight_gen = atomic_read(&ioc->hweight_gen) - 1; + list_add(&iocg->active_list, &ioc->active_iocgs); + propagate_active_weight(iocg, iocg->weight, + iocg->last_inuse ?: iocg->weight); + + TRACE_IOCG_PATH(iocg_activate, iocg, now, + last_period, cur_period, vtime); + + iocg->last_vtime = vtime; + + if (ioc->running == IOC_IDLE) { + ioc->running = IOC_RUNNING; + ioc_start_period(ioc, now); + } + + spin_unlock_irq(&ioc->lock); + return true; + +fail_unlock: + spin_unlock_irq(&ioc->lock); + return false; +} + +static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode, + int flags, void *key) +{ + struct iocg_wait *wait = container_of(wq_entry, struct iocg_wait, wait); + struct iocg_wake_ctx *ctx = (struct iocg_wake_ctx *)key; + u64 cost = abs_cost_to_cost(wait->abs_cost, ctx->hw_inuse); + + ctx->vbudget -= cost; + + if (ctx->vbudget < 0) + return -1; + + iocg_commit_bio(ctx->iocg, wait->bio, cost); + + /* + * autoremove_wake_function() removes the wait entry only when it + * actually changed the task state. We want the wait always + * removed. Remove explicitly and use default_wake_function(). + */ + list_del_init(&wq_entry->entry); + wait->committed = true; + + default_wake_function(wq_entry, mode, flags, key); + return 0; +} + +static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now) +{ + struct ioc *ioc = iocg->ioc; + struct iocg_wake_ctx ctx = { .iocg = iocg }; + u64 margin_ns = (u64)(ioc->period_us * + WAITQ_TIMER_MARGIN_PCT / 100) * NSEC_PER_USEC; + u64 abs_vdebt, vdebt, vshortage, expires, oexpires; + s64 vbudget; + u32 hw_inuse; + + lockdep_assert_held(&iocg->waitq.lock); + + current_hweight(iocg, NULL, &hw_inuse); + vbudget = now->vnow - atomic64_read(&iocg->vtime); + + /* pay off debt */ + abs_vdebt = atomic64_read(&iocg->abs_vdebt); + vdebt = abs_cost_to_cost(abs_vdebt, hw_inuse); + if (vdebt && vbudget > 0) { + u64 delta = min_t(u64, vbudget, vdebt); + u64 abs_delta = min(cost_to_abs_cost(delta, hw_inuse), + abs_vdebt); + + atomic64_add(delta, &iocg->vtime); + atomic64_add(delta, &iocg->done_vtime); + atomic64_sub(abs_delta, &iocg->abs_vdebt); + if (WARN_ON_ONCE(atomic64_read(&iocg->abs_vdebt) < 0)) + atomic64_set(&iocg->abs_vdebt, 0); + } + + /* + * Wake up the ones which are due and see how much vtime we'll need + * for the next one. + */ + ctx.hw_inuse = hw_inuse; + ctx.vbudget = vbudget - vdebt; + __wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx); + if (!waitqueue_active(&iocg->waitq)) + return; + if (WARN_ON_ONCE(ctx.vbudget >= 0)) + return; + + /* determine next wakeup, add a quarter margin to guarantee chunking */ + vshortage = -ctx.vbudget; + expires = now->now_ns + + DIV64_U64_ROUND_UP(vshortage, now->vrate) * NSEC_PER_USEC; + expires += margin_ns / 4; + + /* if already active and close enough, don't bother */ + oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->waitq_timer)); + if (hrtimer_is_queued(&iocg->waitq_timer) && + abs(oexpires - expires) <= margin_ns / 4) + return; + + hrtimer_start_range_ns(&iocg->waitq_timer, ns_to_ktime(expires), + margin_ns / 4, HRTIMER_MODE_ABS); +} + +static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer) +{ + struct ioc_gq *iocg = container_of(timer, struct ioc_gq, waitq_timer); + struct ioc_now now; + unsigned long flags; + + ioc_now(iocg->ioc, &now); + + spin_lock_irqsave(&iocg->waitq.lock, flags); + iocg_kick_waitq(iocg, &now); + spin_unlock_irqrestore(&iocg->waitq.lock, flags); + + return HRTIMER_NORESTART; +} + +static void iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now, u64 cost) +{ + struct ioc *ioc = iocg->ioc; + struct blkcg_gq *blkg = iocg_to_blkg(iocg); + u64 vtime = atomic64_read(&iocg->vtime); + u64 vmargin = ioc->margin_us * now->vrate; + u64 margin_ns = ioc->margin_us * NSEC_PER_USEC; + u64 expires, oexpires; + u32 hw_inuse; + + /* debt-adjust vtime */ + current_hweight(iocg, NULL, &hw_inuse); + vtime += abs_cost_to_cost(atomic64_read(&iocg->abs_vdebt), hw_inuse); + + /* clear or maintain depending on the overage */ + if (time_before_eq64(vtime, now->vnow)) { + blkcg_clear_delay(blkg); + return; + } + if (!atomic_read(&blkg->use_delay) && + time_before_eq64(vtime, now->vnow + vmargin)) + return; + + /* use delay */ + if (cost) { + u64 cost_ns = DIV64_U64_ROUND_UP(cost * NSEC_PER_USEC, + now->vrate); + blkcg_add_delay(blkg, now->now_ns, cost_ns); + } + blkcg_use_delay(blkg); + + expires = now->now_ns + DIV64_U64_ROUND_UP(vtime - now->vnow, + now->vrate) * NSEC_PER_USEC; + + /* if already active and close enough, don't bother */ + oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer)); + if (hrtimer_is_queued(&iocg->delay_timer) && + abs(oexpires - expires) <= margin_ns / 4) + return; + + hrtimer_start_range_ns(&iocg->delay_timer, ns_to_ktime(expires), + margin_ns / 4, HRTIMER_MODE_ABS); +} + +static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer) +{ + struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer); + struct ioc_now now; + + ioc_now(iocg->ioc, &now); + iocg_kick_delay(iocg, &now, 0); + + return HRTIMER_NORESTART; +} + +static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p) +{ + u32 nr_met[2] = { }; + u32 nr_missed[2] = { }; + u64 rq_wait_ns = 0; + int cpu, rw; + + for_each_online_cpu(cpu) { + struct ioc_pcpu_stat *stat = per_cpu_ptr(ioc->pcpu_stat, cpu); + u64 this_rq_wait_ns; + + for (rw = READ; rw <= WRITE; rw++) { + u32 this_met = READ_ONCE(stat->missed[rw].nr_met); + u32 this_missed = READ_ONCE(stat->missed[rw].nr_missed); + + nr_met[rw] += this_met - stat->missed[rw].last_met; + nr_missed[rw] += this_missed - stat->missed[rw].last_missed; + stat->missed[rw].last_met = this_met; + stat->missed[rw].last_missed = this_missed; + } + + this_rq_wait_ns = READ_ONCE(stat->rq_wait_ns); + rq_wait_ns += this_rq_wait_ns - stat->last_rq_wait_ns; + stat->last_rq_wait_ns = this_rq_wait_ns; + } + + for (rw = READ; rw <= WRITE; rw++) { + if (nr_met[rw] + nr_missed[rw]) + missed_ppm_ar[rw] = + DIV64_U64_ROUND_UP((u64)nr_missed[rw] * MILLION, + nr_met[rw] + nr_missed[rw]); + else + missed_ppm_ar[rw] = 0; + } + + *rq_wait_pct_p = div64_u64(rq_wait_ns * 100, + ioc->period_us * NSEC_PER_USEC); +} + +/* was iocg idle this period? */ +static bool iocg_is_idle(struct ioc_gq *iocg) +{ + struct ioc *ioc = iocg->ioc; + + /* did something get issued this period? */ + if (atomic64_read(&iocg->active_period) == + atomic64_read(&ioc->cur_period)) + return false; + + /* is something in flight? */ + if (atomic64_read(&iocg->done_vtime) < atomic64_read(&iocg->vtime)) + return false; + + return true; +} + +/* returns usage with margin added if surplus is large enough */ +static u32 surplus_adjusted_hweight_inuse(u32 usage, u32 hw_inuse) +{ + /* add margin */ + usage = DIV_ROUND_UP(usage * SURPLUS_SCALE_PCT, 100); + usage += SURPLUS_SCALE_ABS; + + /* don't bother if the surplus is too small */ + if (usage + SURPLUS_MIN_ADJ_DELTA > hw_inuse) + return 0; + + return usage; +} + +static void ioc_timer_fn(struct timer_list *timer) +{ + struct ioc *ioc = container_of(timer, struct ioc, timer); + struct ioc_gq *iocg, *tiocg; + struct ioc_now now; + int nr_surpluses = 0, nr_shortages = 0, nr_lagging = 0; + u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM]; + u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM]; + u32 missed_ppm[2], rq_wait_pct; + u64 period_vtime; + int i; + + /* how were the latencies during the period? */ + ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct); + + /* take care of active iocgs */ + spin_lock_irq(&ioc->lock); + + ioc_now(ioc, &now); + + period_vtime = now.vnow - ioc->period_at_vtime; + if (WARN_ON_ONCE(!period_vtime)) { + spin_unlock_irq(&ioc->lock); + return; + } + + /* + * Waiters determine the sleep durations based on the vrate they + * saw at the time of sleep. If vrate has increased, some waiters + * could be sleeping for too long. Wake up tardy waiters which + * should have woken up in the last period and expire idle iocgs. + */ + list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) { + if (!waitqueue_active(&iocg->waitq) && + !atomic64_read(&iocg->abs_vdebt) && !iocg_is_idle(iocg)) + continue; + + spin_lock(&iocg->waitq.lock); + + if (waitqueue_active(&iocg->waitq) || + atomic64_read(&iocg->abs_vdebt)) { + /* might be oversleeping vtime / hweight changes, kick */ + iocg_kick_waitq(iocg, &now); + iocg_kick_delay(iocg, &now, 0); + } else if (iocg_is_idle(iocg)) { + /* no waiter and idle, deactivate */ + iocg->last_inuse = iocg->inuse; + __propagate_active_weight(iocg, 0, 0); + list_del_init(&iocg->active_list); + } + + spin_unlock(&iocg->waitq.lock); + } + commit_active_weights(ioc); + + /* calc usages and see whether some weights need to be moved around */ + list_for_each_entry(iocg, &ioc->active_iocgs, active_list) { + u64 vdone, vtime, vusage, vmargin, vmin; + u32 hw_active, hw_inuse, usage; + + /* + * Collect unused and wind vtime closer to vnow to prevent + * iocgs from accumulating a large amount of budget. + */ + vdone = atomic64_read(&iocg->done_vtime); + vtime = atomic64_read(&iocg->vtime); + current_hweight(iocg, &hw_active, &hw_inuse); + + /* + * Latency QoS detection doesn't account for IOs which are + * in-flight for longer than a period. Detect them by + * comparing vdone against period start. If lagging behind + * IOs from past periods, don't increase vrate. + */ + if (!atomic_read(&iocg_to_blkg(iocg)->use_delay) && + time_after64(vtime, vdone) && + time_after64(vtime, now.vnow - + MAX_LAGGING_PERIODS * period_vtime) && + time_before64(vdone, now.vnow - period_vtime)) + nr_lagging++; + + if (waitqueue_active(&iocg->waitq)) + vusage = now.vnow - iocg->last_vtime; + else if (time_before64(iocg->last_vtime, vtime)) + vusage = vtime - iocg->last_vtime; + else + vusage = 0; + + iocg->last_vtime += vusage; + /* + * Factor in in-flight vtime into vusage to avoid + * high-latency completions appearing as idle. This should + * be done after the above ->last_time adjustment. + */ + vusage = max(vusage, vtime - vdone); + + /* calculate hweight based usage ratio and record */ + if (vusage) { + usage = DIV64_U64_ROUND_UP(vusage * hw_inuse, + period_vtime); + iocg->usage_idx = (iocg->usage_idx + 1) % NR_USAGE_SLOTS; + iocg->usages[iocg->usage_idx] = usage; + } else { + usage = 0; + } + + /* see whether there's surplus vtime */ + vmargin = ioc->margin_us * now.vrate; + vmin = now.vnow - vmargin; + + iocg->has_surplus = false; + + if (!waitqueue_active(&iocg->waitq) && + time_before64(vtime, vmin)) { + u64 delta = vmin - vtime; + + /* throw away surplus vtime */ + atomic64_add(delta, &iocg->vtime); + atomic64_add(delta, &iocg->done_vtime); + iocg->last_vtime += delta; + /* if usage is sufficiently low, maybe it can donate */ + if (surplus_adjusted_hweight_inuse(usage, hw_inuse)) { + iocg->has_surplus = true; + nr_surpluses++; + } + } else if (hw_inuse < hw_active) { + u32 new_hwi, new_inuse; + + /* was donating but might need to take back some */ + if (waitqueue_active(&iocg->waitq)) { + new_hwi = hw_active; + } else { + new_hwi = max(hw_inuse, + usage * SURPLUS_SCALE_PCT / 100 + + SURPLUS_SCALE_ABS); + } + + new_inuse = div64_u64((u64)iocg->inuse * new_hwi, + hw_inuse); + new_inuse = clamp_t(u32, new_inuse, 1, iocg->active); + + if (new_inuse > iocg->inuse) { + TRACE_IOCG_PATH(inuse_takeback, iocg, &now, + iocg->inuse, new_inuse, + hw_inuse, new_hwi); + __propagate_active_weight(iocg, iocg->weight, + new_inuse); + } + } else { + /* genuninely out of vtime */ + nr_shortages++; + } + } + + if (!nr_shortages || !nr_surpluses) + goto skip_surplus_transfers; + + /* there are both shortages and surpluses, transfer surpluses */ + list_for_each_entry(iocg, &ioc->active_iocgs, active_list) { + u32 usage, hw_active, hw_inuse, new_hwi, new_inuse; + int nr_valid = 0; + + if (!iocg->has_surplus) + continue; + + /* base the decision on max historical usage */ + for (i = 0, usage = 0; i < NR_USAGE_SLOTS; i++) { + if (iocg->usages[i]) { + usage = max(usage, iocg->usages[i]); + nr_valid++; + } + } + if (nr_valid < MIN_VALID_USAGES) + continue; + + current_hweight(iocg, &hw_active, &hw_inuse); + new_hwi = surplus_adjusted_hweight_inuse(usage, hw_inuse); + if (!new_hwi) + continue; + + new_inuse = DIV64_U64_ROUND_UP((u64)iocg->inuse * new_hwi, + hw_inuse); + if (new_inuse < iocg->inuse) { + TRACE_IOCG_PATH(inuse_giveaway, iocg, &now, + iocg->inuse, new_inuse, + hw_inuse, new_hwi); + __propagate_active_weight(iocg, iocg->weight, new_inuse); + } + } +skip_surplus_transfers: + commit_active_weights(ioc); + + /* + * If q is getting clogged or we're missing too much, we're issuing + * too much IO and should lower vtime rate. If we're not missing + * and experiencing shortages but not surpluses, we're too stingy + * and should increase vtime rate. + */ + if (rq_wait_pct > RQ_WAIT_BUSY_PCT || + missed_ppm[READ] > ppm_rthr || + missed_ppm[WRITE] > ppm_wthr) { + ioc->busy_level = max(ioc->busy_level, 0); + ioc->busy_level++; + } else if (nr_lagging) { + ioc->busy_level = max(ioc->busy_level, 0); + } else if (nr_shortages && !nr_surpluses && + rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 && + missed_ppm[READ] <= ppm_rthr * UNBUSY_THR_PCT / 100 && + missed_ppm[WRITE] <= ppm_wthr * UNBUSY_THR_PCT / 100) { + ioc->busy_level = min(ioc->busy_level, 0); + ioc->busy_level--; + } else { + ioc->busy_level = 0; + } + + ioc->busy_level = clamp(ioc->busy_level, -1000, 1000); + + if (ioc->busy_level) { + u64 vrate = atomic64_read(&ioc->vtime_rate); + u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max; + + /* rq_wait signal is always reliable, ignore user vrate_min */ + if (rq_wait_pct > RQ_WAIT_BUSY_PCT) + vrate_min = VRATE_MIN; + + /* + * If vrate is out of bounds, apply clamp gradually as the + * bounds can change abruptly. Otherwise, apply busy_level + * based adjustment. + */ + if (vrate < vrate_min) { + vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT), + 100); + vrate = min(vrate, vrate_min); + } else if (vrate > vrate_max) { + vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT), + 100); + vrate = max(vrate, vrate_max); + } else { + int idx = min_t(int, abs(ioc->busy_level), + ARRAY_SIZE(vrate_adj_pct) - 1); + u32 adj_pct = vrate_adj_pct[idx]; + + if (ioc->busy_level > 0) + adj_pct = 100 - adj_pct; + else + adj_pct = 100 + adj_pct; + + vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100), + vrate_min, vrate_max); + } + + trace_iocost_ioc_vrate_adj(ioc, vrate, &missed_ppm, rq_wait_pct, + nr_lagging, nr_shortages, + nr_surpluses); + + atomic64_set(&ioc->vtime_rate, vrate); + ioc->inuse_margin_vtime = DIV64_U64_ROUND_UP( + ioc->period_us * vrate * INUSE_MARGIN_PCT, 100); + } + + ioc_refresh_params(ioc, false); + + /* + * This period is done. Move onto the next one. If nothing's + * going on with the device, stop the timer. + */ + atomic64_inc(&ioc->cur_period); + + if (ioc->running != IOC_STOP) { + if (!list_empty(&ioc->active_iocgs)) { + ioc_start_period(ioc, &now); + } else { + ioc->busy_level = 0; + ioc->running = IOC_IDLE; + } + } + + spin_unlock_irq(&ioc->lock); +} + +static void calc_vtime_cost_builtin(struct bio *bio, struct ioc_gq *iocg, + bool is_merge, u64 *costp) +{ + struct ioc *ioc = iocg->ioc; + u64 coef_seqio, coef_randio, coef_page; + u64 pages = max_t(u64, bio_sectors(bio) >> IOC_SECT_TO_PAGE_SHIFT, 1); + u64 seek_pages = 0; + u64 cost = 0; + + switch (bio_op(bio)) { + case REQ_OP_READ: + coef_seqio = ioc->params.lcoefs[LCOEF_RSEQIO]; + coef_randio = ioc->params.lcoefs[LCOEF_RRANDIO]; + coef_page = ioc->params.lcoefs[LCOEF_RPAGE]; + break; + case REQ_OP_WRITE: + coef_seqio = ioc->params.lcoefs[LCOEF_WSEQIO]; + coef_randio = ioc->params.lcoefs[LCOEF_WRANDIO]; + coef_page = ioc->params.lcoefs[LCOEF_WPAGE]; + break; + default: + goto out; + } + + if (iocg->cursor) { + seek_pages = abs(bio->bi_iter.bi_sector - iocg->cursor); + seek_pages >>= IOC_SECT_TO_PAGE_SHIFT; + } + + if (!is_merge) { + if (seek_pages > LCOEF_RANDIO_PAGES) { + cost += coef_randio; + } else { + cost += coef_seqio; + } + } + cost += pages * coef_page; +out: + *costp = cost; +} + +static u64 calc_vtime_cost(struct bio *bio, struct ioc_gq *iocg, bool is_merge) +{ + u64 cost; + + calc_vtime_cost_builtin(bio, iocg, is_merge, &cost); + return cost; +} + +static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio) +{ + struct blkcg_gq *blkg = bio->bi_blkg; + struct ioc *ioc = rqos_to_ioc(rqos); + struct ioc_gq *iocg = blkg_to_iocg(blkg); + struct ioc_now now; + struct iocg_wait wait; + u32 hw_active, hw_inuse; + u64 abs_cost, cost, vtime; + + /* bypass IOs if disabled or for root cgroup */ + if (!ioc->enabled || !iocg->level) + return; + + /* always activate so that even 0 cost IOs get protected to some level */ + if (!iocg_activate(iocg, &now)) + return; + + /* calculate the absolute vtime cost */ + abs_cost = calc_vtime_cost(bio, iocg, false); + if (!abs_cost) + return; + + iocg->cursor = bio_end_sector(bio); + + vtime = atomic64_read(&iocg->vtime); + current_hweight(iocg, &hw_active, &hw_inuse); + + if (hw_inuse < hw_active && + time_after_eq64(vtime + ioc->inuse_margin_vtime, now.vnow)) { + TRACE_IOCG_PATH(inuse_reset, iocg, &now, + iocg->inuse, iocg->weight, hw_inuse, hw_active); + spin_lock_irq(&ioc->lock); + propagate_active_weight(iocg, iocg->weight, iocg->weight); + spin_unlock_irq(&ioc->lock); + current_hweight(iocg, &hw_active, &hw_inuse); + } + + cost = abs_cost_to_cost(abs_cost, hw_inuse); + + /* + * If no one's waiting and within budget, issue right away. The + * tests are racy but the races aren't systemic - we only miss once + * in a while which is fine. + */ + if (!waitqueue_active(&iocg->waitq) && + !atomic64_read(&iocg->abs_vdebt) && + time_before_eq64(vtime + cost, now.vnow)) { + iocg_commit_bio(iocg, bio, cost); + return; + } + + /* + * We're over budget. If @bio has to be issued regardless, + * remember the abs_cost instead of advancing vtime. + * iocg_kick_waitq() will pay off the debt before waking more IOs. + * This way, the debt is continuously paid off each period with the + * actual budget available to the cgroup. If we just wound vtime, + * we would incorrectly use the current hw_inuse for the entire + * amount which, for example, can lead to the cgroup staying + * blocked for a long time even with substantially raised hw_inuse. + */ + if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) { + atomic64_add(abs_cost, &iocg->abs_vdebt); + iocg_kick_delay(iocg, &now, cost); + return; + } + + /* + * Append self to the waitq and schedule the wakeup timer if we're + * the first waiter. The timer duration is calculated based on the + * current vrate. vtime and hweight changes can make it too short + * or too long. Each wait entry records the absolute cost it's + * waiting for to allow re-evaluation using a custom wait entry. + * + * If too short, the timer simply reschedules itself. If too long, + * the period timer will notice and trigger wakeups. + * + * All waiters are on iocg->waitq and the wait states are + * synchronized using waitq.lock. + */ + spin_lock_irq(&iocg->waitq.lock); + + /* + * We activated above but w/o any synchronization. Deactivation is + * synchronized with waitq.lock and we won't get deactivated as + * long as we're waiting, so we're good if we're activated here. + * In the unlikely case that we are deactivated, just issue the IO. + */ + if (unlikely(list_empty(&iocg->active_list))) { + spin_unlock_irq(&iocg->waitq.lock); + iocg_commit_bio(iocg, bio, cost); + return; + } + + init_waitqueue_func_entry(&wait.wait, iocg_wake_fn); + wait.wait.private = current; + wait.bio = bio; + wait.abs_cost = abs_cost; + wait.committed = false; /* will be set true by waker */ + + __add_wait_queue_entry_tail(&iocg->waitq, &wait.wait); + iocg_kick_waitq(iocg, &now); + + spin_unlock_irq(&iocg->waitq.lock); + + while (true) { + set_current_state(TASK_UNINTERRUPTIBLE); + if (wait.committed) + break; + io_schedule(); + } + + /* waker already committed us, proceed */ + finish_wait(&iocg->waitq, &wait.wait); +} + +static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq, + struct bio *bio) +{ + struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg); + struct ioc *ioc = iocg->ioc; + sector_t bio_end = bio_end_sector(bio); + struct ioc_now now; + u32 hw_inuse; + u64 abs_cost, cost; + + /* bypass if disabled or for root cgroup */ + if (!ioc->enabled || !iocg->level) + return; + + abs_cost = calc_vtime_cost(bio, iocg, true); + if (!abs_cost) + return; + + ioc_now(ioc, &now); + current_hweight(iocg, NULL, &hw_inuse); + cost = abs_cost_to_cost(abs_cost, hw_inuse); + + /* update cursor if backmerging into the request at the cursor */ + if (blk_rq_pos(rq) < bio_end && + blk_rq_pos(rq) + blk_rq_sectors(rq) == iocg->cursor) + iocg->cursor = bio_end; + + /* + * Charge if there's enough vtime budget and the existing request + * has cost assigned. Otherwise, account it as debt. See debt + * handling in ioc_rqos_throttle() for details. + */ + if (rq->bio && rq->bio->bi_iocost_cost && + time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) + iocg_commit_bio(iocg, bio, cost); + else + atomic64_add(abs_cost, &iocg->abs_vdebt); +} + +static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio) +{ + struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg); + + if (iocg && bio->bi_iocost_cost) + atomic64_add(bio->bi_iocost_cost, &iocg->done_vtime); +} + +static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq) +{ + struct ioc *ioc = rqos_to_ioc(rqos); + u64 on_q_ns, rq_wait_ns; + int pidx, rw; + + if (!ioc->enabled || !rq->alloc_time_ns || !rq->start_time_ns) + return; + + switch (req_op(rq) & REQ_OP_MASK) { + case REQ_OP_READ: + pidx = QOS_RLAT; + rw = READ; + break; + case REQ_OP_WRITE: + pidx = QOS_WLAT; + rw = WRITE; + break; + default: + return; + } + + on_q_ns = ktime_get_ns() - rq->alloc_time_ns; + rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns; + + if (on_q_ns <= ioc->params.qos[pidx] * NSEC_PER_USEC) + this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_met); + else + this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_missed); + + this_cpu_add(ioc->pcpu_stat->rq_wait_ns, rq_wait_ns); +} + +static void ioc_rqos_queue_depth_changed(struct rq_qos *rqos) +{ + struct ioc *ioc = rqos_to_ioc(rqos); + + spin_lock_irq(&ioc->lock); + ioc_refresh_params(ioc, false); + spin_unlock_irq(&ioc->lock); +} + +static void ioc_rqos_exit(struct rq_qos *rqos) +{ + struct ioc *ioc = rqos_to_ioc(rqos); + + blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost); + + spin_lock_irq(&ioc->lock); + ioc->running = IOC_STOP; + spin_unlock_irq(&ioc->lock); + + del_timer_sync(&ioc->timer); + free_percpu(ioc->pcpu_stat); + kfree(ioc); +} + +static struct rq_qos_ops ioc_rqos_ops = { + .throttle = ioc_rqos_throttle, + .merge = ioc_rqos_merge, + .done_bio = ioc_rqos_done_bio, + .done = ioc_rqos_done, + .queue_depth_changed = ioc_rqos_queue_depth_changed, + .exit = ioc_rqos_exit, +}; + +static int blk_iocost_init(struct request_queue *q) +{ + struct ioc *ioc; + struct rq_qos *rqos; + int ret; + + ioc = kzalloc(sizeof(*ioc), GFP_KERNEL); + if (!ioc) + return -ENOMEM; + + ioc->pcpu_stat = alloc_percpu(struct ioc_pcpu_stat); + if (!ioc->pcpu_stat) { + kfree(ioc); + return -ENOMEM; + } + + rqos = &ioc->rqos; + rqos->id = RQ_QOS_COST; + rqos->ops = &ioc_rqos_ops; + rqos->q = q; + + spin_lock_init(&ioc->lock); + timer_setup(&ioc->timer, ioc_timer_fn, 0); + INIT_LIST_HEAD(&ioc->active_iocgs); + + ioc->running = IOC_IDLE; + atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC); + seqcount_init(&ioc->period_seqcount); + ioc->period_at = ktime_to_us(ktime_get()); + atomic64_set(&ioc->cur_period, 0); + atomic_set(&ioc->hweight_gen, 0); + + spin_lock_irq(&ioc->lock); + ioc->autop_idx = AUTOP_INVALID; + ioc_refresh_params(ioc, true); + spin_unlock_irq(&ioc->lock); + + rq_qos_add(q, rqos); + ret = blkcg_activate_policy(q, &blkcg_policy_iocost); + if (ret) { + rq_qos_del(q, rqos); + free_percpu(ioc->pcpu_stat); + kfree(ioc); + return ret; + } + return 0; +} + +static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp) +{ + struct ioc_cgrp *iocc; + + iocc = kzalloc(sizeof(struct ioc_cgrp), gfp); + if (!iocc) + return NULL; + + iocc->dfl_weight = CGROUP_WEIGHT_DFL; + return &iocc->cpd; +} + +static void ioc_cpd_free(struct blkcg_policy_data *cpd) +{ + kfree(container_of(cpd, struct ioc_cgrp, cpd)); +} + +static struct blkg_policy_data *ioc_pd_alloc(gfp_t gfp, struct request_queue *q, + struct blkcg *blkcg) +{ + int levels = blkcg->css.cgroup->level + 1; + struct ioc_gq *iocg; + + iocg = kzalloc_node(sizeof(*iocg) + levels * sizeof(iocg->ancestors[0]), + gfp, q->node); + if (!iocg) + return NULL; + + return &iocg->pd; +} + +static void ioc_pd_init(struct blkg_policy_data *pd) +{ + struct ioc_gq *iocg = pd_to_iocg(pd); + struct blkcg_gq *blkg = pd_to_blkg(&iocg->pd); + struct ioc *ioc = q_to_ioc(blkg->q); + struct ioc_now now; + struct blkcg_gq *tblkg; + unsigned long flags; + + ioc_now(ioc, &now); + + iocg->ioc = ioc; + atomic64_set(&iocg->vtime, now.vnow); + atomic64_set(&iocg->done_vtime, now.vnow); + atomic64_set(&iocg->abs_vdebt, 0); + atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period)); + INIT_LIST_HEAD(&iocg->active_list); + iocg->hweight_active = HWEIGHT_WHOLE; + iocg->hweight_inuse = HWEIGHT_WHOLE; + + init_waitqueue_head(&iocg->waitq); + hrtimer_init(&iocg->waitq_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); + iocg->waitq_timer.function = iocg_waitq_timer_fn; + hrtimer_init(&iocg->delay_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); + iocg->delay_timer.function = iocg_delay_timer_fn; + + iocg->level = blkg->blkcg->css.cgroup->level; + + for (tblkg = blkg; tblkg; tblkg = tblkg->parent) { + struct ioc_gq *tiocg = blkg_to_iocg(tblkg); + iocg->ancestors[tiocg->level] = tiocg; + } + + spin_lock_irqsave(&ioc->lock, flags); + weight_updated(iocg); + spin_unlock_irqrestore(&ioc->lock, flags); +} + +static void ioc_pd_free(struct blkg_policy_data *pd) +{ + struct ioc_gq *iocg = pd_to_iocg(pd); + struct ioc *ioc = iocg->ioc; + + if (ioc) { + spin_lock(&ioc->lock); + if (!list_empty(&iocg->active_list)) { + propagate_active_weight(iocg, 0, 0); + list_del_init(&iocg->active_list); + } + spin_unlock(&ioc->lock); + + hrtimer_cancel(&iocg->waitq_timer); + hrtimer_cancel(&iocg->delay_timer); + } + kfree(iocg); +} + +static u64 ioc_weight_prfill(struct seq_file *sf, struct blkg_policy_data *pd, + int off) +{ + const char *dname = blkg_dev_name(pd->blkg); + struct ioc_gq *iocg = pd_to_iocg(pd); + + if (dname && iocg->cfg_weight) + seq_printf(sf, "%s %u\n", dname, iocg->cfg_weight); + return 0; +} + + +static int ioc_weight_show(struct seq_file *sf, void *v) +{ + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); + struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg); + + seq_printf(sf, "default %u\n", iocc->dfl_weight); + blkcg_print_blkgs(sf, blkcg, ioc_weight_prfill, + &blkcg_policy_iocost, seq_cft(sf)->private, false); + return 0; +} + +static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct blkcg *blkcg = css_to_blkcg(of_css(of)); + struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg); + struct blkg_conf_ctx ctx; + struct ioc_gq *iocg; + u32 v; + int ret; + + if (!strchr(buf, ':')) { + struct blkcg_gq *blkg; + + if (!sscanf(buf, "default %u", &v) && !sscanf(buf, "%u", &v)) + return -EINVAL; + + if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX) + return -EINVAL; + + spin_lock(&blkcg->lock); + iocc->dfl_weight = v; + hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) { + struct ioc_gq *iocg = blkg_to_iocg(blkg); + + if (iocg) { + spin_lock_irq(&iocg->ioc->lock); + weight_updated(iocg); + spin_unlock_irq(&iocg->ioc->lock); + } + } + spin_unlock(&blkcg->lock); + + return nbytes; + } + + ret = blkg_conf_prep(blkcg, &blkcg_policy_iocost, buf, &ctx); + if (ret) + return ret; + + iocg = blkg_to_iocg(ctx.blkg); + + if (!strncmp(ctx.body, "default", 7)) { + v = 0; + } else { + if (!sscanf(ctx.body, "%u", &v)) + goto einval; + if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX) + goto einval; + } + + spin_lock_irq(&iocg->ioc->lock); + iocg->cfg_weight = v; + weight_updated(iocg); + spin_unlock_irq(&iocg->ioc->lock); + + blkg_conf_finish(&ctx); + return nbytes; + +einval: + blkg_conf_finish(&ctx); + return -EINVAL; +} + +static u64 ioc_qos_prfill(struct seq_file *sf, struct blkg_policy_data *pd, + int off) +{ + const char *dname = blkg_dev_name(pd->blkg); + struct ioc *ioc = pd_to_iocg(pd)->ioc; + + if (!dname) + return 0; + + seq_printf(sf, "%s enable=%d ctrl=%s rpct=%u.%02u rlat=%u wpct=%u.%02u wlat=%u min=%u.%02u max=%u.%02u\n", + dname, ioc->enabled, ioc->user_qos_params ? "user" : "auto", + ioc->params.qos[QOS_RPPM] / 10000, + ioc->params.qos[QOS_RPPM] % 10000 / 100, + ioc->params.qos[QOS_RLAT], + ioc->params.qos[QOS_WPPM] / 10000, + ioc->params.qos[QOS_WPPM] % 10000 / 100, + ioc->params.qos[QOS_WLAT], + ioc->params.qos[QOS_MIN] / 10000, + ioc->params.qos[QOS_MIN] % 10000 / 100, + ioc->params.qos[QOS_MAX] / 10000, + ioc->params.qos[QOS_MAX] % 10000 / 100); + return 0; +} + +static int ioc_qos_show(struct seq_file *sf, void *v) +{ + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); + + blkcg_print_blkgs(sf, blkcg, ioc_qos_prfill, + &blkcg_policy_iocost, seq_cft(sf)->private, false); + return 0; +} + +static const match_table_t qos_ctrl_tokens = { + { QOS_ENABLE, "enable=%u" }, + { QOS_CTRL, "ctrl=%s" }, + { NR_QOS_CTRL_PARAMS, NULL }, +}; + +static const match_table_t qos_tokens = { + { QOS_RPPM, "rpct=%s" }, + { QOS_RLAT, "rlat=%u" }, + { QOS_WPPM, "wpct=%s" }, + { QOS_WLAT, "wlat=%u" }, + { QOS_MIN, "min=%s" }, + { QOS_MAX, "max=%s" }, + { NR_QOS_PARAMS, NULL }, +}; + +static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input, + size_t nbytes, loff_t off) +{ + struct gendisk *disk; + struct ioc *ioc; + u32 qos[NR_QOS_PARAMS]; + bool enable, user; + char *p; + int ret; + + disk = blkcg_conf_get_disk(&input); + if (IS_ERR(disk)) + return PTR_ERR(disk); + + ioc = q_to_ioc(disk->queue); + if (!ioc) { + ret = blk_iocost_init(disk->queue); + if (ret) + goto err; + ioc = q_to_ioc(disk->queue); + } + + spin_lock_irq(&ioc->lock); + memcpy(qos, ioc->params.qos, sizeof(qos)); + enable = ioc->enabled; + user = ioc->user_qos_params; + spin_unlock_irq(&ioc->lock); + + while ((p = strsep(&input, " \t\n"))) { + substring_t args[MAX_OPT_ARGS]; + char buf[32]; + int tok; + s64 v; + + if (!*p) + continue; + + switch (match_token(p, qos_ctrl_tokens, args)) { + case QOS_ENABLE: + match_u64(&args[0], &v); + enable = v; + continue; + case QOS_CTRL: + match_strlcpy(buf, &args[0], sizeof(buf)); + if (!strcmp(buf, "auto")) + user = false; + else if (!strcmp(buf, "user")) + user = true; + else + goto einval; + continue; + } + + tok = match_token(p, qos_tokens, args); + switch (tok) { + case QOS_RPPM: + case QOS_WPPM: + if (match_strlcpy(buf, &args[0], sizeof(buf)) >= + sizeof(buf)) + goto einval; + if (cgroup_parse_float(buf, 2, &v)) + goto einval; + if (v < 0 || v > 10000) + goto einval; + qos[tok] = v * 100; + break; + case QOS_RLAT: + case QOS_WLAT: + if (match_u64(&args[0], &v)) + goto einval; + qos[tok] = v; + break; + case QOS_MIN: + case QOS_MAX: + if (match_strlcpy(buf, &args[0], sizeof(buf)) >= + sizeof(buf)) + goto einval; + if (cgroup_parse_float(buf, 2, &v)) + goto einval; + if (v < 0) + goto einval; + qos[tok] = clamp_t(s64, v * 100, + VRATE_MIN_PPM, VRATE_MAX_PPM); + break; + default: + goto einval; + } + user = true; + } + + if (qos[QOS_MIN] > qos[QOS_MAX]) + goto einval; + + spin_lock_irq(&ioc->lock); + + if (enable) { + blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q); + ioc->enabled = true; + } else { + blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q); + ioc->enabled = false; + } + + if (user) { + memcpy(ioc->params.qos, qos, sizeof(qos)); + ioc->user_qos_params = true; + } else { + ioc->user_qos_params = false; + } + + ioc_refresh_params(ioc, true); + spin_unlock_irq(&ioc->lock); + + put_disk_and_module(disk); + return nbytes; +einval: + ret = -EINVAL; +err: + put_disk_and_module(disk); + return ret; +} + +static u64 ioc_cost_model_prfill(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + const char *dname = blkg_dev_name(pd->blkg); + struct ioc *ioc = pd_to_iocg(pd)->ioc; + u64 *u = ioc->params.i_lcoefs; + + if (!dname) + return 0; + + seq_printf(sf, "%s ctrl=%s model=linear " + "rbps=%llu rseqiops=%llu rrandiops=%llu " + "wbps=%llu wseqiops=%llu wrandiops=%llu\n", + dname, ioc->user_cost_model ? "user" : "auto", + u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS], + u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS]); + return 0; +} + +static int ioc_cost_model_show(struct seq_file *sf, void *v) +{ + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); + + blkcg_print_blkgs(sf, blkcg, ioc_cost_model_prfill, + &blkcg_policy_iocost, seq_cft(sf)->private, false); + return 0; +} + +static const match_table_t cost_ctrl_tokens = { + { COST_CTRL, "ctrl=%s" }, + { COST_MODEL, "model=%s" }, + { NR_COST_CTRL_PARAMS, NULL }, +}; + +static const match_table_t i_lcoef_tokens = { + { I_LCOEF_RBPS, "rbps=%u" }, + { I_LCOEF_RSEQIOPS, "rseqiops=%u" }, + { I_LCOEF_RRANDIOPS, "rrandiops=%u" }, + { I_LCOEF_WBPS, "wbps=%u" }, + { I_LCOEF_WSEQIOPS, "wseqiops=%u" }, + { I_LCOEF_WRANDIOPS, "wrandiops=%u" }, + { NR_I_LCOEFS, NULL }, +}; + +static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input, + size_t nbytes, loff_t off) +{ + struct gendisk *disk; + struct ioc *ioc; + u64 u[NR_I_LCOEFS]; + bool user; + char *p; + int ret; + + disk = blkcg_conf_get_disk(&input); + if (IS_ERR(disk)) + return PTR_ERR(disk); + + ioc = q_to_ioc(disk->queue); + if (!ioc) { + ret = blk_iocost_init(disk->queue); + if (ret) + goto err; + ioc = q_to_ioc(disk->queue); + } + + spin_lock_irq(&ioc->lock); + memcpy(u, ioc->params.i_lcoefs, sizeof(u)); + user = ioc->user_cost_model; + spin_unlock_irq(&ioc->lock); + + while ((p = strsep(&input, " \t\n"))) { + substring_t args[MAX_OPT_ARGS]; + char buf[32]; + int tok; + u64 v; + + if (!*p) + continue; + + switch (match_token(p, cost_ctrl_tokens, args)) { + case COST_CTRL: + match_strlcpy(buf, &args[0], sizeof(buf)); + if (!strcmp(buf, "auto")) + user = false; + else if (!strcmp(buf, "user")) + user = true; + else + goto einval; + continue; + case COST_MODEL: + match_strlcpy(buf, &args[0], sizeof(buf)); + if (strcmp(buf, "linear")) + goto einval; + continue; + } + + tok = match_token(p, i_lcoef_tokens, args); + if (tok == NR_I_LCOEFS) + goto einval; + if (match_u64(&args[0], &v)) + goto einval; + u[tok] = v; + user = true; + } + + spin_lock_irq(&ioc->lock); + if (user) { + memcpy(ioc->params.i_lcoefs, u, sizeof(u)); + ioc->user_cost_model = true; + } else { + ioc->user_cost_model = false; + } + ioc_refresh_params(ioc, true); + spin_unlock_irq(&ioc->lock); + + put_disk_and_module(disk); + return nbytes; + +einval: + ret = -EINVAL; +err: + put_disk_and_module(disk); + return ret; +} + +static struct cftype ioc_files[] = { + { + .name = "weight", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = ioc_weight_show, + .write = ioc_weight_write, + }, + { + .name = "cost.qos", + .flags = CFTYPE_ONLY_ON_ROOT, + .seq_show = ioc_qos_show, + .write = ioc_qos_write, + }, + { + .name = "cost.model", + .flags = CFTYPE_ONLY_ON_ROOT, + .seq_show = ioc_cost_model_show, + .write = ioc_cost_model_write, + }, + {} +}; + +static struct blkcg_policy blkcg_policy_iocost = { + .dfl_cftypes = ioc_files, + .cpd_alloc_fn = ioc_cpd_alloc, + .cpd_free_fn = ioc_cpd_free, + .pd_alloc_fn = ioc_pd_alloc, + .pd_init_fn = ioc_pd_init, + .pd_free_fn = ioc_pd_free, +}; + +static int __init ioc_init(void) +{ + return blkcg_policy_register(&blkcg_policy_iocost); +} + +static void __exit ioc_exit(void) +{ + return blkcg_policy_unregister(&blkcg_policy_iocost); +} + +module_init(ioc_init); +module_exit(ioc_exit); diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c index 0fff7b56df0e..c128d50cb410 100644 --- a/block/blk-iolatency.c +++ b/block/blk-iolatency.c @@ -725,7 +725,7 @@ int blk_iolatency_init(struct request_queue *q) return -ENOMEM; rqos = &blkiolat->rqos; - rqos->id = RQ_QOS_CGROUP; + rqos->id = RQ_QOS_LATENCY; rqos->ops = &blkcg_iolatency_ops; rqos->q = q; @@ -934,11 +934,13 @@ static size_t iolatency_pd_stat(struct blkg_policy_data *pd, char *buf, } -static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, int node) +static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, + struct request_queue *q, + struct blkcg *blkcg) { struct iolatency_grp *iolat; - iolat = kzalloc_node(sizeof(*iolat), gfp, node); + iolat = kzalloc_node(sizeof(*iolat), gfp, q->node); if (!iolat) return NULL; iolat->stats = __alloc_percpu_gfp(sizeof(struct latency_stat), diff --git a/block/blk-merge.c b/block/blk-merge.c index 57f7990b342d..48e6725b32ee 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -132,19 +132,32 @@ static struct bio *blk_bio_write_same_split(struct request_queue *q, return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs); } +/* + * Return the maximum number of sectors from the start of a bio that may be + * submitted as a single request to a block device. If enough sectors remain, + * align the end to the physical block size. Otherwise align the end to the + * logical block size. This approach minimizes the number of non-aligned + * requests that are submitted to a block device if the start of a bio is not + * aligned to a physical block boundary. + */ static inline unsigned get_max_io_size(struct request_queue *q, struct bio *bio) { unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector); - unsigned mask = queue_logical_block_size(q) - 1; + unsigned max_sectors = sectors; + unsigned pbs = queue_physical_block_size(q) >> SECTOR_SHIFT; + unsigned lbs = queue_logical_block_size(q) >> SECTOR_SHIFT; + unsigned start_offset = bio->bi_iter.bi_sector & (pbs - 1); - /* aligned to logical block size */ - sectors &= ~(mask >> 9); + max_sectors += start_offset; + max_sectors &= ~(pbs - 1); + if (max_sectors > start_offset) + return max_sectors - start_offset; - return sectors; + return sectors & (lbs - 1); } -static unsigned get_max_segment_size(struct request_queue *q, +static unsigned get_max_segment_size(const struct request_queue *q, unsigned offset) { unsigned long mask = queue_segment_boundary(q); @@ -157,26 +170,41 @@ static unsigned get_max_segment_size(struct request_queue *q, queue_max_segment_size(q)); } -/* - * Split the bvec @bv into segments, and update all kinds of - * variables. +/** + * bvec_split_segs - verify whether or not a bvec should be split in the middle + * @q: [in] request queue associated with the bio associated with @bv + * @bv: [in] bvec to examine + * @nsegs: [in,out] Number of segments in the bio being built. Incremented + * by the number of segments from @bv that may be appended to that + * bio without exceeding @max_segs + * @sectors: [in,out] Number of sectors in the bio being built. Incremented + * by the number of sectors from @bv that may be appended to that + * bio without exceeding @max_sectors + * @max_segs: [in] upper bound for *@nsegs + * @max_sectors: [in] upper bound for *@sectors + * + * When splitting a bio, it can happen that a bvec is encountered that is too + * big to fit in a single segment and hence that it has to be split in the + * middle. This function verifies whether or not that should happen. The value + * %true is returned if and only if appending the entire @bv to a bio with + * *@nsegs segments and *@sectors sectors would make that bio unacceptable for + * the block driver. */ -static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv, - unsigned *nsegs, unsigned *sectors, unsigned max_segs) +static bool bvec_split_segs(const struct request_queue *q, + const struct bio_vec *bv, unsigned *nsegs, + unsigned *sectors, unsigned max_segs, + unsigned max_sectors) { - unsigned len = bv->bv_len; + unsigned max_len = (min(max_sectors, UINT_MAX >> 9) - *sectors) << 9; + unsigned len = min(bv->bv_len, max_len); unsigned total_len = 0; - unsigned new_nsegs = 0, seg_size = 0; + unsigned seg_size = 0; - /* - * Multi-page bvec may be too big to hold in one segment, so the - * current bvec has to be splitted as multiple segments. - */ - while (len && new_nsegs + *nsegs < max_segs) { + while (len && *nsegs < max_segs) { seg_size = get_max_segment_size(q, bv->bv_offset + total_len); seg_size = min(seg_size, len); - new_nsegs++; + (*nsegs)++; total_len += seg_size; len -= seg_size; @@ -184,16 +212,31 @@ static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv, break; } - if (new_nsegs) { - *nsegs += new_nsegs; - if (sectors) - *sectors += total_len >> 9; - } + *sectors += total_len >> 9; - /* split in the middle of the bvec if len != 0 */ - return !!len; + /* tell the caller to split the bvec if it is too big to fit */ + return len > 0 || bv->bv_len > max_len; } +/** + * blk_bio_segment_split - split a bio in two bios + * @q: [in] request queue pointer + * @bio: [in] bio to be split + * @bs: [in] bio set to allocate the clone from + * @segs: [out] number of segments in the bio with the first half of the sectors + * + * Clone @bio, update the bi_iter of the clone to represent the first sectors + * of @bio and update @bio->bi_iter to represent the remaining sectors. The + * following is guaranteed for the cloned bio: + * - That it has at most get_max_io_size(@q, @bio) sectors. + * - That it has at most queue_max_segments(@q) segments. + * + * Except for discard requests the cloned bio will point at the bi_io_vec of + * the original bio. It is the responsibility of the caller to ensure that the + * original bio is not freed before the cloned bio. The caller is also + * responsible for ensuring that @bs is only destroyed after processing of the + * split bio has finished. + */ static struct bio *blk_bio_segment_split(struct request_queue *q, struct bio *bio, struct bio_set *bs, @@ -213,34 +256,18 @@ static struct bio *blk_bio_segment_split(struct request_queue *q, if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset)) goto split; - if (sectors + (bv.bv_len >> 9) > max_sectors) { - /* - * Consider this a new segment if we're splitting in - * the middle of this vector. - */ - if (nsegs < max_segs && - sectors < max_sectors) { - /* split in the middle of bvec */ - bv.bv_len = (max_sectors - sectors) << 9; - bvec_split_segs(q, &bv, &nsegs, - §ors, max_segs); - } + if (nsegs < max_segs && + sectors + (bv.bv_len >> 9) <= max_sectors && + bv.bv_offset + bv.bv_len <= PAGE_SIZE) { + nsegs++; + sectors += bv.bv_len >> 9; + } else if (bvec_split_segs(q, &bv, &nsegs, §ors, max_segs, + max_sectors)) { goto split; } - if (nsegs == max_segs) - goto split; - bvprv = bv; bvprvp = &bvprv; - - if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) { - nsegs++; - sectors += bv.bv_len >> 9; - } else if (bvec_split_segs(q, &bv, &nsegs, §ors, - max_segs)) { - goto split; - } } *segs = nsegs; @@ -250,6 +277,19 @@ split: return bio_split(bio, sectors, GFP_NOIO, bs); } +/** + * __blk_queue_split - split a bio and submit the second half + * @q: [in] request queue pointer + * @bio: [in, out] bio to be split + * @nr_segs: [out] number of segments in the first bio + * + * Split a bio into two bios, chain the two bios, submit the second half and + * store a pointer to the first half in *@bio. If the second bio is still too + * big it will be split by a recursive call to this function. Since this + * function may allocate a new bio from @q->bio_split, it is the responsibility + * of the caller to ensure that @q is only released after processing of the + * split bio has finished. + */ void __blk_queue_split(struct request_queue *q, struct bio **bio, unsigned int *nr_segs) { @@ -294,6 +334,17 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio, } } +/** + * blk_queue_split - split a bio and submit the second half + * @q: [in] request queue pointer + * @bio: [in, out] bio to be split + * + * Split a bio into two bios, chains the two bios, submit the second half and + * store a pointer to the first half in *@bio. Since this function may allocate + * a new bio from @q->bio_split, it is the responsibility of the caller to + * ensure that @q is only released after processing of the split bio has + * finished. + */ void blk_queue_split(struct request_queue *q, struct bio **bio) { unsigned int nr_segs; @@ -305,6 +356,7 @@ EXPORT_SYMBOL(blk_queue_split); unsigned int blk_recalc_rq_segments(struct request *rq) { unsigned int nr_phys_segs = 0; + unsigned int nr_sectors = 0; struct req_iterator iter; struct bio_vec bv; @@ -321,7 +373,8 @@ unsigned int blk_recalc_rq_segments(struct request *rq) } rq_for_each_bvec(bv, rq, iter) - bvec_split_segs(rq->q, &bv, &nr_phys_segs, NULL, UINT_MAX); + bvec_split_segs(rq->q, &bv, &nr_phys_segs, &nr_sectors, + UINT_MAX, UINT_MAX); return nr_phys_segs; } diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c index f945621a0e8f..0157f2b3485a 100644 --- a/block/blk-mq-cpumap.c +++ b/block/blk-mq-cpumap.c @@ -15,10 +15,10 @@ #include "blk.h" #include "blk-mq.h" -static int cpu_to_queue_index(struct blk_mq_queue_map *qmap, - unsigned int nr_queues, const int cpu) +static int queue_index(struct blk_mq_queue_map *qmap, + unsigned int nr_queues, const int q) { - return qmap->queue_offset + (cpu % nr_queues); + return qmap->queue_offset + (q % nr_queues); } static int get_first_sibling(unsigned int cpu) @@ -36,21 +36,36 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap) { unsigned int *map = qmap->mq_map; unsigned int nr_queues = qmap->nr_queues; - unsigned int cpu, first_sibling; + unsigned int cpu, first_sibling, q = 0; + + for_each_possible_cpu(cpu) + map[cpu] = -1; + + /* + * Spread queues among present CPUs first for minimizing + * count of dead queues which are mapped by all un-present CPUs + */ + for_each_present_cpu(cpu) { + if (q >= nr_queues) + break; + map[cpu] = queue_index(qmap, nr_queues, q++); + } for_each_possible_cpu(cpu) { + if (map[cpu] != -1) + continue; /* * First do sequential mapping between CPUs and queues. * In case we still have CPUs to map, and we have some number of * threads per cores then map sibling threads to the same queue * for performance optimizations. */ - if (cpu < nr_queues) { - map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu); + if (q < nr_queues) { + map[cpu] = queue_index(qmap, nr_queues, q++); } else { first_sibling = get_first_sibling(cpu); if (first_sibling == cpu) - map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu); + map[cpu] = queue_index(qmap, nr_queues, q++); else map[cpu] = map[first_sibling]; } diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index d6e1a9bd7131..a0d3ce30fa08 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -270,7 +270,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i; - lockdep_assert_held(&q->sysfs_lock); + lockdep_assert_held(&q->sysfs_dir_lock); queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); @@ -320,7 +320,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) int ret, i; WARN_ON_ONCE(!q->kobj.parent); - lockdep_assert_held(&q->sysfs_lock); + lockdep_assert_held(&q->sysfs_dir_lock); ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq"); if (ret < 0) @@ -349,23 +349,12 @@ unreg: return ret; } -int blk_mq_register_dev(struct device *dev, struct request_queue *q) -{ - int ret; - - mutex_lock(&q->sysfs_lock); - ret = __blk_mq_register_dev(dev, q); - mutex_unlock(&q->sysfs_lock); - - return ret; -} - void blk_mq_sysfs_unregister(struct request_queue *q) { struct blk_mq_hw_ctx *hctx; int i; - mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); if (!q->mq_sysfs_init_done) goto unlock; @@ -373,7 +362,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q) blk_mq_unregister_hctx(hctx); unlock: - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); } int blk_mq_sysfs_register(struct request_queue *q) @@ -381,7 +370,7 @@ int blk_mq_sysfs_register(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i, ret = 0; - mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); if (!q->mq_sysfs_init_done) goto unlock; @@ -392,7 +381,7 @@ int blk_mq_sysfs_register(struct request_queue *q) } unlock: - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); return ret; } diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index da19f0bc8876..008388e82b5c 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -10,6 +10,7 @@ #include <linux/module.h> #include <linux/blk-mq.h> +#include <linux/delay.h> #include "blk.h" #include "blk-mq.h" #include "blk-mq-tag.h" @@ -354,6 +355,37 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, } EXPORT_SYMBOL(blk_mq_tagset_busy_iter); +static bool blk_mq_tagset_count_completed_rqs(struct request *rq, + void *data, bool reserved) +{ + unsigned *count = data; + + if (blk_mq_request_completed(rq)) + (*count)++; + return true; +} + +/** + * blk_mq_tagset_wait_completed_request - wait until all completed req's + * complete funtion is run + * @tagset: Tag set to drain completed request + * + * Note: This function has to be run after all IO queues are shutdown + */ +void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset) +{ + while (true) { + unsigned count = 0; + + blk_mq_tagset_busy_iter(tagset, + blk_mq_tagset_count_completed_rqs, &count); + if (!count) + break; + msleep(5); + } +} +EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request); + /** * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag * @q: Request queue to examine. diff --git a/block/blk-mq.c b/block/blk-mq.c index e0b849bfe74d..20a49be536b5 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -44,12 +44,12 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb); static int blk_mq_poll_stats_bkt(const struct request *rq) { - int ddir, bytes, bucket; + int ddir, sectors, bucket; ddir = rq_data_dir(rq); - bytes = blk_rq_bytes(rq); + sectors = blk_rq_stats_sectors(rq); - bucket = ddir + 2*(ilog2(bytes) - 9); + bucket = ddir + 2 * ilog2(sectors); if (bucket < 0) return -1; @@ -282,16 +282,16 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx) EXPORT_SYMBOL(blk_mq_can_queue); /* - * Only need start/end time stamping if we have stats enabled, or using - * an IO scheduler. + * Only need start/end time stamping if we have iostat or + * blk stats enabled, or using an IO scheduler. */ static inline bool blk_mq_need_time_stamp(struct request *rq) { - return (rq->rq_flags & RQF_IO_STAT) || rq->q->elevator; + return (rq->rq_flags & (RQF_IO_STAT | RQF_STATS)) || rq->q->elevator; } static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, - unsigned int tag, unsigned int op) + unsigned int tag, unsigned int op, u64 alloc_time_ns) { struct blk_mq_tags *tags = blk_mq_tags_from_data(data); struct request *rq = tags->static_rqs[tag]; @@ -325,11 +325,15 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, RB_CLEAR_NODE(&rq->rb_node); rq->rq_disk = NULL; rq->part = NULL; +#ifdef CONFIG_BLK_RQ_ALLOC_TIME + rq->alloc_time_ns = alloc_time_ns; +#endif if (blk_mq_need_time_stamp(rq)) rq->start_time_ns = ktime_get_ns(); else rq->start_time_ns = 0; rq->io_start_time_ns = 0; + rq->stats_sectors = 0; rq->nr_phys_segments = 0; #if defined(CONFIG_BLK_DEV_INTEGRITY) rq->nr_integrity_segments = 0; @@ -356,8 +360,14 @@ static struct request *blk_mq_get_request(struct request_queue *q, struct request *rq; unsigned int tag; bool clear_ctx_on_error = false; + u64 alloc_time_ns = 0; blk_queue_enter_live(q); + + /* alloc_time includes depth and tag waits */ + if (blk_queue_rq_alloc_time(q)) + alloc_time_ns = ktime_get_ns(); + data->q = q; if (likely(!data->ctx)) { data->ctx = blk_mq_get_ctx(q); @@ -393,7 +403,7 @@ static struct request *blk_mq_get_request(struct request_queue *q, return NULL; } - rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags); + rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns); if (!op_is_flush(data->cmd_flags)) { rq->elv.icq = NULL; if (e && e->type->ops.prepare_request) { @@ -652,19 +662,18 @@ bool blk_mq_complete_request(struct request *rq) } EXPORT_SYMBOL(blk_mq_complete_request); -void blk_mq_complete_request_sync(struct request *rq) -{ - WRITE_ONCE(rq->state, MQ_RQ_COMPLETE); - rq->q->mq_ops->complete(rq); -} -EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync); - int blk_mq_request_started(struct request *rq) { return blk_mq_rq_state(rq) != MQ_RQ_IDLE; } EXPORT_SYMBOL_GPL(blk_mq_request_started); +int blk_mq_request_completed(struct request *rq) +{ + return blk_mq_rq_state(rq) == MQ_RQ_COMPLETE; +} +EXPORT_SYMBOL_GPL(blk_mq_request_completed); + void blk_mq_start_request(struct request *rq) { struct request_queue *q = rq->q; @@ -673,9 +682,7 @@ void blk_mq_start_request(struct request *rq) if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) { rq->io_start_time_ns = ktime_get_ns(); -#ifdef CONFIG_BLK_DEV_THROTTLING_LOW - rq->throtl_size = blk_rq_sectors(rq); -#endif + rq->stats_sectors = blk_rq_sectors(rq); rq->rq_flags |= RQF_STATS; rq_qos_issue(q, rq); } @@ -2453,11 +2460,6 @@ static void blk_mq_map_swqueue(struct request_queue *q) struct blk_mq_ctx *ctx; struct blk_mq_tag_set *set = q->tag_set; - /* - * Avoid others reading imcomplete hctx->cpumask through sysfs - */ - mutex_lock(&q->sysfs_lock); - queue_for_each_hw_ctx(q, hctx, i) { cpumask_clear(hctx->cpumask); hctx->nr_ctx = 0; @@ -2518,8 +2520,6 @@ static void blk_mq_map_swqueue(struct request_queue *q) HCTX_TYPE_DEFAULT, i); } - mutex_unlock(&q->sysfs_lock); - queue_for_each_hw_ctx(q, hctx, i) { /* * If no software queues are mapped to this hardware queue, @@ -2688,7 +2688,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set) if (!uninit_q) return ERR_PTR(-ENOMEM); - q = blk_mq_init_allocated_queue(set, uninit_q); + /* + * Initialize the queue without an elevator. device_add_disk() will do + * the initialization. + */ + q = blk_mq_init_allocated_queue(set, uninit_q, false); if (IS_ERR(q)) blk_cleanup_queue(uninit_q); @@ -2839,7 +2843,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set) } struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, - struct request_queue *q) + struct request_queue *q, + bool elevator_init) { /* mark the queue as mq asap */ q->mq_ops = set->ops; @@ -2901,18 +2906,14 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, blk_mq_add_queue_tag_set(set, q); blk_mq_map_swqueue(q); - if (!(set->flags & BLK_MQ_F_NO_SCHED)) { - int ret; - - ret = elevator_init_mq(q); - if (ret) - return ERR_PTR(ret); - } + if (elevator_init) + elevator_init_mq(q); return q; err_hctxs: kfree(q->queue_hw_ctx); + q->nr_hw_queues = 0; err_sys_init: blk_mq_sysfs_deinit(q); err_poll: diff --git a/block/blk-pm.c b/block/blk-pm.c index 0a028c189897..1adc1cd748b4 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -207,10 +207,12 @@ EXPORT_SYMBOL(blk_post_runtime_resume); */ void blk_set_runtime_active(struct request_queue *q) { - spin_lock_irq(&q->queue_lock); - q->rpm_status = RPM_ACTIVE; - pm_runtime_mark_last_busy(q->dev); - pm_request_autosuspend(q->dev); - spin_unlock_irq(&q->queue_lock); + if (q->dev) { + spin_lock_irq(&q->queue_lock); + q->rpm_status = RPM_ACTIVE; + pm_runtime_mark_last_busy(q->dev); + pm_request_autosuspend(q->dev); + spin_unlock_irq(&q->queue_lock); + } } EXPORT_SYMBOL(blk_set_runtime_active); diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c index 3954c0dc1443..61b635bc2a31 100644 --- a/block/blk-rq-qos.c +++ b/block/blk-rq-qos.c @@ -83,6 +83,15 @@ void __rq_qos_track(struct rq_qos *rqos, struct request *rq, struct bio *bio) } while (rqos); } +void __rq_qos_merge(struct rq_qos *rqos, struct request *rq, struct bio *bio) +{ + do { + if (rqos->ops->merge) + rqos->ops->merge(rqos, rq, bio); + rqos = rqos->next; + } while (rqos); +} + void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio) { do { @@ -92,6 +101,15 @@ void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio) } while (rqos); } +void __rq_qos_queue_depth_changed(struct rq_qos *rqos) +{ + do { + if (rqos->ops->queue_depth_changed) + rqos->ops->queue_depth_changed(rqos); + rqos = rqos->next; + } while (rqos); +} + /* * Return true, if we can't increase the depth further by scaling */ diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h index 2300e038b9fa..08a09dbe0f4b 100644 --- a/block/blk-rq-qos.h +++ b/block/blk-rq-qos.h @@ -14,7 +14,8 @@ struct blk_mq_debugfs_attr; enum rq_qos_id { RQ_QOS_WBT, - RQ_QOS_CGROUP, + RQ_QOS_LATENCY, + RQ_QOS_COST, }; struct rq_wait { @@ -35,11 +36,13 @@ struct rq_qos { struct rq_qos_ops { void (*throttle)(struct rq_qos *, struct bio *); void (*track)(struct rq_qos *, struct request *, struct bio *); + void (*merge)(struct rq_qos *, struct request *, struct bio *); void (*issue)(struct rq_qos *, struct request *); void (*requeue)(struct rq_qos *, struct request *); void (*done)(struct rq_qos *, struct request *); void (*done_bio)(struct rq_qos *, struct bio *); void (*cleanup)(struct rq_qos *, struct bio *); + void (*queue_depth_changed)(struct rq_qos *); void (*exit)(struct rq_qos *); const struct blk_mq_debugfs_attr *debugfs_attrs; }; @@ -72,7 +75,7 @@ static inline struct rq_qos *wbt_rq_qos(struct request_queue *q) static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q) { - return rq_qos_id(q, RQ_QOS_CGROUP); + return rq_qos_id(q, RQ_QOS_LATENCY); } static inline const char *rq_qos_id_to_name(enum rq_qos_id id) @@ -80,8 +83,10 @@ static inline const char *rq_qos_id_to_name(enum rq_qos_id id) switch (id) { case RQ_QOS_WBT: return "wbt"; - case RQ_QOS_CGROUP: - return "cgroup"; + case RQ_QOS_LATENCY: + return "latency"; + case RQ_QOS_COST: + return "cost"; } return "unknown"; } @@ -135,7 +140,9 @@ void __rq_qos_issue(struct rq_qos *rqos, struct request *rq); void __rq_qos_requeue(struct rq_qos *rqos, struct request *rq); void __rq_qos_throttle(struct rq_qos *rqos, struct bio *bio); void __rq_qos_track(struct rq_qos *rqos, struct request *rq, struct bio *bio); +void __rq_qos_merge(struct rq_qos *rqos, struct request *rq, struct bio *bio); void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio); +void __rq_qos_queue_depth_changed(struct rq_qos *rqos); static inline void rq_qos_cleanup(struct request_queue *q, struct bio *bio) { @@ -185,6 +192,19 @@ static inline void rq_qos_track(struct request_queue *q, struct request *rq, __rq_qos_track(q->rq_qos, rq, bio); } +static inline void rq_qos_merge(struct request_queue *q, struct request *rq, + struct bio *bio) +{ + if (q->rq_qos) + __rq_qos_merge(q->rq_qos, rq, bio); +} + +static inline void rq_qos_queue_depth_changed(struct request_queue *q) +{ + if (q->rq_qos) + __rq_qos_queue_depth_changed(q->rq_qos); +} + void rq_qos_exit(struct request_queue *); #endif diff --git a/block/blk-settings.c b/block/blk-settings.c index 2c1831207a8f..6bd1e3b082d8 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -805,7 +805,7 @@ EXPORT_SYMBOL(blk_queue_update_dma_alignment); void blk_set_queue_depth(struct request_queue *q, unsigned int depth) { q->queue_depth = depth; - wbt_set_queue_depth(q, depth); + rq_qos_queue_depth_changed(q); } EXPORT_SYMBOL(blk_set_queue_depth); @@ -832,6 +832,22 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua) } EXPORT_SYMBOL_GPL(blk_queue_write_cache); +/** + * blk_queue_required_elevator_features - Set a queue required elevator features + * @q: the request queue for the target device + * @features: Required elevator features OR'ed together + * + * Tell the block layer that for the device controlled through @q, only the + * only elevators that can be used are those that implement at least the set of + * features specified by @features. + */ +void blk_queue_required_elevator_features(struct request_queue *q, + unsigned int features) +{ + q->required_elevator_features = features; +} +EXPORT_SYMBOL_GPL(blk_queue_required_elevator_features); + static int __init blk_settings_init(void) { blk_max_low_pfn = max_low_pfn - 1; diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 9bfa3ea4ed63..b82736c781c5 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -941,14 +941,14 @@ int blk_register_queue(struct gendisk *disk) int ret; struct device *dev = disk_to_dev(disk); struct request_queue *q = disk->queue; + bool has_elevator = false; if (WARN_ON(!q)) return -ENXIO; - WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags), + WARN_ONCE(blk_queue_registered(q), "%s is registering an already registered queue\n", kobject_name(&dev->kobj)); - blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); /* * SCSI probing may synchronously create and destroy a lot of @@ -968,8 +968,7 @@ int blk_register_queue(struct gendisk *disk) if (ret) return ret; - /* Prevent changes through sysfs until registration is completed. */ - mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue"); if (ret < 0) { @@ -990,26 +989,36 @@ int blk_register_queue(struct gendisk *disk) blk_mq_debugfs_register(q); } - kobject_uevent(&q->kobj, KOBJ_ADD); - - wbt_enable_default(q); - - blk_throtl_register_queue(q); - + /* + * The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator + * switch won't happen at all. + */ if (q->elevator) { - ret = elv_register_queue(q); + ret = elv_register_queue(q, false); if (ret) { - mutex_unlock(&q->sysfs_lock); - kobject_uevent(&q->kobj, KOBJ_REMOVE); + mutex_unlock(&q->sysfs_dir_lock); kobject_del(&q->kobj); blk_trace_remove_sysfs(dev); kobject_put(&dev->kobj); return ret; } + has_elevator = true; } + + mutex_lock(&q->sysfs_lock); + blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); + wbt_enable_default(q); + blk_throtl_register_queue(q); + + /* Now everything is ready and send out KOBJ_ADD uevent */ + kobject_uevent(&q->kobj, KOBJ_ADD); + if (has_elevator) + kobject_uevent(&q->elevator->kobj, KOBJ_ADD); + mutex_unlock(&q->sysfs_lock); + ret = 0; unlock: - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); return ret; } EXPORT_SYMBOL_GPL(blk_register_queue); @@ -1029,7 +1038,7 @@ void blk_unregister_queue(struct gendisk *disk) return; /* Return early if disk->queue was never registered. */ - if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + if (!blk_queue_registered(q)) return; /* @@ -1038,25 +1047,28 @@ void blk_unregister_queue(struct gendisk *disk) * concurrent elv_iosched_store() calls. */ mutex_lock(&q->sysfs_lock); - blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q); + mutex_unlock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); /* * Remove the sysfs attributes before unregistering the queue data * structures that can be modified through sysfs. */ if (queue_is_mq(q)) blk_mq_unregister_dev(disk_to_dev(disk), q); - mutex_unlock(&q->sysfs_lock); kobject_uevent(&q->kobj, KOBJ_REMOVE); kobject_del(&q->kobj); blk_trace_remove_sysfs(disk_to_dev(disk)); - mutex_lock(&q->sysfs_lock); + /* + * q->kobj has been removed, so it is safe to check if elevator + * exists without holding q->sysfs_lock. + */ if (q->elevator) elv_unregister_queue(q); - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); kobject_put(&disk_to_dev(disk)->kobj); } diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 8ab6c8153223..18f773e52dfb 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -478,12 +478,14 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq) timer_setup(&sq->pending_timer, throtl_pending_timer_fn, 0); } -static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node) +static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, + struct request_queue *q, + struct blkcg *blkcg) { struct throtl_grp *tg; int rw; - tg = kzalloc_node(sizeof(*tg), gfp, node); + tg = kzalloc_node(sizeof(*tg), gfp, q->node); if (!tg) return NULL; @@ -2246,7 +2248,8 @@ void blk_throtl_stat_add(struct request *rq, u64 time_ns) struct request_queue *q = rq->q; struct throtl_data *td = q->td; - throtl_track_latency(td, rq->throtl_size, req_op(rq), time_ns >> 10); + throtl_track_latency(td, blk_rq_stats_sectors(rq), req_op(rq), + time_ns >> 10); } void blk_throtl_bio_endio(struct bio *bio) diff --git a/block/blk-wbt.c b/block/blk-wbt.c index 313f45a37e9d..8af553a0ba00 100644 --- a/block/blk-wbt.c +++ b/block/blk-wbt.c @@ -629,15 +629,6 @@ static void wbt_requeue(struct rq_qos *rqos, struct request *rq) } } -void wbt_set_queue_depth(struct request_queue *q, unsigned int depth) -{ - struct rq_qos *rqos = wbt_rq_qos(q); - if (rqos) { - RQWB(rqos)->rq_depth.queue_depth = depth; - __wbt_update_limits(RQWB(rqos)); - } -} - void wbt_set_write_cache(struct request_queue *q, bool write_cache_on) { struct rq_qos *rqos = wbt_rq_qos(q); @@ -656,7 +647,7 @@ void wbt_enable_default(struct request_queue *q) return; /* Queue not registered? Maybe shutting down... */ - if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + if (!blk_queue_registered(q)) return; if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ)) @@ -689,6 +680,12 @@ static int wbt_data_dir(const struct request *rq) return -1; } +static void wbt_queue_depth_changed(struct rq_qos *rqos) +{ + RQWB(rqos)->rq_depth.queue_depth = blk_queue_depth(rqos->q); + __wbt_update_limits(RQWB(rqos)); +} + static void wbt_exit(struct rq_qos *rqos) { struct rq_wb *rwb = RQWB(rqos); @@ -811,6 +808,7 @@ static struct rq_qos_ops wbt_rqos_ops = { .requeue = wbt_requeue, .done = wbt_done, .cleanup = wbt_cleanup, + .queue_depth_changed = wbt_queue_depth_changed, .exit = wbt_exit, #ifdef CONFIG_BLK_DEBUG_FS .debugfs_attrs = wbt_debugfs_attrs, @@ -853,7 +851,7 @@ int wbt_init(struct request_queue *q) rwb->min_lat_nsec = wbt_default_latency_nsec(q); - wbt_set_queue_depth(q, blk_queue_depth(q)); + wbt_queue_depth_changed(&rwb->rqos); wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags)); return 0; diff --git a/block/blk-wbt.h b/block/blk-wbt.h index f47218d5b3b2..8e4e37660971 100644 --- a/block/blk-wbt.h +++ b/block/blk-wbt.h @@ -95,7 +95,6 @@ void wbt_enable_default(struct request_queue *); u64 wbt_get_min_lat(struct request_queue *q); void wbt_set_min_lat(struct request_queue *q, u64 val); -void wbt_set_queue_depth(struct request_queue *, unsigned int); void wbt_set_write_cache(struct request_queue *, bool); u64 wbt_default_latency_nsec(struct request_queue *); @@ -118,9 +117,6 @@ static inline void wbt_disable_default(struct request_queue *q) static inline void wbt_enable_default(struct request_queue *q) { } -static inline void wbt_set_queue_depth(struct request_queue *q, unsigned int depth) -{ -} static inline void wbt_set_write_cache(struct request_queue *q, bool wc) { } diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 6c503824ba3f..4bc5f260248a 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -202,6 +202,42 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector, } EXPORT_SYMBOL_GPL(blkdev_report_zones); +/* + * Special case of zone reset operation to reset all zones in one command, + * useful for applications like mkfs. + */ +static int __blkdev_reset_all_zones(struct block_device *bdev, gfp_t gfp_mask) +{ + struct bio *bio = bio_alloc(gfp_mask, 0); + int ret; + + /* across the zones operations, don't need any sectors */ + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_ZONE_RESET_ALL, 0); + + ret = submit_bio_wait(bio); + bio_put(bio); + + return ret; +} + +static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev, + sector_t nr_sectors) +{ + if (!blk_queue_zone_resetall(bdev_get_queue(bdev))) + return false; + + if (nr_sectors != part_nr_sects_read(bdev->bd_part)) + return false; + /* + * REQ_OP_ZONE_RESET_ALL can be executed only if the block device is + * the entire disk, that is, if the blocks device start offset is 0 and + * its capacity is the same as the entire disk. + */ + return get_start_sect(bdev) == 0 && + part_nr_sects_read(bdev->bd_part) == get_capacity(bdev->bd_disk); +} + /** * blkdev_reset_zones - Reset zones write pointer * @bdev: Target block device @@ -235,6 +271,9 @@ int blkdev_reset_zones(struct block_device *bdev, /* Out of range */ return -EINVAL; + if (blkdev_allow_reset_all_zones(bdev, nr_sectors)) + return __blkdev_reset_all_zones(bdev, gfp_mask); + /* Check alignment (handle eventual smaller last zone) */ zone_sectors = blk_queue_zone_sectors(q); if (sector & (zone_sectors - 1)) diff --git a/block/blk.h b/block/blk.h index de6b2e146d6e..ed347f7a97b1 100644 --- a/block/blk.h +++ b/block/blk.h @@ -184,11 +184,11 @@ void blk_account_io_done(struct request *req, u64 now); void blk_insert_flush(struct request *rq); -int elevator_init_mq(struct request_queue *q); +void elevator_init_mq(struct request_queue *q); int elevator_switch_mq(struct request_queue *q, struct elevator_type *new_e); void __elevator_exit(struct request_queue *, struct elevator_queue *); -int elv_register_queue(struct request_queue *q); +int elv_register_queue(struct request_queue *q, bool uevent); void elv_unregister_queue(struct request_queue *q); static inline void elevator_exit(struct request_queue *q, diff --git a/block/elevator.c b/block/elevator.c index 2f17d66d0e61..bba10e83478a 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -83,8 +83,26 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio) } EXPORT_SYMBOL(elv_bio_merge_ok); -static bool elevator_match(const struct elevator_type *e, const char *name) +static inline bool elv_support_features(unsigned int elv_features, + unsigned int required_features) { + return (required_features & elv_features) == required_features; +} + +/** + * elevator_match - Test an elevator name and features + * @e: Scheduler to test + * @name: Elevator name to test + * @required_features: Features that the elevator must provide + * + * Return true is the elevator @e name matches @name and if @e provides all the + * the feratures spcified by @required_features. + */ +static bool elevator_match(const struct elevator_type *e, const char *name, + unsigned int required_features) +{ + if (!elv_support_features(e->elevator_features, required_features)) + return false; if (!strcmp(e->elevator_name, name)) return true; if (e->elevator_alias && !strcmp(e->elevator_alias, name)) @@ -93,15 +111,21 @@ static bool elevator_match(const struct elevator_type *e, const char *name) return false; } -/* - * Return scheduler with name 'name' +/** + * elevator_find - Find an elevator + * @name: Name of the elevator to find + * @required_features: Features that the elevator must provide + * + * Return the first registered scheduler with name @name and supporting the + * features @required_features and NULL otherwise. */ -static struct elevator_type *elevator_find(const char *name) +static struct elevator_type *elevator_find(const char *name, + unsigned int required_features) { struct elevator_type *e; list_for_each_entry(e, &elv_list, list) { - if (elevator_match(e, name)) + if (elevator_match(e, name, required_features)) return e; } @@ -120,12 +144,12 @@ static struct elevator_type *elevator_get(struct request_queue *q, spin_lock(&elv_list_lock); - e = elevator_find(name); + e = elevator_find(name, q->required_elevator_features); if (!e && try_loading) { spin_unlock(&elv_list_lock); request_module("%s-iosched", name); spin_lock(&elv_list_lock); - e = elevator_find(name); + e = elevator_find(name, q->required_elevator_features); } if (e && !try_module_get(e->elevator_owner)) @@ -135,20 +159,6 @@ static struct elevator_type *elevator_get(struct request_queue *q, return e; } -static char chosen_elevator[ELV_NAME_MAX]; - -static int __init elevator_setup(char *str) -{ - /* - * Be backwards-compatible with previous kernels, so users - * won't get the wrong elevator. - */ - strncpy(chosen_elevator, str, sizeof(chosen_elevator) - 1); - return 1; -} - -__setup("elevator=", elevator_setup); - static struct kobj_type elv_ktype; struct elevator_queue *elevator_alloc(struct request_queue *q, @@ -470,13 +480,16 @@ static struct kobj_type elv_ktype = { .release = elevator_release, }; -int elv_register_queue(struct request_queue *q) +/* + * elv_register_queue is called from either blk_register_queue or + * elevator_switch, elevator switch is prevented from being happen + * in the two paths, so it is safe to not hold q->sysfs_lock. + */ +int elv_register_queue(struct request_queue *q, bool uevent) { struct elevator_queue *e = q->elevator; int error; - lockdep_assert_held(&q->sysfs_lock); - error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched"); if (!error) { struct elv_fs_entry *attr = e->type->elevator_attrs; @@ -487,24 +500,34 @@ int elv_register_queue(struct request_queue *q) attr++; } } - kobject_uevent(&e->kobj, KOBJ_ADD); + if (uevent) + kobject_uevent(&e->kobj, KOBJ_ADD); + + mutex_lock(&q->sysfs_lock); e->registered = 1; + mutex_unlock(&q->sysfs_lock); } return error; } +/* + * elv_unregister_queue is called from either blk_unregister_queue or + * elevator_switch, elevator switch is prevented from being happen + * in the two paths, so it is safe to not hold q->sysfs_lock. + */ void elv_unregister_queue(struct request_queue *q) { - lockdep_assert_held(&q->sysfs_lock); - if (q) { struct elevator_queue *e = q->elevator; kobject_uevent(&e->kobj, KOBJ_REMOVE); kobject_del(&e->kobj); + + mutex_lock(&q->sysfs_lock); e->registered = 0; /* Re-enable throttling in case elevator disabled it */ wbt_enable_default(q); + mutex_unlock(&q->sysfs_lock); } } @@ -526,7 +549,7 @@ int elv_register(struct elevator_type *e) /* register, don't allow duplicate names */ spin_lock(&elv_list_lock); - if (elevator_find(e->elevator_name)) { + if (elevator_find(e->elevator_name, 0)) { spin_unlock(&elv_list_lock); kmem_cache_destroy(e->icq_cache); return -EBUSY; @@ -567,10 +590,32 @@ int elevator_switch_mq(struct request_queue *q, lockdep_assert_held(&q->sysfs_lock); if (q->elevator) { - if (q->elevator->registered) + if (q->elevator->registered) { + mutex_unlock(&q->sysfs_lock); + + /* + * Concurrent elevator switch can't happen becasue + * sysfs write is always exclusively on same file. + * + * Also the elevator queue won't be freed after + * sysfs_lock is released becasue kobject_del() in + * blk_unregister_queue() waits for completion of + * .store & .show on its attributes. + */ elv_unregister_queue(q); + + mutex_lock(&q->sysfs_lock); + } ioc_clear_queue(q); elevator_exit(q, q->elevator); + + /* + * sysfs_lock may be dropped, so re-check if queue is + * unregistered. If yes, don't switch to new elevator + * any more + */ + if (!blk_queue_registered(q)) + return 0; } ret = blk_mq_init_sched(q, new_e); @@ -578,7 +623,11 @@ int elevator_switch_mq(struct request_queue *q, goto out; if (new_e) { - ret = elv_register_queue(q); + mutex_unlock(&q->sysfs_lock); + + ret = elv_register_queue(q, true); + + mutex_lock(&q->sysfs_lock); if (ret) { elevator_exit(q, q->elevator); goto out; @@ -594,37 +643,89 @@ out: return ret; } +static inline bool elv_support_iosched(struct request_queue *q) +{ + if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED)) + return false; + return true; +} + /* - * For blk-mq devices, we default to using mq-deadline, if available, for single - * queue devices. If deadline isn't available OR we have multiple queues, - * default to "none". + * For single queue devices, default to using mq-deadline. If we have multiple + * queues or mq-deadline is not available, default to "none". */ -int elevator_init_mq(struct request_queue *q) +static struct elevator_type *elevator_get_default(struct request_queue *q) +{ + if (q->nr_hw_queues != 1) + return NULL; + + return elevator_get(q, "mq-deadline", false); +} + +/* + * Get the first elevator providing the features required by the request queue. + * Default to "none" if no matching elevator is found. + */ +static struct elevator_type *elevator_get_by_features(struct request_queue *q) +{ + struct elevator_type *e, *found = NULL; + + spin_lock(&elv_list_lock); + + list_for_each_entry(e, &elv_list, list) { + if (elv_support_features(e->elevator_features, + q->required_elevator_features)) { + found = e; + break; + } + } + + if (found && !try_module_get(found->elevator_owner)) + found = NULL; + + spin_unlock(&elv_list_lock); + return found; +} + +/* + * For a device queue that has no required features, use the default elevator + * settings. Otherwise, use the first elevator available matching the required + * features. If no suitable elevator is find or if the chosen elevator + * initialization fails, fall back to the "none" elevator (no elevator). + */ +void elevator_init_mq(struct request_queue *q) { struct elevator_type *e; - int err = 0; + int err; - if (q->nr_hw_queues != 1) - return 0; + if (!elv_support_iosched(q)) + return; + + WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)); - /* - * q->sysfs_lock must be held to provide mutual exclusion between - * elevator_switch() and here. - */ - mutex_lock(&q->sysfs_lock); if (unlikely(q->elevator)) - goto out_unlock; + return; - e = elevator_get(q, "mq-deadline", false); + if (!q->required_elevator_features) + e = elevator_get_default(q); + else + e = elevator_get_by_features(q); if (!e) - goto out_unlock; + return; + + blk_mq_freeze_queue(q); + blk_mq_quiesce_queue(q); err = blk_mq_init_sched(q, e); - if (err) + + blk_mq_unquiesce_queue(q); + blk_mq_unfreeze_queue(q); + + if (err) { + pr_warn("\"%s\" elevator initialization failed, " + "falling back to \"none\"\n", e->elevator_name); elevator_put(e); -out_unlock: - mutex_unlock(&q->sysfs_lock); - return err; + } } @@ -660,7 +761,7 @@ static int __elevator_change(struct request_queue *q, const char *name) struct elevator_type *e; /* Make sure queue is not in the middle of being removed */ - if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + if (!blk_queue_registered(q)) return -ENOENT; /* @@ -677,7 +778,8 @@ static int __elevator_change(struct request_queue *q, const char *name) if (!e) return -EINVAL; - if (q->elevator && elevator_match(q->elevator->type, elevator_name)) { + if (q->elevator && + elevator_match(q->elevator->type, elevator_name, 0)) { elevator_put(e); return 0; } @@ -685,13 +787,6 @@ static int __elevator_change(struct request_queue *q, const char *name) return elevator_switch(q, e); } -static inline bool elv_support_iosched(struct request_queue *q) -{ - if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED)) - return false; - return true; -} - ssize_t elv_iosched_store(struct request_queue *q, const char *name, size_t count) { @@ -724,11 +819,13 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name) spin_lock(&elv_list_lock); list_for_each_entry(__e, &elv_list, list) { - if (elv && elevator_match(elv, __e->elevator_name)) { + if (elv && elevator_match(elv, __e->elevator_name, 0)) { len += sprintf(name+len, "[%s] ", elv->elevator_name); continue; } - if (elv_support_iosched(q)) + if (elv_support_iosched(q) && + elevator_match(__e, __e->elevator_name, + q->required_elevator_features)) len += sprintf(name+len, "%s ", __e->elevator_name); } spin_unlock(&elv_list_lock); diff --git a/block/genhd.c b/block/genhd.c index 54f1f0d381f4..26b31fcae217 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk, dev_t devt; int retval; + /* + * The disk queue should now be all set with enough information about + * the device for the elevator code to pick an adequate default + * elevator if one is needed, that is, for devices requesting queue + * registration. + */ + if (register_queue) + elevator_init_mq(disk->queue); + /* minors == 0 indicates to use ext devt from part0 and should * be accompanied with EXT_DEVT flag. Make sure all * parameters make sense. diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 2a2a2e82832e..b490f47fd553 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -377,13 +377,6 @@ done: * hardware queue, but we may return a request that is for a * different hardware queue. This is because mq-deadline has shared * state for all hardware queues, in terms of sorting, FIFOs, etc. - * - * For a zoned block device, __dd_dispatch_request() may return NULL - * if all the queued write requests are directed at zones that are already - * locked due to on-going write requests. In this case, make sure to mark - * the queue as needing a restart to ensure that the queue is run again - * and the pending writes dispatched once the target zones for the ongoing - * write requests are unlocked in dd_finish_request(). */ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx) { @@ -392,9 +385,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx) spin_lock(&dd->lock); rq = __dd_dispatch_request(dd); - if (!rq && blk_queue_is_zoned(hctx->queue) && - !list_empty(&dd->fifo_list[WRITE])) - blk_mq_sched_mark_restart_hctx(hctx); spin_unlock(&dd->lock); return rq; @@ -561,6 +551,13 @@ static void dd_prepare_request(struct request *rq, struct bio *bio) * spinlock so that the zone is never unlocked while deadline_fifo_request() * or deadline_next_request() are executing. This function is called for * all requests, whether or not these requests complete successfully. + * + * For a zoned block device, __dd_dispatch_request() may have stopped + * dispatching requests if all the queued requests are write requests directed + * at zones that are already locked due to on-going write requests. To ensure + * write request dispatch progress in this case, mark the queue as needing a + * restart to ensure that the queue is run again after completion of the + * request and zones being unlocked. */ static void dd_finish_request(struct request *rq) { @@ -572,6 +569,8 @@ static void dd_finish_request(struct request *rq) spin_lock_irqsave(&dd->zone_lock, flags); blk_req_zone_write_unlock(rq); + if (!list_empty(&dd->fifo_list[WRITE])) + blk_mq_sched_mark_restart_hctx(rq->mq_hctx); spin_unlock_irqrestore(&dd->zone_lock, flags); } } @@ -795,6 +794,7 @@ static struct elevator_type mq_deadline = { .elevator_attrs = deadline_attrs, .elevator_name = "mq-deadline", .elevator_alias = "deadline", + .elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE, .elevator_owner = THIS_MODULE, }; MODULE_ALIAS("mq-deadline-iosched"); diff --git a/block/opal_proto.h b/block/opal_proto.h index 466ec7be16ef..5532412d567c 100644 --- a/block/opal_proto.h +++ b/block/opal_proto.h @@ -119,8 +119,6 @@ enum opal_uid { OPAL_UID_HEXFF, }; -#define OPAL_METHOD_LENGTH 8 - /* Enum for indexing the OPALMETHOD array */ enum opal_method { OPAL_PROPERTIES, @@ -167,7 +165,6 @@ enum opal_token { OPAL_TABLE_LASTID = 0x0A, OPAL_TABLE_MIN = 0x0B, OPAL_TABLE_MAX = 0x0C, - /* authority table */ OPAL_PIN = 0x03, /* locking tokens */ @@ -182,7 +179,7 @@ enum opal_token { OPAL_LIFECYCLE = 0x06, /* locking info table */ OPAL_MAXRANGES = 0x04, - /* mbr control */ + /* mbr control */ OPAL_MBRENABLE = 0x01, OPAL_MBRDONE = 0x02, /* properties */ diff --git a/block/sed-opal.c b/block/sed-opal.c index 7e1a444a25b2..4e95a9792162 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -129,7 +129,6 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = { { 0x00, 0x00, 0x00, 0x09, 0x00, 0x00, 0x84, 0x01 }, /* tables */ - [OPAL_TABLE_TABLE] { 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01 }, [OPAL_LOCKINGRANGE_GLOBAL] = @@ -152,7 +151,6 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = { { 0x00, 0x00, 0x08, 0x01, 0x00, 0x00, 0x00, 0x00 }, /* C_PIN_TABLE object ID's */ - [OPAL_C_PIN_MSID] = { 0x00, 0x00, 0x00, 0x0B, 0x00, 0x00, 0x84, 0x02}, [OPAL_C_PIN_SID] = @@ -161,7 +159,6 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = { { 0x00, 0x00, 0x00, 0x0B, 0x00, 0x01, 0x00, 0x01}, /* half UID's (only first 4 bytes used) */ - [OPAL_HALF_UID_AUTHORITY_OBJ_REF] = { 0x00, 0x00, 0x0C, 0x05, 0xff, 0xff, 0xff, 0xff }, [OPAL_HALF_UID_BOOLEAN_ACE] = @@ -517,6 +514,7 @@ static int opal_discovery0(struct opal_dev *dev, void *data) ret = opal_recv_cmd(dev); if (ret) return ret; + return opal_discovery0_end(dev); } @@ -525,6 +523,7 @@ static int opal_discovery0_step(struct opal_dev *dev) const struct opal_step discovery0_step = { opal_discovery0, }; + return execute_step(dev, &discovery0_step, 0); } @@ -551,6 +550,7 @@ static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok) { if (!can_add(err, cmd, 1)) return; + cmd->cmd[cmd->pos++] = tok; } @@ -577,6 +577,7 @@ static void add_medium_atom_header(struct opal_dev *cmd, bool bytestring, header0 |= bytestring ? MEDIUM_ATOM_BYTESTRING : 0; header0 |= has_sign ? MEDIUM_ATOM_SIGNED : 0; header0 |= (len >> 8) & MEDIUM_ATOM_LEN_MASK; + cmd->cmd[cmd->pos++] = header0; cmd->cmd[cmd->pos++] = len; } @@ -649,6 +650,7 @@ static int build_locking_range(u8 *buffer, size_t length, u8 lr) if (lr == 0) return 0; + buffer[5] = LOCKING_RANGE_NON_GLOBAL; buffer[7] = lr; @@ -903,10 +905,6 @@ static int response_parse(const u8 *buf, size_t length, num_entries++; } - if (num_entries == 0) { - pr_debug("Couldn't parse response.\n"); - return -EINVAL; - } resp->num = num_entries; return 0; @@ -945,6 +943,7 @@ static size_t response_get_string(const struct parsed_resp *resp, int n, } *store = tok->pos + skip; + return tok->len - skip; } @@ -1062,6 +1061,7 @@ static int start_opal_session_cont(struct opal_dev *dev) dev->hsn = hsn; dev->tsn = tsn; + return 0; } @@ -1084,6 +1084,7 @@ static int end_session_cont(struct opal_dev *dev) { dev->hsn = 0; dev->tsn = 0; + return parse_and_check_status(dev); } @@ -1172,6 +1173,7 @@ static int gen_key(struct opal_dev *dev, void *data) return err; } + return finalize_and_send(dev, parse_and_check_status); } @@ -1184,12 +1186,14 @@ static int get_active_key_cont(struct opal_dev *dev) error = parse_and_check_status(dev); if (error) return error; + keylen = response_get_string(&dev->parsed, 4, &activekey); if (!activekey) { pr_debug("%s: Couldn't extract the Activekey from the response\n", __func__); return OPAL_INVAL_PARAM; } + dev->prev_data = kmemdup(activekey, keylen, GFP_KERNEL); if (!dev->prev_data) @@ -1251,6 +1255,7 @@ static int generic_lr_enable_disable(struct opal_dev *dev, add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); + return err; } @@ -1263,6 +1268,7 @@ static inline int enable_global_lr(struct opal_dev *dev, u8 *uid, 0, 0); if (err) pr_debug("Failed to create enable global lr command\n"); + return err; } @@ -1313,7 +1319,6 @@ static int setup_locking_range(struct opal_dev *dev, void *data) if (err) { pr_debug("Error building Setup Locking range command.\n"); return err; - } return finalize_and_send(dev, parse_and_check_status); @@ -1393,6 +1398,7 @@ static int start_SIDASP_opal_session(struct opal_dev *dev, void *data) kfree(key); dev->prev_data = NULL; } + return ret; } @@ -1518,6 +1524,7 @@ static int erase_locking_range(struct opal_dev *dev, void *data) pr_debug("Error building Erase Locking Range Command.\n"); return err; } + return finalize_and_send(dev, parse_and_check_status); } @@ -1636,6 +1643,7 @@ static int write_shadow_mbr(struct opal_dev *dev, void *data) off += len; } + return err; } @@ -1816,6 +1824,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) pr_debug("Error building SET command.\n"); return err; } + return finalize_and_send(dev, parse_and_check_status); } @@ -1857,6 +1866,7 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data) pr_debug("Error building SET command.\n"); return ret; } + return finalize_and_send(dev, parse_and_check_status); } @@ -1957,6 +1967,7 @@ static int end_opal_session(struct opal_dev *dev, void *data) if (err < 0) return err; + return finalize_and_send(dev, end_session_cont); } @@ -1965,6 +1976,7 @@ static int end_opal_session_error(struct opal_dev *dev) const struct opal_step error_end_session = { end_opal_session, }; + return execute_step(dev, &error_end_session, 0); } @@ -1984,6 +1996,7 @@ static int check_opal_support(struct opal_dev *dev) ret = opal_discovery0_step(dev); dev->supported = !ret; mutex_unlock(&dev->dev_lock); + return ret; } @@ -2004,6 +2017,7 @@ void free_opal_dev(struct opal_dev *dev) { if (!dev) return; + clean_opal_dev(dev); kfree(dev); } @@ -2026,6 +2040,7 @@ struct opal_dev *init_opal_dev(void *data, sec_send_recv *send_recv) kfree(dev); return NULL; } + return dev; } EXPORT_SYMBOL(init_opal_dev); @@ -2045,6 +2060,7 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2062,6 +2078,7 @@ static int opal_erase_locking_range(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2089,6 +2106,7 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2113,6 +2131,7 @@ static int opal_set_mbr_done(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2133,6 +2152,7 @@ static int opal_write_shadow_mbr(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2151,6 +2171,7 @@ static int opal_save(struct opal_dev *dev, struct opal_lock_unlock *lk_unlk) setup_opal_dev(dev); add_suspend_info(dev, suspend); mutex_unlock(&dev->dev_lock); + return 0; } @@ -2169,12 +2190,14 @@ static int opal_add_user_to_lr(struct opal_dev *dev, pr_debug("Locking state was not RO or RW\n"); return -EINVAL; } + if (lk_unlk->session.who < OPAL_USER1 || lk_unlk->session.who > OPAL_USER9) { pr_debug("Authority was not within the range of users: %d\n", lk_unlk->session.who); return -EINVAL; } + if (lk_unlk->session.sum) { pr_debug("%s not supported in sum. Use setup locking range\n", __func__); @@ -2185,6 +2208,7 @@ static int opal_add_user_to_lr(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, steps, ARRAY_SIZE(steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2267,6 +2291,7 @@ static int opal_lock_unlock(struct opal_dev *dev, mutex_lock(&dev->dev_lock); ret = __opal_lock_unlock(dev, lk_unlk); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2289,6 +2314,7 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal) setup_opal_dev(dev); ret = execute_steps(dev, owner_steps, ARRAY_SIZE(owner_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2310,6 +2336,7 @@ static int opal_activate_lsp(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, active_steps, ARRAY_SIZE(active_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2327,6 +2354,7 @@ static int opal_setup_locking_range(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, lr_steps, ARRAY_SIZE(lr_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2347,6 +2375,7 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw) setup_opal_dev(dev); ret = execute_steps(dev, pw_steps, ARRAY_SIZE(pw_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2371,6 +2400,7 @@ static int opal_activate_user(struct opal_dev *dev, setup_opal_dev(dev); ret = execute_steps(dev, act_steps, ARRAY_SIZE(act_steps)); mutex_unlock(&dev->dev_lock); + return ret; } @@ -2382,6 +2412,7 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) if (!dev) return false; + if (!dev->supported) return false; @@ -2399,6 +2430,7 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) suspend->unlk.session.sum); was_failure = true; } + if (dev->mbr_enabled) { ret = __opal_set_mbr_done(dev, &suspend->unlk.session.opal_key); if (ret) @@ -2406,6 +2438,7 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) } } mutex_unlock(&dev->dev_lock); + return was_failure; } EXPORT_SYMBOL(opal_unlock_from_suspend); diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 0469aceaa230..485865fd0412 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -3780,7 +3780,7 @@ static int compat_getdrvprm(int drive, v.native_format = UDP->native_format; mutex_unlock(&floppy_mutex); - if (copy_from_user(arg, &v, sizeof(struct compat_floppy_drive_params))) + if (copy_to_user(arg, &v, sizeof(struct compat_floppy_drive_params))) return -EFAULT; return 0; } @@ -3816,7 +3816,7 @@ static int compat_getdrvstat(int drive, bool poll, v.bufblocks = UDRS->bufblocks; mutex_unlock(&floppy_mutex); - if (copy_from_user(arg, &v, sizeof(struct compat_floppy_drive_struct))) + if (copy_to_user(arg, &v, sizeof(struct compat_floppy_drive_struct))) return -EFAULT; return 0; Eintr: diff --git a/drivers/block/loop.c b/drivers/block/loop.c index ab7ca5989097..1410fa893653 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1755,6 +1755,7 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode, case LOOP_SET_FD: case LOOP_CHANGE_FD: case LOOP_SET_BLOCK_SIZE: + case LOOP_SET_DIRECT_IO: err = lo_ioctl(bdev, mode, cmd, arg); break; default: diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index e21d2ded732b..a8e3815295fe 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -108,6 +108,7 @@ struct nbd_device { struct nbd_config *config; struct mutex config_lock; struct gendisk *disk; + struct workqueue_struct *recv_workq; struct list_head list; struct task_struct *task_recv; @@ -121,6 +122,7 @@ struct nbd_cmd { struct mutex lock; int index; int cookie; + int retries; blk_status_t status; unsigned long flags; u32 cmd_cookie; @@ -138,7 +140,6 @@ static struct dentry *nbd_dbg_dir; static unsigned int nbds_max = 16; static int max_part = 16; -static struct workqueue_struct *recv_workqueue; static int part_shift; static int nbd_dev_dbg_init(struct nbd_device *nbd); @@ -344,6 +345,22 @@ static void sock_shutdown(struct nbd_device *nbd) dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n"); } +static u32 req_to_nbd_cmd_type(struct request *req) +{ + switch (req_op(req)) { + case REQ_OP_DISCARD: + return NBD_CMD_TRIM; + case REQ_OP_FLUSH: + return NBD_CMD_FLUSH; + case REQ_OP_WRITE: + return NBD_CMD_WRITE; + case REQ_OP_READ: + return NBD_CMD_READ; + default: + return U32_MAX; + } +} + static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, bool reserved) { @@ -357,8 +374,10 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, } config = nbd->config; - if (!mutex_trylock(&cmd->lock)) + if (!mutex_trylock(&cmd->lock)) { + nbd_config_put(nbd); return BLK_EH_RESET_TIMER; + } if (config->num_connections > 1) { dev_err_ratelimited(nbd_to_dev(nbd), @@ -389,10 +408,25 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, nbd_config_put(nbd); return BLK_EH_DONE; } - } else { - dev_err_ratelimited(nbd_to_dev(nbd), - "Connection timed out\n"); } + + if (!nbd->tag_set.timeout) { + /* + * Userspace sets timeout=0 to disable socket disconnection, + * so just warn and reset the timer. + */ + cmd->retries++; + dev_info(nbd_to_dev(nbd), "Possible stuck request %p: control (%s@%llu,%uB). Runtime %u seconds\n", + req, nbdcmd_to_ascii(req_to_nbd_cmd_type(req)), + (unsigned long long)blk_rq_pos(req) << 9, + blk_rq_bytes(req), (req->timeout / HZ) * cmd->retries); + + mutex_unlock(&cmd->lock); + nbd_config_put(nbd); + return BLK_EH_RESET_TIMER; + } + + dev_err_ratelimited(nbd_to_dev(nbd), "Connection timed out\n"); set_bit(NBD_TIMEDOUT, &config->runtime_flags); cmd->status = BLK_STS_IOERR; mutex_unlock(&cmd->lock); @@ -480,22 +514,9 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index) iov_iter_kvec(&from, WRITE, &iov, 1, sizeof(request)); - switch (req_op(req)) { - case REQ_OP_DISCARD: - type = NBD_CMD_TRIM; - break; - case REQ_OP_FLUSH: - type = NBD_CMD_FLUSH; - break; - case REQ_OP_WRITE: - type = NBD_CMD_WRITE; - break; - case REQ_OP_READ: - type = NBD_CMD_READ; - break; - default: + type = req_to_nbd_cmd_type(req); + if (type == U32_MAX) return -EIO; - } if (rq_data_dir(req) == WRITE && (config->flags & NBD_FLAG_READ_ONLY)) { @@ -526,6 +547,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index) } cmd->index = index; cmd->cookie = nsock->cookie; + cmd->retries = 0; request.type = htonl(type | nbd_cmd_flags); if (type != NBD_CMD_FLUSH) { request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9); @@ -1036,7 +1058,7 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg) /* We take the tx_mutex in an error path in the recv_work, so we * need to queue_work outside of the tx_mutex. */ - queue_work(recv_workqueue, &args->work); + queue_work(nbd->recv_workq, &args->work); atomic_inc(&config->live_connections); wake_up(&config->conn_wait); @@ -1137,6 +1159,10 @@ static void nbd_config_put(struct nbd_device *nbd) kfree(nbd->config); nbd->config = NULL; + if (nbd->recv_workq) + destroy_workqueue(nbd->recv_workq); + nbd->recv_workq = NULL; + nbd->tag_set.timeout = 0; nbd->disk->queue->limits.discard_granularity = 0; nbd->disk->queue->limits.discard_alignment = 0; @@ -1165,6 +1191,14 @@ static int nbd_start_device(struct nbd_device *nbd) return -EINVAL; } + nbd->recv_workq = alloc_workqueue("knbd%d-recv", + WQ_MEM_RECLAIM | WQ_HIGHPRI | + WQ_UNBOUND, 0, nbd->index); + if (!nbd->recv_workq) { + dev_err(disk_to_dev(nbd->disk), "Could not allocate knbd recv work queue.\n"); + return -ENOMEM; + } + blk_mq_update_nr_hw_queues(&nbd->tag_set, config->num_connections); nbd->task_recv = current; @@ -1195,7 +1229,7 @@ static int nbd_start_device(struct nbd_device *nbd) INIT_WORK(&args->work, recv_work); args->nbd = nbd; args->index = i; - queue_work(recv_workqueue, &args->work); + queue_work(nbd->recv_workq, &args->work); } nbd_size_update(nbd); return error; @@ -1215,8 +1249,10 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b mutex_unlock(&nbd->config_lock); ret = wait_event_interruptible(config->recv_wq, atomic_read(&config->recv_threads) == 0); - if (ret) + if (ret) { sock_shutdown(nbd); + flush_workqueue(nbd->recv_workq); + } mutex_lock(&nbd->config_lock); nbd_bdev_reset(bdev); /* user requested, ignore socket errors */ @@ -1246,6 +1282,13 @@ static bool nbd_is_valid_blksize(unsigned long blksize) return true; } +static void nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout) +{ + nbd->tag_set.timeout = timeout * HZ; + if (timeout) + blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ); +} + /* Must be called with config_lock held */ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, unsigned int cmd, unsigned long arg) @@ -1276,10 +1319,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, nbd_size_set(nbd, config->blksize, arg); return 0; case NBD_SET_TIMEOUT: - if (arg) { - nbd->tag_set.timeout = arg * HZ; - blk_queue_rq_timeout(nbd->disk->queue, arg * HZ); - } + nbd_set_cmd_timeout(nbd, arg); return 0; case NBD_SET_FLAGS: @@ -1799,11 +1839,9 @@ again: if (ret) goto out; - if (info->attrs[NBD_ATTR_TIMEOUT]) { - u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]); - nbd->tag_set.timeout = timeout * HZ; - blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ); - } + if (info->attrs[NBD_ATTR_TIMEOUT]) + nbd_set_cmd_timeout(nbd, + nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT])); if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) { config->dead_conn_timeout = nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]); @@ -1875,6 +1913,12 @@ static void nbd_disconnect_and_put(struct nbd_device *nbd) nbd_disconnect(nbd); nbd_clear_sock(nbd); mutex_unlock(&nbd->config_lock); + /* + * Make sure recv thread has finished, so it does not drop the last + * config ref and try to destroy the workqueue from inside the work + * queue. + */ + flush_workqueue(nbd->recv_workq); if (test_and_clear_bit(NBD_HAS_CONFIG_REF, &nbd->config->runtime_flags)) nbd_config_put(nbd); @@ -1971,11 +2015,9 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info) if (ret) goto out; - if (info->attrs[NBD_ATTR_TIMEOUT]) { - u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]); - nbd->tag_set.timeout = timeout * HZ; - blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ); - } + if (info->attrs[NBD_ATTR_TIMEOUT]) + nbd_set_cmd_timeout(nbd, + nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT])); if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) { config->dead_conn_timeout = nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]); @@ -2261,20 +2303,12 @@ static int __init nbd_init(void) if (nbds_max > 1UL << (MINORBITS - part_shift)) return -EINVAL; - recv_workqueue = alloc_workqueue("knbd-recv", - WQ_MEM_RECLAIM | WQ_HIGHPRI | - WQ_UNBOUND, 0); - if (!recv_workqueue) - return -ENOMEM; - if (register_blkdev(NBD_MAJOR, "nbd")) { - destroy_workqueue(recv_workqueue); + if (register_blkdev(NBD_MAJOR, "nbd")) return -EIO; - } if (genl_register_family(&nbd_genl_family)) { unregister_blkdev(NBD_MAJOR, "nbd"); - destroy_workqueue(recv_workqueue); return -EINVAL; } nbd_dbg_init(); @@ -2316,7 +2350,6 @@ static void __exit nbd_cleanup(void) idr_destroy(&nbd_index_idr); genl_unregister_family(&nbd_genl_family); - destroy_workqueue(recv_workqueue); unregister_blkdev(NBD_MAJOR, "nbd"); } diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h index a1b9929bd911..a235c45e22a7 100644 --- a/drivers/block/null_blk.h +++ b/drivers/block/null_blk.h @@ -2,6 +2,9 @@ #ifndef __BLK_NULL_BLK_H #define __BLK_NULL_BLK_H +#undef pr_fmt +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + #include <linux/blkdev.h> #include <linux/slab.h> #include <linux/blk-mq.h> @@ -90,13 +93,13 @@ int null_zone_init(struct nullb_device *dev); void null_zone_exit(struct nullb_device *dev); int null_zone_report(struct gendisk *disk, sector_t sector, struct blk_zone *zones, unsigned int *nr_zones); -void null_zone_write(struct nullb_cmd *cmd, sector_t sector, - unsigned int nr_sectors); -void null_zone_reset(struct nullb_cmd *cmd, sector_t sector); +blk_status_t null_handle_zoned(struct nullb_cmd *cmd, + enum req_opf op, sector_t sector, + sector_t nr_sectors); #else static inline int null_zone_init(struct nullb_device *dev) { - pr_err("null_blk: CONFIG_BLK_DEV_ZONED not enabled\n"); + pr_err("CONFIG_BLK_DEV_ZONED not enabled\n"); return -EINVAL; } static inline void null_zone_exit(struct nullb_device *dev) {} @@ -106,10 +109,11 @@ static inline int null_zone_report(struct gendisk *disk, sector_t sector, { return -EOPNOTSUPP; } -static inline void null_zone_write(struct nullb_cmd *cmd, sector_t sector, - unsigned int nr_sectors) +static inline blk_status_t null_handle_zoned(struct nullb_cmd *cmd, + enum req_opf op, sector_t sector, + sector_t nr_sectors) { + return BLK_STS_NOTSUPP; } -static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {} #endif /* CONFIG_BLK_DEV_ZONED */ #endif /* __NULL_BLK_H */ diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c index 99328ded60d1..0e7da5015ccd 100644 --- a/drivers/block/null_blk_main.c +++ b/drivers/block/null_blk_main.c @@ -141,8 +141,8 @@ static int g_bs = 512; module_param_named(bs, g_bs, int, 0444); MODULE_PARM_DESC(bs, "Block size (in bytes)"); -static int nr_devices = 1; -module_param(nr_devices, int, 0444); +static unsigned int nr_devices = 1; +module_param(nr_devices, uint, 0444); MODULE_PARM_DESC(nr_devices, "Number of devices to register"); static bool g_blocking; @@ -1133,93 +1133,61 @@ static void null_restart_queue_async(struct nullb *nullb) blk_mq_start_stopped_hw_queues(q, true); } -static blk_status_t null_handle_cmd(struct nullb_cmd *cmd) +static inline blk_status_t null_handle_throttled(struct nullb_cmd *cmd) { struct nullb_device *dev = cmd->nq->dev; struct nullb *nullb = dev->nullb; - int err = 0; + blk_status_t sts = BLK_STS_OK; + struct request *rq = cmd->rq; - if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) { - struct request *rq = cmd->rq; - - if (!hrtimer_active(&nullb->bw_timer)) - hrtimer_restart(&nullb->bw_timer); - - if (atomic_long_sub_return(blk_rq_bytes(rq), - &nullb->cur_bytes) < 0) { - null_stop_queue(nullb); - /* race with timer */ - if (atomic_long_read(&nullb->cur_bytes) > 0) - null_restart_queue_async(nullb); - /* requeue request */ - return BLK_STS_DEV_RESOURCE; - } - } + if (!hrtimer_active(&nullb->bw_timer)) + hrtimer_restart(&nullb->bw_timer); - if (nullb->dev->badblocks.shift != -1) { - int bad_sectors; - sector_t sector, size, first_bad; - bool is_flush = true; - - if (dev->queue_mode == NULL_Q_BIO && - bio_op(cmd->bio) != REQ_OP_FLUSH) { - is_flush = false; - sector = cmd->bio->bi_iter.bi_sector; - size = bio_sectors(cmd->bio); - } - if (dev->queue_mode != NULL_Q_BIO && - req_op(cmd->rq) != REQ_OP_FLUSH) { - is_flush = false; - sector = blk_rq_pos(cmd->rq); - size = blk_rq_sectors(cmd->rq); - } - if (!is_flush && badblocks_check(&nullb->dev->badblocks, sector, - size, &first_bad, &bad_sectors)) { - cmd->error = BLK_STS_IOERR; - goto out; - } + if (atomic_long_sub_return(blk_rq_bytes(rq), &nullb->cur_bytes) < 0) { + null_stop_queue(nullb); + /* race with timer */ + if (atomic_long_read(&nullb->cur_bytes) > 0) + null_restart_queue_async(nullb); + /* requeue request */ + sts = BLK_STS_DEV_RESOURCE; } + return sts; +} - if (dev->memory_backed) { - if (dev->queue_mode == NULL_Q_BIO) { - if (bio_op(cmd->bio) == REQ_OP_FLUSH) - err = null_handle_flush(nullb); - else - err = null_handle_bio(cmd); - } else { - if (req_op(cmd->rq) == REQ_OP_FLUSH) - err = null_handle_flush(nullb); - else - err = null_handle_rq(cmd); - } - } - cmd->error = errno_to_blk_status(err); +static inline blk_status_t null_handle_badblocks(struct nullb_cmd *cmd, + sector_t sector, + sector_t nr_sectors) +{ + struct badblocks *bb = &cmd->nq->dev->badblocks; + sector_t first_bad; + int bad_sectors; - if (!cmd->error && dev->zoned) { - sector_t sector; - unsigned int nr_sectors; - enum req_opf op; + if (badblocks_check(bb, sector, nr_sectors, &first_bad, &bad_sectors)) + return BLK_STS_IOERR; - if (dev->queue_mode == NULL_Q_BIO) { - op = bio_op(cmd->bio); - sector = cmd->bio->bi_iter.bi_sector; - nr_sectors = cmd->bio->bi_iter.bi_size >> 9; - } else { - op = req_op(cmd->rq); - sector = blk_rq_pos(cmd->rq); - nr_sectors = blk_rq_sectors(cmd->rq); - } + return BLK_STS_OK; +} - if (op == REQ_OP_WRITE) - null_zone_write(cmd, sector, nr_sectors); - else if (op == REQ_OP_ZONE_RESET) - null_zone_reset(cmd, sector); - } -out: +static inline blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd, + enum req_opf op) +{ + struct nullb_device *dev = cmd->nq->dev; + int err; + + if (dev->queue_mode == NULL_Q_BIO) + err = null_handle_bio(cmd); + else + err = null_handle_rq(cmd); + + return errno_to_blk_status(err); +} + +static inline void nullb_complete_cmd(struct nullb_cmd *cmd) +{ /* Complete IO by inline, softirq or timer */ - switch (dev->irqmode) { + switch (cmd->nq->dev->irqmode) { case NULL_IRQ_SOFTIRQ: - switch (dev->queue_mode) { + switch (cmd->nq->dev->queue_mode) { case NULL_Q_MQ: blk_mq_complete_request(cmd->rq); break; @@ -1238,6 +1206,40 @@ out: null_cmd_end_timer(cmd); break; } +} + +static blk_status_t null_handle_cmd(struct nullb_cmd *cmd, sector_t sector, + sector_t nr_sectors, enum req_opf op) +{ + struct nullb_device *dev = cmd->nq->dev; + struct nullb *nullb = dev->nullb; + blk_status_t sts; + + if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) { + sts = null_handle_throttled(cmd); + if (sts != BLK_STS_OK) + return sts; + } + + if (op == REQ_OP_FLUSH) { + cmd->error = errno_to_blk_status(null_handle_flush(nullb)); + goto out; + } + + if (nullb->dev->badblocks.shift != -1) { + cmd->error = null_handle_badblocks(cmd, sector, nr_sectors); + if (cmd->error != BLK_STS_OK) + goto out; + } + + if (dev->memory_backed) + cmd->error = null_handle_memory_backed(cmd, op); + + if (!cmd->error && dev->zoned) + cmd->error = null_handle_zoned(cmd, op, sector, nr_sectors); + +out: + nullb_complete_cmd(cmd); return BLK_STS_OK; } @@ -1280,6 +1282,8 @@ static struct nullb_queue *nullb_to_queue(struct nullb *nullb) static blk_qc_t null_queue_bio(struct request_queue *q, struct bio *bio) { + sector_t sector = bio->bi_iter.bi_sector; + sector_t nr_sectors = bio_sectors(bio); struct nullb *nullb = q->queuedata; struct nullb_queue *nq = nullb_to_queue(nullb); struct nullb_cmd *cmd; @@ -1287,7 +1291,7 @@ static blk_qc_t null_queue_bio(struct request_queue *q, struct bio *bio) cmd = alloc_cmd(nq, 1); cmd->bio = bio; - null_handle_cmd(cmd); + null_handle_cmd(cmd, sector, nr_sectors, bio_op(bio)); return BLK_QC_T_NONE; } @@ -1311,7 +1315,7 @@ static bool should_requeue_request(struct request *rq) static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res) { - pr_info("null: rq %p timed out\n", rq); + pr_info("rq %p timed out\n", rq); blk_mq_complete_request(rq); return BLK_EH_DONE; } @@ -1321,6 +1325,8 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx, { struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq); struct nullb_queue *nq = hctx->driver_data; + sector_t nr_sectors = blk_rq_sectors(bd->rq); + sector_t sector = blk_rq_pos(bd->rq); might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); @@ -1349,7 +1355,7 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx, if (should_timeout_request(bd->rq)) return BLK_STS_OK; - return null_handle_cmd(cmd); + return null_handle_cmd(cmd, sector, nr_sectors, req_op(bd->rq)); } static const struct blk_mq_ops null_mq_ops = { @@ -1688,6 +1694,9 @@ static int null_add_dev(struct nullb_device *dev) blk_queue_chunk_sectors(nullb->q, dev->zone_size_sects); nullb->q->limits.zoned = BLK_ZONED_HM; + blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, nullb->q); + blk_queue_required_elevator_features(nullb->q, + ELEVATOR_F_ZBD_SEQ_WRITE); } nullb->q->queuedata = nullb; @@ -1739,28 +1748,28 @@ static int __init null_init(void) struct nullb_device *dev; if (g_bs > PAGE_SIZE) { - pr_warn("null_blk: invalid block size\n"); - pr_warn("null_blk: defaults block size to %lu\n", PAGE_SIZE); + pr_warn("invalid block size\n"); + pr_warn("defaults block size to %lu\n", PAGE_SIZE); g_bs = PAGE_SIZE; } if (!is_power_of_2(g_zone_size)) { - pr_err("null_blk: zone_size must be power-of-two\n"); + pr_err("zone_size must be power-of-two\n"); return -EINVAL; } if (g_home_node != NUMA_NO_NODE && g_home_node >= nr_online_nodes) { - pr_err("null_blk: invalid home_node value\n"); + pr_err("invalid home_node value\n"); g_home_node = NUMA_NO_NODE; } if (g_queue_mode == NULL_Q_RQ) { - pr_err("null_blk: legacy IO path no longer available\n"); + pr_err("legacy IO path no longer available\n"); return -EINVAL; } if (g_queue_mode == NULL_Q_MQ && g_use_per_node_hctx) { if (g_submit_queues != nr_online_nodes) { - pr_warn("null_blk: submit_queues param is set to %u.\n", + pr_warn("submit_queues param is set to %u.\n", nr_online_nodes); g_submit_queues = nr_online_nodes; } @@ -1803,7 +1812,7 @@ static int __init null_init(void) } } - pr_info("null: module loaded\n"); + pr_info("module loaded\n"); return 0; err_dev: diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c index cb28d93f2bd1..eabc116832a7 100644 --- a/drivers/block/null_blk_zoned.c +++ b/drivers/block/null_blk_zoned.c @@ -17,7 +17,7 @@ int null_zone_init(struct nullb_device *dev) unsigned int i; if (!is_power_of_2(dev->zone_size)) { - pr_err("null_blk: zone_size must be power-of-two\n"); + pr_err("zone_size must be power-of-two\n"); return -EINVAL; } @@ -31,7 +31,7 @@ int null_zone_init(struct nullb_device *dev) if (dev->zone_nr_conv >= dev->nr_zones) { dev->zone_nr_conv = dev->nr_zones - 1; - pr_info("null_blk: changed the number of conventional zones to %u", + pr_info("changed the number of conventional zones to %u", dev->zone_nr_conv); } @@ -84,7 +84,7 @@ int null_zone_report(struct gendisk *disk, sector_t sector, return 0; } -void null_zone_write(struct nullb_cmd *cmd, sector_t sector, +static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector, unsigned int nr_sectors) { struct nullb_device *dev = cmd->nq->dev; @@ -95,14 +95,12 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t sector, case BLK_ZONE_COND_FULL: /* Cannot write to a full zone */ cmd->error = BLK_STS_IOERR; - break; + return BLK_STS_IOERR; case BLK_ZONE_COND_EMPTY: case BLK_ZONE_COND_IMP_OPEN: /* Writes must be at the write pointer position */ - if (sector != zone->wp) { - cmd->error = BLK_STS_IOERR; - break; - } + if (sector != zone->wp) + return BLK_STS_IOERR; if (zone->cond == BLK_ZONE_COND_EMPTY) zone->cond = BLK_ZONE_COND_IMP_OPEN; @@ -115,22 +113,51 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t sector, break; default: /* Invalid zone condition */ - cmd->error = BLK_STS_IOERR; - break; + return BLK_STS_IOERR; } + return BLK_STS_OK; } -void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) +static blk_status_t null_zone_reset(struct nullb_cmd *cmd, sector_t sector) { struct nullb_device *dev = cmd->nq->dev; unsigned int zno = null_zone_no(dev, sector); struct blk_zone *zone = &dev->zones[zno]; + size_t i; + + switch (req_op(cmd->rq)) { + case REQ_OP_ZONE_RESET_ALL: + for (i = 0; i < dev->nr_zones; i++) { + if (zone[i].type == BLK_ZONE_TYPE_CONVENTIONAL) + continue; + zone[i].cond = BLK_ZONE_COND_EMPTY; + zone[i].wp = zone[i].start; + } + break; + case REQ_OP_ZONE_RESET: + if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) + return BLK_STS_IOERR; - if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) { - cmd->error = BLK_STS_IOERR; - return; + zone->cond = BLK_ZONE_COND_EMPTY; + zone->wp = zone->start; + break; + default: + cmd->error = BLK_STS_NOTSUPP; + break; } + return BLK_STS_OK; +} - zone->cond = BLK_ZONE_COND_EMPTY; - zone->wp = zone->start; +blk_status_t null_handle_zoned(struct nullb_cmd *cmd, enum req_opf op, + sector_t sector, sector_t nr_sectors) +{ + switch (op) { + case REQ_OP_WRITE: + return null_zone_write(cmd, sector, nr_sectors); + case REQ_OP_ZONE_RESET: + case REQ_OP_ZONE_RESET_ALL: + return null_zone_reset(cmd, sector); + default: + return BLK_STS_OK; + } } diff --git a/drivers/block/paride/pcd.c b/drivers/block/paride/pcd.c index 001dbdcbf355..636bfea2de6f 100644 --- a/drivers/block/paride/pcd.c +++ b/drivers/block/paride/pcd.c @@ -314,8 +314,8 @@ static void pcd_init_units(void) disk->queue = blk_mq_init_sq_queue(&cd->tag_set, &pcd_mq_ops, 1, BLK_MQ_F_SHOULD_MERGE); if (IS_ERR(disk->queue)) { - put_disk(disk); disk->queue = NULL; + put_disk(disk); continue; } @@ -723,9 +723,9 @@ static int pcd_detect(void) k = 0; if (pcd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */ cd = pcd; - if (pi_init(cd->pi, 1, -1, -1, -1, -1, -1, pcd_buffer, - PI_PCD, verbose, cd->name)) { - if (!pcd_probe(cd, -1, id) && cd->disk) { + if (cd->disk && pi_init(cd->pi, 1, -1, -1, -1, -1, -1, + pcd_buffer, PI_PCD, verbose, cd->name)) { + if (!pcd_probe(cd, -1, id)) { cd->present = 1; k++; } else @@ -736,11 +736,13 @@ static int pcd_detect(void) int *conf = *drives[unit]; if (!conf[D_PRT]) continue; + if (!cd->disk) + continue; if (!pi_init(cd->pi, 0, conf[D_PRT], conf[D_MOD], conf[D_UNI], conf[D_PRO], conf[D_DLY], pcd_buffer, PI_PCD, verbose, cd->name)) continue; - if (!pcd_probe(cd, conf[D_SLV], id) && cd->disk) { + if (!pcd_probe(cd, conf[D_SLV], id)) { cd->present = 1; k++; } else diff --git a/drivers/block/paride/pf.c b/drivers/block/paride/pf.c index 1e9c50a7256c..6b7d4cab3687 100644 --- a/drivers/block/paride/pf.c +++ b/drivers/block/paride/pf.c @@ -300,8 +300,8 @@ static void __init pf_init_units(void) disk->queue = blk_mq_init_sq_queue(&pf->tag_set, &pf_mq_ops, 1, BLK_MQ_F_SHOULD_MERGE); if (IS_ERR(disk->queue)) { - put_disk(disk); disk->queue = NULL; + put_disk(disk); continue; } diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c index a600934fdd9c..7543e395a2c6 100644 --- a/drivers/lightnvm/core.c +++ b/drivers/lightnvm/core.c @@ -4,6 +4,8 @@ * Initial release: Matias Bjorling <m@bjorling.me> */ +#define pr_fmt(fmt) "nvm: " fmt + #include <linux/list.h> #include <linux/types.h> #include <linux/sem.h> @@ -74,7 +76,7 @@ static int nvm_reserve_luns(struct nvm_dev *dev, int lun_begin, int lun_end) for (i = lun_begin; i <= lun_end; i++) { if (test_and_set_bit(i, dev->lun_map)) { - pr_err("nvm: lun %d already allocated\n", i); + pr_err("lun %d already allocated\n", i); goto err; } } @@ -264,7 +266,7 @@ static int nvm_config_check_luns(struct nvm_geo *geo, int lun_begin, int lun_end) { if (lun_begin > lun_end || lun_end >= geo->all_luns) { - pr_err("nvm: lun out of bound (%u:%u > %u)\n", + pr_err("lun out of bound (%u:%u > %u)\n", lun_begin, lun_end, geo->all_luns - 1); return -EINVAL; } @@ -297,7 +299,7 @@ static int __nvm_config_extended(struct nvm_dev *dev, if (e->op == 0xFFFF) { e->op = NVM_TARGET_DEFAULT_OP; } else if (e->op < NVM_TARGET_MIN_OP || e->op > NVM_TARGET_MAX_OP) { - pr_err("nvm: invalid over provisioning value\n"); + pr_err("invalid over provisioning value\n"); return -EINVAL; } @@ -334,23 +336,23 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create) e = create->conf.e; break; default: - pr_err("nvm: config type not valid\n"); + pr_err("config type not valid\n"); return -EINVAL; } tt = nvm_find_target_type(create->tgttype); if (!tt) { - pr_err("nvm: target type %s not found\n", create->tgttype); + pr_err("target type %s not found\n", create->tgttype); return -EINVAL; } if ((tt->flags & NVM_TGT_F_HOST_L2P) != (dev->geo.dom & NVM_RSP_L2P)) { - pr_err("nvm: device is incompatible with target L2P type.\n"); + pr_err("device is incompatible with target L2P type.\n"); return -EINVAL; } if (nvm_target_exists(create->tgtname)) { - pr_err("nvm: target name already exists (%s)\n", + pr_err("target name already exists (%s)\n", create->tgtname); return -EINVAL; } @@ -367,7 +369,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create) tgt_dev = nvm_create_tgt_dev(dev, e.lun_begin, e.lun_end, e.op); if (!tgt_dev) { - pr_err("nvm: could not create target device\n"); + pr_err("could not create target device\n"); ret = -ENOMEM; goto err_t; } @@ -493,8 +495,11 @@ static int nvm_remove_tgt(struct nvm_ioctl_remove *remove) } up_read(&nvm_lock); - if (!t) + if (!t) { + pr_err("failed to remove target %s\n", + remove->tgtname); return 1; + } __nvm_remove_target(t, true); kref_put(&dev->ref, nvm_free); @@ -686,7 +691,7 @@ static int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd, rqd->nr_ppas = nr_ppas; rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list); if (!rqd->ppa_list) { - pr_err("nvm: failed to allocate dma memory\n"); + pr_err("failed to allocate dma memory\n"); return -ENOMEM; } @@ -731,7 +736,7 @@ static int nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd) return flags; } -int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd) +int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd, void *buf) { struct nvm_dev *dev = tgt_dev->parent; int ret; @@ -745,19 +750,45 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd) rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd); /* In case of error, fail with right address format */ - ret = dev->ops->submit_io(dev, rqd); + ret = dev->ops->submit_io(dev, rqd, buf); if (ret) nvm_rq_dev_to_tgt(tgt_dev, rqd); return ret; } EXPORT_SYMBOL(nvm_submit_io); -int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd) +static void nvm_sync_end_io(struct nvm_rq *rqd) +{ + struct completion *waiting = rqd->private; + + complete(waiting); +} + +static int nvm_submit_io_wait(struct nvm_dev *dev, struct nvm_rq *rqd, + void *buf) +{ + DECLARE_COMPLETION_ONSTACK(wait); + int ret = 0; + + rqd->end_io = nvm_sync_end_io; + rqd->private = &wait; + + ret = dev->ops->submit_io(dev, rqd, buf); + if (ret) + return ret; + + wait_for_completion_io(&wait); + + return 0; +} + +int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd, + void *buf) { struct nvm_dev *dev = tgt_dev->parent; int ret; - if (!dev->ops->submit_io_sync) + if (!dev->ops->submit_io) return -ENODEV; nvm_rq_tgt_to_dev(tgt_dev, rqd); @@ -765,9 +796,7 @@ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd) rqd->dev = tgt_dev; rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd); - /* In case of error, fail with right address format */ - ret = dev->ops->submit_io_sync(dev, rqd); - nvm_rq_dev_to_tgt(tgt_dev, rqd); + ret = nvm_submit_io_wait(dev, rqd, buf); return ret; } @@ -788,12 +817,13 @@ EXPORT_SYMBOL(nvm_end_io); static int nvm_submit_io_sync_raw(struct nvm_dev *dev, struct nvm_rq *rqd) { - if (!dev->ops->submit_io_sync) + if (!dev->ops->submit_io) return -ENODEV; + rqd->dev = NULL; rqd->flags = nvm_set_flags(&dev->geo, rqd); - return dev->ops->submit_io_sync(dev, rqd); + return nvm_submit_io_wait(dev, rqd, NULL); } static int nvm_bb_chunk_sense(struct nvm_dev *dev, struct ppa_addr ppa) @@ -1048,7 +1078,7 @@ int nvm_set_chunk_meta(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas, return 0; if (nr_ppas > NVM_MAX_VLBA) { - pr_err("nvm: unable to update all blocks atomically\n"); + pr_err("unable to update all blocks atomically\n"); return -EINVAL; } @@ -1111,27 +1141,26 @@ static int nvm_init(struct nvm_dev *dev) int ret = -EINVAL; if (dev->ops->identity(dev)) { - pr_err("nvm: device could not be identified\n"); + pr_err("device could not be identified\n"); goto err; } - pr_debug("nvm: ver:%u.%u nvm_vendor:%x\n", - geo->major_ver_id, geo->minor_ver_id, - geo->vmnt); + pr_debug("ver:%u.%u nvm_vendor:%x\n", geo->major_ver_id, + geo->minor_ver_id, geo->vmnt); ret = nvm_core_init(dev); if (ret) { - pr_err("nvm: could not initialize core structures.\n"); + pr_err("could not initialize core structures.\n"); goto err; } - pr_info("nvm: registered %s [%u/%u/%u/%u/%u]\n", + pr_info("registered %s [%u/%u/%u/%u/%u]\n", dev->name, dev->geo.ws_min, dev->geo.ws_opt, dev->geo.num_chk, dev->geo.all_luns, dev->geo.num_ch); return 0; err: - pr_err("nvm: failed to initialize nvm\n"); + pr_err("failed to initialize nvm\n"); return ret; } @@ -1169,7 +1198,7 @@ int nvm_register(struct nvm_dev *dev) dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist", exp_pool_size); if (!dev->dma_pool) { - pr_err("nvm: could not create dma pool\n"); + pr_err("could not create dma pool\n"); kref_put(&dev->ref, nvm_free); return -ENOMEM; } @@ -1214,7 +1243,7 @@ static int __nvm_configure_create(struct nvm_ioctl_create *create) up_write(&nvm_lock); if (!dev) { - pr_err("nvm: device not found\n"); + pr_err("device not found\n"); return -EINVAL; } @@ -1288,7 +1317,7 @@ static long nvm_ioctl_get_devices(struct file *file, void __user *arg) i++; if (i > 31) { - pr_err("nvm: max 31 devices can be reported.\n"); + pr_err("max 31 devices can be reported.\n"); break; } } @@ -1315,7 +1344,7 @@ static long nvm_ioctl_dev_create(struct file *file, void __user *arg) if (create.conf.type == NVM_CONFIG_TYPE_EXTENDED && create.conf.e.rsv != 0) { - pr_err("nvm: reserved config field in use\n"); + pr_err("reserved config field in use\n"); return -EINVAL; } @@ -1331,7 +1360,7 @@ static long nvm_ioctl_dev_create(struct file *file, void __user *arg) flags &= ~NVM_TARGET_FACTORY; if (flags) { - pr_err("nvm: flag not supported\n"); + pr_err("flag not supported\n"); return -EINVAL; } } @@ -1349,7 +1378,7 @@ static long nvm_ioctl_dev_remove(struct file *file, void __user *arg) remove.tgtname[DISK_NAME_LEN - 1] = '\0'; if (remove.flags != 0) { - pr_err("nvm: no flags supported\n"); + pr_err("no flags supported\n"); return -EINVAL; } @@ -1365,7 +1394,7 @@ static long nvm_ioctl_dev_init(struct file *file, void __user *arg) return -EFAULT; if (init.flags != 0) { - pr_err("nvm: no flags supported\n"); + pr_err("no flags supported\n"); return -EINVAL; } diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index f546e6f28b8a..b413bafe93fd 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -507,7 +507,7 @@ void pblk_set_sec_per_write(struct pblk *pblk, int sec_per_write) pblk->sec_per_write = sec_per_write; } -int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd) +int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd, void *buf) { struct nvm_tgt_dev *dev = pblk->dev; @@ -518,7 +518,7 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd) return NVM_IO_ERR; #endif - return nvm_submit_io(dev, rqd); + return nvm_submit_io(dev, rqd, buf); } void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd) @@ -541,7 +541,7 @@ void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd) } } -int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd) +int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd, void *buf) { struct nvm_tgt_dev *dev = pblk->dev; int ret; @@ -553,7 +553,7 @@ int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd) return NVM_IO_ERR; #endif - ret = nvm_submit_io_sync(dev, rqd); + ret = nvm_submit_io_sync(dev, rqd, buf); if (trace_pblk_chunk_state_enabled() && !ret && rqd->opcode == NVM_OP_PWRITE) @@ -562,65 +562,19 @@ int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd) return ret; } -int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd) +static int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd, + void *buf) { struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd); int ret; pblk_down_chunk(pblk, ppa_list[0]); - ret = pblk_submit_io_sync(pblk, rqd); + ret = pblk_submit_io_sync(pblk, rqd, buf); pblk_up_chunk(pblk, ppa_list[0]); return ret; } -static void pblk_bio_map_addr_endio(struct bio *bio) -{ - bio_put(bio); -} - -struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data, - unsigned int nr_secs, unsigned int len, - int alloc_type, gfp_t gfp_mask) -{ - struct nvm_tgt_dev *dev = pblk->dev; - void *kaddr = data; - struct page *page; - struct bio *bio; - int i, ret; - - if (alloc_type == PBLK_KMALLOC_META) - return bio_map_kern(dev->q, kaddr, len, gfp_mask); - - bio = bio_kmalloc(gfp_mask, nr_secs); - if (!bio) - return ERR_PTR(-ENOMEM); - - for (i = 0; i < nr_secs; i++) { - page = vmalloc_to_page(kaddr); - if (!page) { - pblk_err(pblk, "could not map vmalloc bio\n"); - bio_put(bio); - bio = ERR_PTR(-ENOMEM); - goto out; - } - - ret = bio_add_pc_page(dev->q, bio, page, PAGE_SIZE, 0); - if (ret != PAGE_SIZE) { - pblk_err(pblk, "could not add page to bio\n"); - bio_put(bio); - bio = ERR_PTR(-ENOMEM); - goto out; - } - - kaddr += PAGE_SIZE; - } - - bio->bi_end_io = pblk_bio_map_addr_endio; -out: - return bio; -} - int pblk_calc_secs(struct pblk *pblk, unsigned long secs_avail, unsigned long secs_to_flush, bool skip_meta) { @@ -722,9 +676,7 @@ u64 pblk_line_smeta_start(struct pblk *pblk, struct pblk_line *line) int pblk_line_smeta_read(struct pblk *pblk, struct pblk_line *line) { - struct nvm_tgt_dev *dev = pblk->dev; struct pblk_line_meta *lm = &pblk->lm; - struct bio *bio; struct ppa_addr *ppa_list; struct nvm_rq rqd; u64 paddr = pblk_line_smeta_start(pblk, line); @@ -736,16 +688,6 @@ int pblk_line_smeta_read(struct pblk *pblk, struct pblk_line *line) if (ret) return ret; - bio = bio_map_kern(dev->q, line->smeta, lm->smeta_len, GFP_KERNEL); - if (IS_ERR(bio)) { - ret = PTR_ERR(bio); - goto clear_rqd; - } - - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_READ, 0); - - rqd.bio = bio; rqd.opcode = NVM_OP_PREAD; rqd.nr_ppas = lm->smeta_sec; rqd.is_seq = 1; @@ -754,10 +696,9 @@ int pblk_line_smeta_read(struct pblk *pblk, struct pblk_line *line) for (i = 0; i < lm->smeta_sec; i++, paddr++) ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id); - ret = pblk_submit_io_sync(pblk, &rqd); + ret = pblk_submit_io_sync(pblk, &rqd, line->smeta); if (ret) { pblk_err(pblk, "smeta I/O submission failed: %d\n", ret); - bio_put(bio); goto clear_rqd; } @@ -776,9 +717,7 @@ clear_rqd: static int pblk_line_smeta_write(struct pblk *pblk, struct pblk_line *line, u64 paddr) { - struct nvm_tgt_dev *dev = pblk->dev; struct pblk_line_meta *lm = &pblk->lm; - struct bio *bio; struct ppa_addr *ppa_list; struct nvm_rq rqd; __le64 *lba_list = emeta_to_lbas(pblk, line->emeta->buf); @@ -791,16 +730,6 @@ static int pblk_line_smeta_write(struct pblk *pblk, struct pblk_line *line, if (ret) return ret; - bio = bio_map_kern(dev->q, line->smeta, lm->smeta_len, GFP_KERNEL); - if (IS_ERR(bio)) { - ret = PTR_ERR(bio); - goto clear_rqd; - } - - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - rqd.bio = bio; rqd.opcode = NVM_OP_PWRITE; rqd.nr_ppas = lm->smeta_sec; rqd.is_seq = 1; @@ -814,10 +743,9 @@ static int pblk_line_smeta_write(struct pblk *pblk, struct pblk_line *line, meta->lba = lba_list[paddr] = addr_empty; } - ret = pblk_submit_io_sync_sem(pblk, &rqd); + ret = pblk_submit_io_sync_sem(pblk, &rqd, line->smeta); if (ret) { pblk_err(pblk, "smeta I/O submission failed: %d\n", ret); - bio_put(bio); goto clear_rqd; } @@ -838,10 +766,8 @@ int pblk_line_emeta_read(struct pblk *pblk, struct pblk_line *line, { struct nvm_tgt_dev *dev = pblk->dev; struct nvm_geo *geo = &dev->geo; - struct pblk_line_mgmt *l_mg = &pblk->l_mg; struct pblk_line_meta *lm = &pblk->lm; void *ppa_list_buf, *meta_list; - struct bio *bio; struct ppa_addr *ppa_list; struct nvm_rq rqd; u64 paddr = line->emeta_ssec; @@ -867,17 +793,6 @@ next_rq: rq_ppas = pblk_calc_secs(pblk, left_ppas, 0, false); rq_len = rq_ppas * geo->csecs; - bio = pblk_bio_map_addr(pblk, emeta_buf, rq_ppas, rq_len, - l_mg->emeta_alloc_type, GFP_KERNEL); - if (IS_ERR(bio)) { - ret = PTR_ERR(bio); - goto free_rqd_dma; - } - - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_READ, 0); - - rqd.bio = bio; rqd.meta_list = meta_list; rqd.ppa_list = ppa_list_buf; rqd.dma_meta_list = dma_meta_list; @@ -896,7 +811,6 @@ next_rq: while (test_bit(pos, line->blk_bitmap)) { paddr += min; if (pblk_boundary_paddr_checks(pblk, paddr)) { - bio_put(bio); ret = -EINTR; goto free_rqd_dma; } @@ -906,7 +820,6 @@ next_rq: } if (pblk_boundary_paddr_checks(pblk, paddr + min)) { - bio_put(bio); ret = -EINTR; goto free_rqd_dma; } @@ -915,10 +828,9 @@ next_rq: ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line_id); } - ret = pblk_submit_io_sync(pblk, &rqd); + ret = pblk_submit_io_sync(pblk, &rqd, emeta_buf); if (ret) { pblk_err(pblk, "emeta I/O submission failed: %d\n", ret); - bio_put(bio); goto free_rqd_dma; } @@ -963,7 +875,7 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct ppa_addr ppa) /* The write thread schedules erases so that it minimizes disturbances * with writes. Thus, there is no need to take the LUN semaphore. */ - ret = pblk_submit_io_sync(pblk, &rqd); + ret = pblk_submit_io_sync(pblk, &rqd, NULL); rqd.private = pblk; __pblk_end_io_erase(pblk, &rqd); @@ -1792,7 +1704,7 @@ int pblk_blk_erase_async(struct pblk *pblk, struct ppa_addr ppa) /* The write thread schedules erases so that it minimizes disturbances * with writes. Thus, there is no need to take the LUN semaphore. */ - err = pblk_submit_io(pblk, rqd); + err = pblk_submit_io(pblk, rqd, NULL); if (err) { struct nvm_tgt_dev *dev = pblk->dev; struct nvm_geo *geo = &dev->geo; @@ -1923,13 +1835,11 @@ void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line) static void pblk_save_lba_list(struct pblk *pblk, struct pblk_line *line) { struct pblk_line_meta *lm = &pblk->lm; - struct pblk_line_mgmt *l_mg = &pblk->l_mg; unsigned int lba_list_size = lm->emeta_len[2]; struct pblk_w_err_gc *w_err_gc = line->w_err_gc; struct pblk_emeta *emeta = line->emeta; - w_err_gc->lba_list = pblk_malloc(lba_list_size, - l_mg->emeta_alloc_type, GFP_KERNEL); + w_err_gc->lba_list = kvmalloc(lba_list_size, GFP_KERNEL); memcpy(w_err_gc->lba_list, emeta_to_lbas(pblk, emeta->buf), lba_list_size); } diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c index 63ee205b41c4..2581eebcfc41 100644 --- a/drivers/lightnvm/pblk-gc.c +++ b/drivers/lightnvm/pblk-gc.c @@ -132,14 +132,12 @@ static __le64 *get_lba_list_from_emeta(struct pblk *pblk, struct pblk_line *line) { struct line_emeta *emeta_buf; - struct pblk_line_mgmt *l_mg = &pblk->l_mg; struct pblk_line_meta *lm = &pblk->lm; unsigned int lba_list_size = lm->emeta_len[2]; __le64 *lba_list; int ret; - emeta_buf = pblk_malloc(lm->emeta_len[0], - l_mg->emeta_alloc_type, GFP_KERNEL); + emeta_buf = kvmalloc(lm->emeta_len[0], GFP_KERNEL); if (!emeta_buf) return NULL; @@ -147,7 +145,7 @@ static __le64 *get_lba_list_from_emeta(struct pblk *pblk, if (ret) { pblk_err(pblk, "line %d read emeta failed (%d)\n", line->id, ret); - pblk_mfree(emeta_buf, l_mg->emeta_alloc_type); + kvfree(emeta_buf); return NULL; } @@ -161,16 +159,16 @@ static __le64 *get_lba_list_from_emeta(struct pblk *pblk, if (ret) { pblk_err(pblk, "inconsistent emeta (line %d)\n", line->id); - pblk_mfree(emeta_buf, l_mg->emeta_alloc_type); + kvfree(emeta_buf); return NULL; } - lba_list = pblk_malloc(lba_list_size, - l_mg->emeta_alloc_type, GFP_KERNEL); + lba_list = kvmalloc(lba_list_size, GFP_KERNEL); + if (lba_list) memcpy(lba_list, emeta_to_lbas(pblk, emeta_buf), lba_list_size); - pblk_mfree(emeta_buf, l_mg->emeta_alloc_type); + kvfree(emeta_buf); return lba_list; } @@ -181,7 +179,6 @@ static void pblk_gc_line_prepare_ws(struct work_struct *work) ws); struct pblk *pblk = line_ws->pblk; struct pblk_line *line = line_ws->line; - struct pblk_line_mgmt *l_mg = &pblk->l_mg; struct pblk_line_meta *lm = &pblk->lm; struct nvm_tgt_dev *dev = pblk->dev; struct nvm_geo *geo = &dev->geo; @@ -272,7 +269,7 @@ next_rq: goto next_rq; out: - pblk_mfree(lba_list, l_mg->emeta_alloc_type); + kvfree(lba_list); kfree(line_ws); kfree(invalid_bitmap); @@ -286,7 +283,7 @@ fail_free_gc_data: fail_free_gc_rq: kfree(gc_rq); fail_free_lba_list: - pblk_mfree(lba_list, l_mg->emeta_alloc_type); + kvfree(lba_list); fail_free_invalid_bitmap: kfree(invalid_bitmap); fail_free_ws: diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c index b351c7f002de..9a967a2e83dd 100644 --- a/drivers/lightnvm/pblk-init.c +++ b/drivers/lightnvm/pblk-init.c @@ -543,7 +543,7 @@ static void pblk_line_mg_free(struct pblk *pblk) for (i = 0; i < PBLK_DATA_LINES; i++) { kfree(l_mg->sline_meta[i]); - pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type); + kvfree(l_mg->eline_meta[i]->buf); kfree(l_mg->eline_meta[i]); } @@ -560,7 +560,7 @@ static void pblk_line_meta_free(struct pblk_line_mgmt *l_mg, kfree(line->erase_bitmap); kfree(line->chks); - pblk_mfree(w_err_gc->lba_list, l_mg->emeta_alloc_type); + kvfree(w_err_gc->lba_list); kfree(w_err_gc); } @@ -890,29 +890,14 @@ static int pblk_line_mg_init(struct pblk *pblk) if (!emeta) goto fail_free_emeta; - if (lm->emeta_len[0] > KMALLOC_MAX_CACHE_SIZE) { - l_mg->emeta_alloc_type = PBLK_VMALLOC_META; - - emeta->buf = vmalloc(lm->emeta_len[0]); - if (!emeta->buf) { - kfree(emeta); - goto fail_free_emeta; - } - - emeta->nr_entries = lm->emeta_sec[0]; - l_mg->eline_meta[i] = emeta; - } else { - l_mg->emeta_alloc_type = PBLK_KMALLOC_META; - - emeta->buf = kmalloc(lm->emeta_len[0], GFP_KERNEL); - if (!emeta->buf) { - kfree(emeta); - goto fail_free_emeta; - } - - emeta->nr_entries = lm->emeta_sec[0]; - l_mg->eline_meta[i] = emeta; + emeta->buf = kvmalloc(lm->emeta_len[0], GFP_KERNEL); + if (!emeta->buf) { + kfree(emeta); + goto fail_free_emeta; } + + emeta->nr_entries = lm->emeta_sec[0]; + l_mg->eline_meta[i] = emeta; } for (i = 0; i < l_mg->nr_lines; i++) @@ -926,10 +911,7 @@ static int pblk_line_mg_init(struct pblk *pblk) fail_free_emeta: while (--i >= 0) { - if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META) - vfree(l_mg->eline_meta[i]->buf); - else - kfree(l_mg->eline_meta[i]->buf); + kvfree(l_mg->eline_meta[i]->buf); kfree(l_mg->eline_meta[i]); } diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c index d98ea392fe33..8efd14e683dc 100644 --- a/drivers/lightnvm/pblk-read.c +++ b/drivers/lightnvm/pblk-read.c @@ -342,7 +342,7 @@ split_retry: bio_put(int_bio); int_bio = bio_clone_fast(bio, GFP_KERNEL, &pblk_bio_set); goto split_retry; - } else if (pblk_submit_io(pblk, rqd)) { + } else if (pblk_submit_io(pblk, rqd, NULL)) { /* Submitting IO to drive failed, let's report an error */ rqd->error = -ENODEV; pblk_end_io_read(rqd); @@ -417,11 +417,7 @@ out: int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq) { - struct nvm_tgt_dev *dev = pblk->dev; - struct nvm_geo *geo = &dev->geo; - struct bio *bio; struct nvm_rq rqd; - int data_len; int ret = NVM_IO_OK; memset(&rqd, 0, sizeof(struct nvm_rq)); @@ -446,26 +442,12 @@ int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq) if (!(gc_rq->secs_to_gc)) goto out; - data_len = (gc_rq->secs_to_gc) * geo->csecs; - bio = pblk_bio_map_addr(pblk, gc_rq->data, gc_rq->secs_to_gc, data_len, - PBLK_VMALLOC_META, GFP_KERNEL); - if (IS_ERR(bio)) { - pblk_err(pblk, "could not allocate GC bio (%lu)\n", - PTR_ERR(bio)); - ret = PTR_ERR(bio); - goto err_free_dma; - } - - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_READ, 0); - rqd.opcode = NVM_OP_PREAD; rqd.nr_ppas = gc_rq->secs_to_gc; - rqd.bio = bio; - if (pblk_submit_io_sync(pblk, &rqd)) { + if (pblk_submit_io_sync(pblk, &rqd, gc_rq->data)) { ret = -EIO; - goto err_free_bio; + goto err_free_dma; } pblk_read_check_rand(pblk, &rqd, gc_rq->lba_list, gc_rq->nr_secs); @@ -489,8 +471,6 @@ out: pblk_free_rqd_meta(pblk, &rqd); return ret; -err_free_bio: - bio_put(bio); err_free_dma: pblk_free_rqd_meta(pblk, &rqd); return ret; diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c index e6dda04de144..299ef47a17b2 100644 --- a/drivers/lightnvm/pblk-recovery.c +++ b/drivers/lightnvm/pblk-recovery.c @@ -178,12 +178,11 @@ static int pblk_recov_pad_line(struct pblk *pblk, struct pblk_line *line, void *meta_list; struct pblk_pad_rq *pad_rq; struct nvm_rq *rqd; - struct bio *bio; struct ppa_addr *ppa_list; void *data; __le64 *lba_list = emeta_to_lbas(pblk, line->emeta->buf); u64 w_ptr = line->cur_sec; - int left_line_ppas, rq_ppas, rq_len; + int left_line_ppas, rq_ppas; int i, j; int ret = 0; @@ -212,28 +211,15 @@ next_pad_rq: goto fail_complete; } - rq_len = rq_ppas * geo->csecs; - - bio = pblk_bio_map_addr(pblk, data, rq_ppas, rq_len, - PBLK_VMALLOC_META, GFP_KERNEL); - if (IS_ERR(bio)) { - ret = PTR_ERR(bio); - goto fail_complete; - } - - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - rqd = pblk_alloc_rqd(pblk, PBLK_WRITE_INT); ret = pblk_alloc_rqd_meta(pblk, rqd); if (ret) { pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT); - bio_put(bio); goto fail_complete; } - rqd->bio = bio; + rqd->bio = NULL; rqd->opcode = NVM_OP_PWRITE; rqd->is_seq = 1; rqd->nr_ppas = rq_ppas; @@ -275,13 +261,12 @@ next_pad_rq: kref_get(&pad_rq->ref); pblk_down_chunk(pblk, ppa_list[0]); - ret = pblk_submit_io(pblk, rqd); + ret = pblk_submit_io(pblk, rqd, data); if (ret) { pblk_err(pblk, "I/O submission failed: %d\n", ret); pblk_up_chunk(pblk, ppa_list[0]); kref_put(&pad_rq->ref, pblk_recov_complete); pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT); - bio_put(bio); goto fail_complete; } @@ -375,13 +360,12 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line, struct ppa_addr *ppa_list; void *meta_list; struct nvm_rq *rqd; - struct bio *bio; void *data; dma_addr_t dma_ppa_list, dma_meta_list; __le64 *lba_list; u64 paddr = pblk_line_smeta_start(pblk, line) + lm->smeta_sec; bool padded = false; - int rq_ppas, rq_len; + int rq_ppas; int i, j; int ret; u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec; @@ -404,18 +388,9 @@ next_rq: rq_ppas = pblk_calc_secs(pblk, left_ppas, 0, false); if (!rq_ppas) rq_ppas = pblk->min_write_pgs; - rq_len = rq_ppas * geo->csecs; retry_rq: - bio = bio_map_kern(dev->q, data, rq_len, GFP_KERNEL); - if (IS_ERR(bio)) - return PTR_ERR(bio); - - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_READ, 0); - bio_get(bio); - - rqd->bio = bio; + rqd->bio = NULL; rqd->opcode = NVM_OP_PREAD; rqd->meta_list = meta_list; rqd->nr_ppas = rq_ppas; @@ -445,10 +420,9 @@ retry_rq: addr_to_gen_ppa(pblk, paddr + j, line->id); } - ret = pblk_submit_io_sync(pblk, rqd); + ret = pblk_submit_io_sync(pblk, rqd, data); if (ret) { pblk_err(pblk, "I/O submission failed: %d\n", ret); - bio_put(bio); return ret; } @@ -460,24 +434,20 @@ retry_rq: if (padded) { pblk_log_read_err(pblk, rqd); - bio_put(bio); return -EINTR; } pad_distance = pblk_pad_distance(pblk, line); ret = pblk_recov_pad_line(pblk, line, pad_distance); if (ret) { - bio_put(bio); return ret; } padded = true; - bio_put(bio); goto retry_rq; } pblk_get_packed_meta(pblk, rqd); - bio_put(bio); for (i = 0; i < rqd->nr_ppas; i++) { struct pblk_sec_meta *meta = pblk_get_meta(pblk, meta_list, i); diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c index 4e63f9b5954c..b9a2aeba95ab 100644 --- a/drivers/lightnvm/pblk-write.c +++ b/drivers/lightnvm/pblk-write.c @@ -373,7 +373,6 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line) struct pblk_emeta *emeta = meta_line->emeta; struct ppa_addr *ppa_list; struct pblk_g_ctx *m_ctx; - struct bio *bio; struct nvm_rq *rqd; void *data; u64 paddr; @@ -391,20 +390,9 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line) rq_len = rq_ppas * geo->csecs; data = ((void *)emeta->buf) + emeta->mem; - bio = pblk_bio_map_addr(pblk, data, rq_ppas, rq_len, - l_mg->emeta_alloc_type, GFP_KERNEL); - if (IS_ERR(bio)) { - pblk_err(pblk, "failed to map emeta io"); - ret = PTR_ERR(bio); - goto fail_free_rqd; - } - bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - rqd->bio = bio; - ret = pblk_alloc_w_rq(pblk, rqd, rq_ppas, pblk_end_io_write_meta); if (ret) - goto fail_free_bio; + goto fail_free_rqd; ppa_list = nvm_rq_to_ppa_list(rqd); for (i = 0; i < rqd->nr_ppas; ) { @@ -423,7 +411,7 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line) pblk_down_chunk(pblk, ppa_list[0]); - ret = pblk_submit_io(pblk, rqd); + ret = pblk_submit_io(pblk, rqd, data); if (ret) { pblk_err(pblk, "emeta I/O submission failed: %d\n", ret); goto fail_rollback; @@ -437,8 +425,6 @@ fail_rollback: pblk_dealloc_page(pblk, meta_line, rq_ppas); list_add(&meta_line->list, &meta_line->list); spin_unlock(&l_mg->close_lock); -fail_free_bio: - bio_put(bio); fail_free_rqd: pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT); return ret; @@ -523,7 +509,7 @@ static int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd) meta_line = pblk_should_submit_meta_io(pblk, rqd); /* Submit data write for current data line */ - err = pblk_submit_io(pblk, rqd); + err = pblk_submit_io(pblk, rqd, NULL); if (err) { pblk_err(pblk, "data I/O submission failed: %d\n", err); return NVM_IO_ERR; diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h index a67855387f53..86ffa875bfe1 100644 --- a/drivers/lightnvm/pblk.h +++ b/drivers/lightnvm/pblk.h @@ -482,11 +482,6 @@ struct pblk_line { #define PBLK_DATA_LINES 4 enum { - PBLK_KMALLOC_META = 1, - PBLK_VMALLOC_META = 2, -}; - -enum { PBLK_EMETA_TYPE_HEADER = 1, /* struct line_emeta first sector */ PBLK_EMETA_TYPE_LLBA = 2, /* lba list - type: __le64 */ PBLK_EMETA_TYPE_VSC = 3, /* vsc list - type: __le32 */ @@ -521,9 +516,6 @@ struct pblk_line_mgmt { __le32 *vsc_list; /* Valid sector counts for all lines */ - /* Metadata allocation type: VMALLOC | KMALLOC */ - int emeta_alloc_type; - /* Pre-allocated metadata for data lines */ struct pblk_smeta *sline_meta[PBLK_DATA_LINES]; struct pblk_emeta *eline_meta[PBLK_DATA_LINES]; @@ -783,14 +775,10 @@ struct nvm_chk_meta *pblk_chunk_get_off(struct pblk *pblk, struct ppa_addr ppa); void pblk_log_write_err(struct pblk *pblk, struct nvm_rq *rqd); void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd); -int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd); -int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd); -int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd); +int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd, void *buf); +int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd, void *buf); int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line); void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd); -struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data, - unsigned int nr_secs, unsigned int len, - int alloc_type, gfp_t gfp_mask); struct pblk_line *pblk_line_get(struct pblk *pblk); struct pblk_line *pblk_line_get_first_data(struct pblk *pblk); struct pblk_line *pblk_line_replace_data(struct pblk *pblk); @@ -938,21 +926,6 @@ void pblk_rl_werr_line_out(struct pblk_rl *rl); int pblk_sysfs_init(struct gendisk *tdisk); void pblk_sysfs_exit(struct gendisk *tdisk); -static inline void *pblk_malloc(size_t size, int type, gfp_t flags) -{ - if (type == PBLK_KMALLOC_META) - return kmalloc(size, flags); - return vmalloc(size); -} - -static inline void pblk_mfree(void *ptr, int type) -{ - if (type == PBLK_KMALLOC_META) - kfree(ptr); - else - vfree(ptr); -} - static inline struct nvm_rq *nvm_rq_from_c_ctx(void *c_ctx) { return c_ctx - sizeof(struct nvm_rq); diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c index 73f5319295bc..c12cd809ab19 100644 --- a/drivers/md/bcache/closure.c +++ b/drivers/md/bcache/closure.c @@ -105,8 +105,14 @@ struct closure_syncer { static void closure_sync_fn(struct closure *cl) { - cl->s->done = 1; - wake_up_process(cl->s->task); + struct closure_syncer *s = cl->s; + struct task_struct *p; + + rcu_read_lock(); + p = READ_ONCE(s->task); + s->done = 1; + wake_up_process(p); + rcu_read_unlock(); } void __sched __closure_sync(struct closure *cl) diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 8b123be05254..336f43910383 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -178,10 +178,9 @@ static ssize_t bch_dump_read(struct file *file, char __user *buf, while (size) { struct keybuf_key *w; unsigned int bytes = min(i->bytes, size); - int err = copy_to_user(buf, i->buf, bytes); - if (err) - return err; + if (copy_to_user(buf, i->buf, bytes)) + return -EFAULT; ret += bytes; buf += bytes; diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index e2059af90791..627dcea0f5b6 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -964,6 +964,7 @@ KTYPE(bch_cache_set_internal); static int __bch_cache_cmp(const void *l, const void *r) { + cond_resched(); return *((uint16_t *)r) - *((uint16_t *)l); } diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index c9e44ac1f9a6..3f8577e2c13b 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -408,6 +408,7 @@ static int map_request(struct dm_rq_target_io *tio) ret = dm_dispatch_clone_request(clone, rq); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { blk_rq_unprep_clone(clone); + blk_mq_cleanup_rq(clone); tio->ti->type->release_clone_rq(clone, &tio->info); tio->clone = NULL; return DM_MAPIO_REQUEUE; @@ -562,7 +563,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t) if (err) goto out_kfree_tag_set; - q = blk_mq_init_allocated_queue(md->tag_set, md->queue); + q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true); if (IS_ERR(q)) { err = PTR_ERR(q); goto out_tag_set; diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c index 7354466ddc90..c766c559d36d 100644 --- a/drivers/md/md-linear.c +++ b/drivers/md/md-linear.c @@ -258,6 +258,11 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio) bio_sector < start_sector)) goto out_of_bounds; + if (unlikely(is_mddev_broken(tmp_dev->rdev, "linear"))) { + bio_io_error(bio); + return true; + } + if (unlikely(bio_end_sector(bio) > end_sector)) { /* This bio crosses a device boundary, so we have to split it */ struct bio *split = bio_split(bio, end_sector - bio_sector, diff --git a/drivers/md/md.c b/drivers/md/md.c index 24638ccedce4..1be7abeb24fd 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -376,6 +376,11 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio) struct mddev *mddev = q->queuedata; unsigned int sectors; + if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) { + bio_io_error(bio); + return BLK_QC_T_NONE; + } + blk_queue_split(q, &bio); if (mddev == NULL || mddev->pers == NULL) { @@ -1232,6 +1237,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) mddev->new_layout = mddev->layout; mddev->new_chunk_sectors = mddev->chunk_sectors; } + if (mddev->level == 0) + mddev->layout = -1; if (sb->state & (1<<MD_SB_CLEAN)) mddev->recovery_cp = MaxSector; @@ -1647,6 +1654,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset; } + if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) && + sb->level != 0) + return -EINVAL; + if (!refdev) { ret = 1; } else { @@ -1757,6 +1768,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) mddev->new_chunk_sectors = mddev->chunk_sectors; } + if (mddev->level == 0 && + !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT)) + mddev->layout = -1; + if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL) set_bit(MD_HAS_JOURNAL, &mddev->flags); @@ -1826,8 +1841,15 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) if (!(le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_BITMAP)) rdev->saved_raid_disk = -1; - } else - set_bit(In_sync, &rdev->flags); + } else { + /* + * If the array is FROZEN, then the device can't + * be in_sync with rest of array. + */ + if (!test_bit(MD_RECOVERY_FROZEN, + &mddev->recovery)) + set_bit(In_sync, &rdev->flags); + } rdev->raid_disk = role; break; } @@ -3664,11 +3686,7 @@ int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale) return -EINVAL; if (decimals < 0) decimals = 0; - while (decimals < scale) { - result *= 10; - decimals ++; - } - *res = result; + *res = result * int_pow(10, scale - decimals); return 0; } @@ -4155,12 +4173,17 @@ __ATTR_PREALLOC(resync_start, S_IRUGO|S_IWUSR, * active-idle * like active, but no writes have been seen for a while (100msec). * + * broken + * RAID0/LINEAR-only: same as clean, but array is missing a member. + * It's useful because RAID0/LINEAR mounted-arrays aren't stopped + * when a member is gone, so this state will at least alert the + * user that something is wrong. */ enum array_state { clear, inactive, suspended, readonly, read_auto, clean, active, - write_pending, active_idle, bad_word}; + write_pending, active_idle, broken, bad_word}; static char *array_states[] = { "clear", "inactive", "suspended", "readonly", "read-auto", "clean", "active", - "write-pending", "active-idle", NULL }; + "write-pending", "active-idle", "broken", NULL }; static int match_word(const char *word, char **list) { @@ -4176,7 +4199,7 @@ array_state_show(struct mddev *mddev, char *page) { enum array_state st = inactive; - if (mddev->pers) + if (mddev->pers && !test_bit(MD_NOT_READY, &mddev->flags)) { switch(mddev->ro) { case 1: st = readonly; @@ -4196,7 +4219,10 @@ array_state_show(struct mddev *mddev, char *page) st = active; spin_unlock(&mddev->lock); } - else { + + if (test_bit(MD_BROKEN, &mddev->flags) && st == clean) + st = broken; + } else { if (list_empty(&mddev->disks) && mddev->raid_disks == 0 && mddev->dev_sectors == 0) @@ -4310,6 +4336,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len) break; case write_pending: case active_idle: + case broken: /* these cannot be set */ break; } @@ -5182,6 +5209,34 @@ static struct md_sysfs_entry md_consistency_policy = __ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show, consistency_policy_store); +static ssize_t fail_last_dev_show(struct mddev *mddev, char *page) +{ + return sprintf(page, "%d\n", mddev->fail_last_dev); +} + +/* + * Setting fail_last_dev to true to allow last device to be forcibly removed + * from RAID1/RAID10. + */ +static ssize_t +fail_last_dev_store(struct mddev *mddev, const char *buf, size_t len) +{ + int ret; + bool value; + + ret = kstrtobool(buf, &value); + if (ret) + return ret; + + if (value != mddev->fail_last_dev) + mddev->fail_last_dev = value; + + return len; +} +static struct md_sysfs_entry md_fail_last_dev = +__ATTR(fail_last_dev, S_IRUGO | S_IWUSR, fail_last_dev_show, + fail_last_dev_store); + static struct attribute *md_default_attrs[] = { &md_level.attr, &md_layout.attr, @@ -5198,6 +5253,7 @@ static struct attribute *md_default_attrs[] = { &md_array_size.attr, &max_corr_read_errors.attr, &md_consistency_policy.attr, + &md_fail_last_dev.attr, NULL, }; @@ -5744,9 +5800,6 @@ int md_run(struct mddev *mddev) md_update_sb(mddev, 0); md_new_event(mddev); - sysfs_notify_dirent_safe(mddev->sysfs_state); - sysfs_notify_dirent_safe(mddev->sysfs_action); - sysfs_notify(&mddev->kobj, NULL, "degraded"); return 0; bitmap_abort: @@ -5767,6 +5820,7 @@ static int do_md_run(struct mddev *mddev) { int err; + set_bit(MD_NOT_READY, &mddev->flags); err = md_run(mddev); if (err) goto out; @@ -5787,9 +5841,14 @@ static int do_md_run(struct mddev *mddev) set_capacity(mddev->gendisk, mddev->array_sectors); revalidate_disk(mddev->gendisk); + clear_bit(MD_NOT_READY, &mddev->flags); mddev->changed = 1; kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE); + sysfs_notify_dirent_safe(mddev->sysfs_state); + sysfs_notify_dirent_safe(mddev->sysfs_action); + sysfs_notify(&mddev->kobj, NULL, "degraded"); out: + clear_bit(MD_NOT_READY, &mddev->flags); return err; } @@ -6849,6 +6908,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info) mddev->external = 0; mddev->layout = info->layout; + if (mddev->level == 0) + /* Cannot trust RAID0 layout info here */ + mddev->layout = -1; mddev->chunk_sectors = info->chunk_size >> 9; if (mddev->persistent) { @@ -8900,6 +8962,7 @@ void md_check_recovery(struct mddev *mddev) if (mddev_trylock(mddev)) { int spares = 0; + bool try_set_sync = mddev->safemode != 0; if (!mddev->external && mddev->safemode == 1) mddev->safemode = 0; @@ -8945,7 +9008,7 @@ void md_check_recovery(struct mddev *mddev) } } - if (!mddev->external && !mddev->in_sync) { + if (try_set_sync && !mddev->external && !mddev->in_sync) { spin_lock(&mddev->lock); set_in_sync(mddev); spin_unlock(&mddev->lock); @@ -9043,7 +9106,8 @@ void md_reap_sync_thread(struct mddev *mddev) /* resync has finished, collect result */ md_unregister_thread(&mddev->sync_thread); if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) && - !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { + !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) && + mddev->degraded != mddev->raid_disks) { /* success...*/ /* activate any spares */ if (mddev->pers->spare_active(mddev)) { diff --git a/drivers/md/md.h b/drivers/md/md.h index 10f98200e2f8..c5e3ff398b59 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -248,6 +248,12 @@ enum mddev_flags { MD_UPDATING_SB, /* md_check_recovery is updating the metadata * without explicitly holding reconfig_mutex. */ + MD_NOT_READY, /* do_md_run() is active, so 'array_state' + * must not report that array is ready yet + */ + MD_BROKEN, /* This is used in RAID-0/LINEAR only, to stop + * I/O in case an array member is gone/failed. + */ }; enum mddev_sb_flags { @@ -487,6 +493,7 @@ struct mddev { unsigned int good_device_nr; /* good device num within cluster raid */ bool has_superblocks:1; + bool fail_last_dev:1; }; enum recovery_flags { @@ -735,6 +742,19 @@ extern void mddev_create_wb_pool(struct mddev *mddev, struct md_rdev *rdev, struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr); struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev); +static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type) +{ + int flags = rdev->bdev->bd_disk->flags; + + if (!(flags & GENHD_FL_UP)) { + if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) + pr_warn("md: %s: %s array has a missing/failed member\n", + mdname(rdev->mddev), md_type); + return true; + } + return false; +} + static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev) { int faulty = test_bit(Faulty, &rdev->flags); diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index bf5cf184a260..f61693e59684 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -19,6 +19,9 @@ #include "raid0.h" #include "raid5.h" +static int default_layout = 0; +module_param(default_layout, int, 0644); + #define UNSUPPORTED_MDDEV_FLAGS \ ((1L << MD_HAS_JOURNAL) | \ (1L << MD_JOURNAL_CLEAN) | \ @@ -139,6 +142,22 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) } pr_debug("md/raid0:%s: FINAL %d zones\n", mdname(mddev), conf->nr_strip_zones); + + if (conf->nr_strip_zones == 1) { + conf->layout = RAID0_ORIG_LAYOUT; + } else if (mddev->layout == RAID0_ORIG_LAYOUT || + mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) { + conf->layout = mddev->layout; + } else if (default_layout == RAID0_ORIG_LAYOUT || + default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { + conf->layout = default_layout; + } else { + pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n", + mdname(mddev)); + pr_err("md/raid0: please set raid.default_layout to 1 or 2\n"); + err = -ENOTSUPP; + goto abort; + } /* * now since we have the hard sector sizes, we can make sure * chunk size is a multiple of that sector size @@ -547,10 +566,12 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) static bool raid0_make_request(struct mddev *mddev, struct bio *bio) { + struct r0conf *conf = mddev->private; struct strip_zone *zone; struct md_rdev *tmp_dev; sector_t bio_sector; sector_t sector; + sector_t orig_sector; unsigned chunk_sects; unsigned sectors; @@ -584,8 +605,26 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) bio = split; } + orig_sector = sector; zone = find_zone(mddev->private, §or); - tmp_dev = map_sector(mddev, zone, sector, §or); + switch (conf->layout) { + case RAID0_ORIG_LAYOUT: + tmp_dev = map_sector(mddev, zone, orig_sector, §or); + break; + case RAID0_ALT_MULTIZONE_LAYOUT: + tmp_dev = map_sector(mddev, zone, sector, §or); + break; + default: + WARN("md/raid0:%s: Invalid layout\n", mdname(mddev)); + bio_io_error(bio); + return true; + } + + if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) { + bio_io_error(bio); + return true; + } + bio_set_dev(bio, tmp_dev->bdev); bio->bi_iter.bi_sector = sector + zone->dev_start + tmp_dev->data_offset; diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h index 540e65d92642..3816e5477db1 100644 --- a/drivers/md/raid0.h +++ b/drivers/md/raid0.h @@ -8,11 +8,25 @@ struct strip_zone { int nb_dev; /* # of devices attached to the zone */ }; +/* Linux 3.14 (20d0189b101) made an unintended change to + * the RAID0 layout for multi-zone arrays (where devices aren't all + * the same size. + * RAID0_ORIG_LAYOUT restores the original layout + * RAID0_ALT_MULTIZONE_LAYOUT uses the altered layout + * The layouts are identical when there is only one zone (all + * devices the same size). + */ + +enum r0layout { + RAID0_ORIG_LAYOUT = 1, + RAID0_ALT_MULTIZONE_LAYOUT = 2, +}; struct r0conf { struct strip_zone *strip_zone; struct md_rdev **devlist; /* lists of rdevs, pointed to * by strip_zone->dev */ int nr_strip_zones; + enum r0layout layout; }; #endif diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 34e26834ad28..0466ee2453b4 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -447,19 +447,21 @@ static void raid1_end_write_request(struct bio *bio) /* We never try FailFast to WriteMostly devices */ !test_bit(WriteMostly, &rdev->flags)) { md_error(r1_bio->mddev, rdev); - if (!test_bit(Faulty, &rdev->flags)) - /* This is the only remaining device, - * We need to retry the write without - * FailFast - */ - set_bit(R1BIO_WriteError, &r1_bio->state); - else { - /* Finished with this branch */ - r1_bio->bios[mirror] = NULL; - to_put = bio; - } - } else + } + + /* + * When the device is faulty, it is not necessary to + * handle write error. + * For failfast, this is the only remaining device, + * We need to retry the write without FailFast. + */ + if (!test_bit(Faulty, &rdev->flags)) set_bit(R1BIO_WriteError, &r1_bio->state); + else { + /* Finished with this branch */ + r1_bio->bios[mirror] = NULL; + to_put = bio; + } } else { /* * Set R1BIO_Uptodate in our master bio, so that we @@ -872,8 +874,11 @@ static void flush_pending_writes(struct r1conf *conf) * backgroup IO calls must call raise_barrier. Once that returns * there is no normal IO happeing. It must arrange to call * lower_barrier when the particular background IO completes. + * + * If resync/recovery is interrupted, returns -EINTR; + * Otherwise, returns 0. */ -static sector_t raise_barrier(struct r1conf *conf, sector_t sector_nr) +static int raise_barrier(struct r1conf *conf, sector_t sector_nr) { int idx = sector_to_idx(sector_nr); @@ -1612,12 +1617,12 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) /* * If it is not operational, then we have already marked it as dead - * else if it is the last working disks, ignore the error, let the - * next level up know. + * else if it is the last working disks with "fail_last_dev == false", + * ignore the error, let the next level up know. * else mark the drive as failed */ spin_lock_irqsave(&conf->device_lock, flags); - if (test_bit(In_sync, &rdev->flags) + if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev && (conf->raid_disks - mddev->degraded) == 1) { /* * Don't fail the drive, act as though we were just a @@ -1901,6 +1906,22 @@ static void abort_sync_write(struct mddev *mddev, struct r1bio *r1_bio) } while (sectors_to_go > 0); } +static void put_sync_write_buf(struct r1bio *r1_bio, int uptodate) +{ + if (atomic_dec_and_test(&r1_bio->remaining)) { + struct mddev *mddev = r1_bio->mddev; + int s = r1_bio->sectors; + + if (test_bit(R1BIO_MadeGood, &r1_bio->state) || + test_bit(R1BIO_WriteError, &r1_bio->state)) + reschedule_retry(r1_bio); + else { + put_buf(r1_bio); + md_done_sync(mddev, s, uptodate); + } + } +} + static void end_sync_write(struct bio *bio) { int uptodate = !bio->bi_status; @@ -1927,16 +1948,7 @@ static void end_sync_write(struct bio *bio) ) set_bit(R1BIO_MadeGood, &r1_bio->state); - if (atomic_dec_and_test(&r1_bio->remaining)) { - int s = r1_bio->sectors; - if (test_bit(R1BIO_MadeGood, &r1_bio->state) || - test_bit(R1BIO_WriteError, &r1_bio->state)) - reschedule_retry(r1_bio); - else { - put_buf(r1_bio); - md_done_sync(mddev, s, uptodate); - } - } + put_sync_write_buf(r1_bio, uptodate); } static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector, @@ -2219,17 +2231,7 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio) generic_make_request(wbio); } - if (atomic_dec_and_test(&r1_bio->remaining)) { - /* if we're here, all write(s) have completed, so clean up */ - int s = r1_bio->sectors; - if (test_bit(R1BIO_MadeGood, &r1_bio->state) || - test_bit(R1BIO_WriteError, &r1_bio->state)) - reschedule_retry(r1_bio); - else { - put_buf(r1_bio); - md_done_sync(mddev, s, 1); - } - } + put_sync_write_buf(r1_bio, 1); } /* @@ -3127,6 +3129,13 @@ static int raid1_run(struct mddev *mddev) !test_bit(In_sync, &conf->mirrors[i].rdev->flags) || test_bit(Faulty, &conf->mirrors[i].rdev->flags)) mddev->degraded++; + /* + * RAID1 needs at least one disk in active + */ + if (conf->raid_disks - mddev->degraded < 1) { + ret = -EINVAL; + goto abort; + } if (conf->raid_disks - mddev->degraded == 1) mddev->recovery_cp = MaxSector; @@ -3160,8 +3169,12 @@ static int raid1_run(struct mddev *mddev) ret = md_integrity_register(mddev); if (ret) { md_unregister_thread(&mddev->thread); - raid1_free(mddev, conf); + goto abort; } + return 0; + +abort: + raid1_free(mddev, conf); return ret; } diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 8a1354a08a1a..299c7b1c9718 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -465,19 +465,21 @@ static void raid10_end_write_request(struct bio *bio) if (test_bit(FailFast, &rdev->flags) && (bio->bi_opf & MD_FAILFAST)) { md_error(rdev->mddev, rdev); - if (!test_bit(Faulty, &rdev->flags)) - /* This is the only remaining device, - * We need to retry the write without - * FailFast - */ - set_bit(R10BIO_WriteError, &r10_bio->state); - else { - r10_bio->devs[slot].bio = NULL; - to_put = bio; - dec_rdev = 1; - } - } else + } + + /* + * When the device is faulty, it is not necessary to + * handle write error. + * For failfast, this is the only remaining device, + * We need to retry the write without FailFast. + */ + if (!test_bit(Faulty, &rdev->flags)) set_bit(R10BIO_WriteError, &r10_bio->state); + else { + r10_bio->devs[slot].bio = NULL; + to_put = bio; + dec_rdev = 1; + } } } else { /* @@ -1638,12 +1640,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) /* * If it is not operational, then we have already marked it as dead - * else if it is the last working disks, ignore the error, let the - * next level up know. + * else if it is the last working disks with "fail_last_dev == false", + * ignore the error, let the next level up know. * else mark the drive as failed */ spin_lock_irqsave(&conf->device_lock, flags); - if (test_bit(In_sync, &rdev->flags) + if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev && !enough(conf, rdev->raid_disk)) { /* * Don't fail the drive, just return an IO error. diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3de4e13bde98..223e97ab27e6 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2526,7 +2526,8 @@ static void raid5_end_read_request(struct bio * bi) int set_bad = 0; clear_bit(R5_UPTODATE, &sh->dev[i].flags); - atomic_inc(&rdev->read_errors); + if (!(bi->bi_status == BLK_STS_PROTECTION)) + atomic_inc(&rdev->read_errors); if (test_bit(R5_ReadRepl, &sh->dev[i].flags)) pr_warn_ratelimited( "md/raid:%s: read error on replacement device (sector %llu on %s).\n", @@ -2549,16 +2550,24 @@ static void raid5_end_read_request(struct bio * bi) (unsigned long long)s, bdn); } else if (atomic_read(&rdev->read_errors) - > conf->max_nr_stripes) - pr_warn("md/raid:%s: Too many read errors, failing device %s.\n", - mdname(conf->mddev), bdn); - else + > conf->max_nr_stripes) { + if (!test_bit(Faulty, &rdev->flags)) { + pr_warn("md/raid:%s: %d read_errors > %d stripes\n", + mdname(conf->mddev), + atomic_read(&rdev->read_errors), + conf->max_nr_stripes); + pr_warn("md/raid:%s: Too many read errors, failing device %s.\n", + mdname(conf->mddev), bdn); + } + } else retry = 1; if (set_bad && test_bit(In_sync, &rdev->flags) && !test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) retry = 1; if (retry) - if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) { + if (sh->qd_idx >= 0 && sh->pd_idx == i) + set_bit(R5_ReadError, &sh->dev[i].flags); + else if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) { set_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReadNoMerge, &sh->dev[i].flags); } else @@ -4612,7 +4621,6 @@ static void break_stripe_batch_list(struct stripe_head *head_sh, (1 << STRIPE_FULL_WRITE) | (1 << STRIPE_BIOFILL_RUN) | (1 << STRIPE_COMPUTE_RUN) | - (1 << STRIPE_OPS_REQ_PENDING) | (1 << STRIPE_DISCARD) | (1 << STRIPE_BATCH_READY) | (1 << STRIPE_BATCH_ERR) | @@ -5491,7 +5499,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi) return; logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1); - last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9); + last_sector = bio_end_sector(bi); bi->bi_next = NULL; @@ -5718,7 +5726,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) do_flush = false; } - set_bit(STRIPE_HANDLE, &sh->state); + if (!sh->batch_head) + set_bit(STRIPE_HANDLE, &sh->state); clear_bit(STRIPE_DELAYED, &sh->state); if ((!sh->batch_head || sh == sh->batch_head) && (bi->bi_opf & REQ_SYNC) && diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index cf991f13403e..f90e0704bed9 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -357,7 +357,6 @@ enum { STRIPE_FULL_WRITE, /* all blocks are set to be overwritten */ STRIPE_BIOFILL_RUN, STRIPE_COMPUTE_RUN, - STRIPE_OPS_REQ_PENDING, STRIPE_ON_UNPLUG_LIST, STRIPE_DISCARD, STRIPE_ON_RELEASE_LIST, @@ -493,9 +492,7 @@ struct disk_info { */ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector) { - int sectors = bio_sectors(bio); - - if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS) + if (bio_end_sector(bio) < sector + STRIPE_SECTORS) return bio->bi_next; else return NULL; diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index ec43ac9199e2..2b36f052bfb9 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -64,6 +64,7 @@ config NVME_TCP depends on INET depends on BLK_DEV_NVME select NVME_FABRICS + select CRYPTO_CRC32C help This provides support for the NVMe over Fabrics protocol using the TCP transport. This allows you to use remote block devices diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index d3d6b7bd6903..1ede1763a5ee 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -22,12 +22,12 @@ #include <linux/pm_qos.h> #include <asm/unaligned.h> -#define CREATE_TRACE_POINTS -#include "trace.h" - #include "nvme.h" #include "fabrics.h" +#define CREATE_TRACE_POINTS +#include "trace.h" + #define NVME_MINORS (1U << MINORBITS) unsigned int admin_timeout = 60; @@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq); struct workqueue_struct *nvme_delete_wq; EXPORT_SYMBOL_GPL(nvme_delete_wq); -static DEFINE_IDA(nvme_subsystems_ida); static LIST_HEAD(nvme_subsystems); static DEFINE_MUTEX(nvme_subsystems_lock); @@ -197,9 +196,9 @@ static inline bool nvme_ns_has_pi(struct nvme_ns *ns) return ns->pi_type && ns->ms == sizeof(struct t10_pi_tuple); } -static blk_status_t nvme_error_status(struct request *req) +static blk_status_t nvme_error_status(u16 status) { - switch (nvme_req(req)->status & 0x7ff) { + switch (status & 0x7ff) { case NVME_SC_SUCCESS: return BLK_STS_OK; case NVME_SC_CAP_EXCEEDED: @@ -226,6 +225,8 @@ static blk_status_t nvme_error_status(struct request *req) return BLK_STS_PROTECTION; case NVME_SC_RESERVATION_CONFLICT: return BLK_STS_NEXUS; + case NVME_SC_HOST_PATH_ERROR: + return BLK_STS_TRANSPORT; default: return BLK_STS_IOERR; } @@ -260,7 +261,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(req); + blk_status_t status = nvme_error_status(nvme_req(req)->status); trace_nvme_complete_rq(req); @@ -279,6 +280,8 @@ void nvme_complete_rq(struct request *req) return; } } + + nvme_trace_bio_complete(req, status); blk_mq_end_request(req, status); } EXPORT_SYMBOL_GPL(nvme_complete_rq); @@ -288,8 +291,12 @@ bool nvme_cancel_request(struct request *req, void *data, bool reserved) dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device, "Cancelling I/O %d", req->tag); - nvme_req(req)->status = NVME_SC_ABORT_REQ; - blk_mq_complete_request_sync(req); + /* don't abort one completed request */ + if (blk_mq_request_completed(req)) + return true; + + nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR; + blk_mq_complete_request(req); return true; } EXPORT_SYMBOL_GPL(nvme_cancel_request); @@ -1088,10 +1095,9 @@ static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *n NVME_IDENTIFY_DATA_SIZE); } -static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl, - unsigned nsid) +static int nvme_identify_ns(struct nvme_ctrl *ctrl, + unsigned nsid, struct nvme_id_ns **id) { - struct nvme_id_ns *id; struct nvme_command c = { }; int error; @@ -1100,18 +1106,17 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl, c.identify.nsid = cpu_to_le32(nsid); c.identify.cns = NVME_ID_CNS_NS; - id = kmalloc(sizeof(*id), GFP_KERNEL); - if (!id) - return NULL; + *id = kmalloc(sizeof(**id), GFP_KERNEL); + if (!*id) + return -ENOMEM; - error = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id)); + error = nvme_submit_sync_cmd(ctrl->admin_q, &c, *id, sizeof(**id)); if (error) { dev_warn(ctrl->device, "Identify namespace failed (%d)\n", error); - kfree(id); - return NULL; + kfree(*id); } - return id; + return error; } static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid, @@ -1180,7 +1185,8 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) EXPORT_SYMBOL_GPL(nvme_set_queue_count); #define NVME_AEN_SUPPORTED \ - (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | NVME_AEN_CFG_ANA_CHANGE) + (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \ + NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE) static void nvme_enable_aen(struct nvme_ctrl *ctrl) { @@ -1195,6 +1201,8 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl) if (status) dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n", supported_aens); + + queue_work(nvme_wq, &ctrl->async_event_work); } static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) @@ -1594,9 +1602,11 @@ static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns *ns) blk_queue_max_write_zeroes_sectors(disk->queue, max_sectors); } -static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, +static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, struct nvme_id_ns *id, struct nvme_ns_ids *ids) { + int ret = 0; + memset(ids, 0, sizeof(*ids)); if (ctrl->vs >= NVME_VS(1, 1, 0)) @@ -1607,10 +1617,12 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, /* Don't treat error as fatal we potentially * already have a NGUID or EUI-64 */ - if (nvme_identify_ns_descs(ctrl, nsid, ids)) + ret = nvme_identify_ns_descs(ctrl, nsid, ids); + if (ret) dev_warn(ctrl->device, - "%s: Identify Descriptors failed\n", __func__); + "Identify Descriptors failed (%d)\n", ret); } + return ret; } static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids) @@ -1738,25 +1750,37 @@ static int nvme_revalidate_disk(struct gendisk *disk) return -ENODEV; } - id = nvme_identify_ns(ctrl, ns->head->ns_id); - if (!id) - return -ENODEV; + ret = nvme_identify_ns(ctrl, ns->head->ns_id, &id); + if (ret) + goto out; if (id->ncap == 0) { ret = -ENODEV; - goto out; + goto free_id; } __nvme_revalidate_disk(disk, id); - nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids); + ret = nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids); + if (ret) + goto free_id; + if (!nvme_ns_ids_equal(&ns->head->ids, &ids)) { dev_err(ctrl->device, "identifiers changed for nsid %d\n", ns->head->ns_id); ret = -ENODEV; } -out: +free_id: kfree(id); +out: + /* + * Only fail the function if we got a fatal error back from the + * device, otherwise ignore the error and just move on. + */ + if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR))) + ret = 0; + else if (ret > 0) + ret = blk_status_to_errno(nvme_error_status(ret)); return ret; } @@ -1952,7 +1976,7 @@ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled) * bits', but doing so may cause the device to complete commands to the * admin queue ... and we don't know what memory that might be pointing at! */ -int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap) +int nvme_disable_ctrl(struct nvme_ctrl *ctrl) { int ret; @@ -1966,20 +1990,27 @@ int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap) if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY) msleep(NVME_QUIRK_DELAY_AMOUNT); - return nvme_wait_ready(ctrl, cap, false); + return nvme_wait_ready(ctrl, ctrl->cap, false); } EXPORT_SYMBOL_GPL(nvme_disable_ctrl); -int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap) +int nvme_enable_ctrl(struct nvme_ctrl *ctrl) { /* * Default to a 4K page size, with the intention to update this * path in the future to accomodate architectures with differing * kernel and IO page sizes. */ - unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12, page_shift = 12; + unsigned dev_page_min, page_shift = 12; int ret; + ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap); + if (ret) { + dev_err(ctrl->device, "Reading CAP failed (%d)\n", ret); + return ret; + } + dev_page_min = NVME_CAP_MPSMIN(ctrl->cap) + 12; + if (page_shift < dev_page_min) { dev_err(ctrl->device, "Minimum device page size %u too large for host (%u)\n", @@ -1998,7 +2029,7 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap) ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config); if (ret) return ret; - return nvme_wait_ready(ctrl, cap, true); + return nvme_wait_ready(ctrl, ctrl->cap, true); } EXPORT_SYMBOL_GPL(nvme_enable_ctrl); @@ -2332,7 +2363,8 @@ static void nvme_release_subsystem(struct device *dev) struct nvme_subsystem *subsys = container_of(dev, struct nvme_subsystem, dev); - ida_simple_remove(&nvme_subsystems_ida, subsys->instance); + if (subsys->instance >= 0) + ida_simple_remove(&nvme_instance_ida, subsys->instance); kfree(subsys); } @@ -2361,6 +2393,17 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn) lockdep_assert_held(&nvme_subsystems_lock); + /* + * Fail matches for discovery subsystems. This results + * in each discovery controller bound to a unique subsystem. + * This avoids issues with validating controller values + * that can only be true when there is a single unique subsystem. + * There may be multiple and completely independent entities + * that provide discovery controllers. + */ + if (!strcmp(subsysnqn, NVME_DISC_SUBSYS_NAME)) + return NULL; + list_for_each_entry(subsys, &nvme_subsystems, entry) { if (strcmp(subsys->subnqn, subsysnqn)) continue; @@ -2461,12 +2504,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) subsys = kzalloc(sizeof(*subsys), GFP_KERNEL); if (!subsys) return -ENOMEM; - ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL); - if (ret < 0) { - kfree(subsys); - return ret; - } - subsys->instance = ret; + + subsys->instance = -1; mutex_init(&subsys->lock); kref_init(&subsys->ref); INIT_LIST_HEAD(&subsys->ctrls); @@ -2485,7 +2524,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) subsys->dev.class = nvme_subsys_class; subsys->dev.release = nvme_release_subsystem; subsys->dev.groups = nvme_subsys_attrs_groups; - dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance); + dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance); device_initialize(&subsys->dev); mutex_lock(&nvme_subsystems_lock); @@ -2517,6 +2556,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) goto out_put_subsystem; } + if (!found) + subsys->instance = ctrl->instance; ctrl->subsys = subsys; list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); mutex_unlock(&nvme_subsystems_lock); @@ -2574,7 +2615,6 @@ static int nvme_get_effects_log(struct nvme_ctrl *ctrl) int nvme_init_identify(struct nvme_ctrl *ctrl) { struct nvme_id_ctrl *id; - u64 cap; int ret, page_shift; u32 max_hw_sectors; bool prev_apst_enabled; @@ -2584,16 +2624,11 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) dev_err(ctrl->device, "Reading VS failed (%d)\n", ret); return ret; } - - ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap); - if (ret) { - dev_err(ctrl->device, "Reading CAP failed (%d)\n", ret); - return ret; - } - page_shift = NVME_CAP_MPSMIN(cap) + 12; + page_shift = NVME_CAP_MPSMIN(ctrl->cap) + 12; + ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize); if (ctrl->vs >= NVME_VS(1, 1, 0)) - ctrl->subsystem = NVME_CAP_NSSRC(cap); + ctrl->subsystem = NVME_CAP_NSSRC(ctrl->cap); ret = nvme_identify_ctrl(ctrl, &id); if (ret) { @@ -3184,7 +3219,9 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl, head->ns_id = nsid; kref_init(&head->ref); - nvme_report_ns_ids(ctrl, nsid, id, &head->ids); + ret = nvme_report_ns_ids(ctrl, nsid, id, &head->ids); + if (ret) + goto out_cleanup_srcu; ret = __nvme_check_ids(ctrl->subsys, head); if (ret) { @@ -3209,6 +3246,8 @@ out_ida_remove: out_free_head: kfree(head); out: + if (ret > 0) + ret = blk_status_to_errno(nvme_error_status(ret)); return ERR_PTR(ret); } @@ -3232,7 +3271,10 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid, } else { struct nvme_ns_ids ids; - nvme_report_ns_ids(ctrl, nsid, id, &ids); + ret = nvme_report_ns_ids(ctrl, nsid, id, &ids); + if (ret) + goto out_unlock; + if (!nvme_ns_ids_equal(&head->ids, &ids)) { dev_err(ctrl->device, "IDs don't match for shared namespace %d\n", @@ -3247,6 +3289,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid, out_unlock: mutex_unlock(&ctrl->subsys->lock); + if (ret > 0) + ret = blk_status_to_errno(nvme_error_status(ret)); return ret; } @@ -3338,11 +3382,9 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid) blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift); nvme_set_queue_limits(ctrl, ns->queue); - id = nvme_identify_ns(ctrl, nsid); - if (!id) { - ret = -EIO; + ret = nvme_identify_ns(ctrl, nsid, &id); + if (ret) goto out_free_queue; - } if (id->ncap == 0) { ret = -EINVAL; @@ -3404,6 +3446,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid) blk_cleanup_queue(ns->queue); out_free_ns: kfree(ns); + if (ret > 0) + ret = blk_status_to_errno(nvme_error_status(ret)); return ret; } @@ -3617,6 +3661,33 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl) } EXPORT_SYMBOL_GPL(nvme_remove_namespaces); +static int nvme_class_uevent(struct device *dev, struct kobj_uevent_env *env) +{ + struct nvme_ctrl *ctrl = + container_of(dev, struct nvme_ctrl, ctrl_device); + struct nvmf_ctrl_options *opts = ctrl->opts; + int ret; + + ret = add_uevent_var(env, "NVME_TRTYPE=%s", ctrl->ops->name); + if (ret) + return ret; + + if (opts) { + ret = add_uevent_var(env, "NVME_TRADDR=%s", opts->traddr); + if (ret) + return ret; + + ret = add_uevent_var(env, "NVME_TRSVCID=%s", + opts->trsvcid ?: "none"); + if (ret) + return ret; + + ret = add_uevent_var(env, "NVME_HOST_TRADDR=%s", + opts->host_traddr ?: "none"); + } + return ret; +} + static void nvme_aen_uevent(struct nvme_ctrl *ctrl) { char *envp[2] = { NULL, NULL }; @@ -3723,6 +3794,9 @@ static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result) queue_work(nvme_wq, &ctrl->ana_work); break; #endif + case NVME_AER_NOTICE_DISC_CHANGED: + ctrl->aen_result = result; + break; default: dev_warn(ctrl->device, "async event result %08x\n", result); } @@ -3769,10 +3843,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl) if (ctrl->kato) nvme_start_keep_alive(ctrl); + nvme_enable_aen(ctrl); + if (ctrl->queue_count > 1) { nvme_queue_scan(ctrl); - nvme_enable_aen(ctrl); - queue_work(nvme_wq, &ctrl->async_event_work); nvme_start_queues(ctrl); } } @@ -3792,7 +3866,9 @@ static void nvme_free_ctrl(struct device *dev) container_of(dev, struct nvme_ctrl, ctrl_device); struct nvme_subsystem *subsys = ctrl->subsys; - ida_simple_remove(&nvme_instance_ida, ctrl->instance); + if (subsys && ctrl->instance != subsys->instance) + ida_simple_remove(&nvme_instance_ida, ctrl->instance); + kfree(ctrl->effects); nvme_mpath_uninit(ctrl); __free_page(ctrl->discard_page); @@ -3992,6 +4068,9 @@ void nvme_sync_queues(struct nvme_ctrl *ctrl) list_for_each_entry(ns, &ctrl->namespaces, list) blk_sync_queue(ns->queue); up_read(&ctrl->namespaces_rwsem); + + if (ctrl->admin_q) + blk_sync_queue(ctrl->admin_q); } EXPORT_SYMBOL_GPL(nvme_sync_queues); @@ -4050,6 +4129,7 @@ static int __init nvme_core_init(void) result = PTR_ERR(nvme_class); goto unregister_chrdev; } + nvme_class->dev_uevent = nvme_class_uevent; nvme_subsys_class = class_create(THIS_MODULE, "nvme-subsystem"); if (IS_ERR(nvme_subsys_class)) { @@ -4074,7 +4154,6 @@ out: static void __exit nvme_core_exit(void) { - ida_destroy(&nvme_subsystems_ida); class_destroy(nvme_subsys_class); class_destroy(nvme_class); unregister_chrdev_region(nvme_chr_devt, NVME_MINORS); diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 1994d5b42f94..74b8818ac9a1 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -150,7 +150,7 @@ int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val) cmd.prop_get.fctype = nvme_fabrics_type_property_get; cmd.prop_get.offset = cpu_to_le32(off); - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, &res, NULL, 0, 0, + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, 0, NVME_QID_ANY, 0, 0, false); if (ret >= 0) @@ -197,7 +197,7 @@ int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val) cmd.prop_get.attrib = 1; cmd.prop_get.offset = cpu_to_le32(off); - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, &res, NULL, 0, 0, + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, 0, NVME_QID_ANY, 0, 0, false); if (ret >= 0) @@ -243,7 +243,7 @@ int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) cmd.prop_set.offset = cpu_to_le32(off); cmd.prop_set.value = cpu_to_le64(val); - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, NULL, NULL, 0, 0, + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, NULL, NULL, 0, 0, NVME_QID_ANY, 0, 0, false); if (unlikely(ret)) dev_err(ctrl->device, @@ -381,8 +381,8 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl) * Set keep-alive timeout in seconds granularity (ms * 1000) * and add a grace period for controller kato enforcement */ - cmd.connect.kato = ctrl->opts->discovery_nqn ? 0 : - cpu_to_le32((ctrl->kato + NVME_KATO_GRACE) * 1000); + cmd.connect.kato = ctrl->kato ? + cpu_to_le32((ctrl->kato + NVME_KATO_GRACE) * 1000) : 0; if (ctrl->opts->disable_sqflow) cmd.connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW; @@ -396,7 +396,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl) strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE); strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE); - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, &res, + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, data, sizeof(*data), 0, NVME_QID_ANY, 1, BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT, false); if (ret) { @@ -611,6 +611,7 @@ static const match_table_t opt_tokens = { { NVMF_OPT_DATA_DIGEST, "data_digest" }, { NVMF_OPT_NR_WRITE_QUEUES, "nr_write_queues=%d" }, { NVMF_OPT_NR_POLL_QUEUES, "nr_poll_queues=%d" }, + { NVMF_OPT_TOS, "tos=%d" }, { NVMF_OPT_ERR, NULL } }; @@ -632,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, opts->duplicate_connect = false; opts->hdr_digest = false; opts->data_digest = false; + opts->tos = -1; /* < 0 == use transport default */ options = o = kstrdup(buf, GFP_KERNEL); if (!options) @@ -738,13 +740,6 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, pr_warn("keep_alive_tmo 0 won't execute keep alives!!!\n"); } opts->kato = token; - - if (opts->discovery_nqn && opts->kato) { - pr_err("Discovery controllers cannot accept KATO != 0\n"); - ret = -EINVAL; - goto out; - } - break; case NVMF_OPT_CTRL_LOSS_TMO: if (match_int(args, &token)) { @@ -856,6 +851,22 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, } opts->nr_poll_queues = token; break; + case NVMF_OPT_TOS: + if (match_int(args, &token)) { + ret = -EINVAL; + goto out; + } + if (token < 0) { + pr_err("Invalid type of service %d\n", token); + ret = -EINVAL; + goto out; + } + if (token > 255) { + pr_warn("Clamping type of service to 255\n"); + token = 255; + } + opts->tos = token; + break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); @@ -865,7 +876,6 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, } if (opts->discovery_nqn) { - opts->kato = 0; opts->nr_io_queues = 0; opts->nr_write_queues = 0; opts->nr_poll_queues = 0; diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index 3044d8b99a24..93f08d77c896 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -55,6 +55,7 @@ enum { NVMF_OPT_DATA_DIGEST = 1 << 16, NVMF_OPT_NR_WRITE_QUEUES = 1 << 17, NVMF_OPT_NR_POLL_QUEUES = 1 << 18, + NVMF_OPT_TOS = 1 << 19, }; /** @@ -87,6 +88,7 @@ enum { * @data_digest: generate/verify data digest (TCP) * @nr_write_queues: number of queues for write I/O * @nr_poll_queues: number of queues for polling I/O + * @tos: type of service */ struct nvmf_ctrl_options { unsigned mask; @@ -108,6 +110,7 @@ struct nvmf_ctrl_options { bool data_digest; unsigned int nr_write_queues; unsigned int nr_poll_queues; + int tos; }; /* diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 232d8094091b..265f89e11d8b 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -1608,9 +1608,13 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) sizeof(op->rsp_iu), DMA_FROM_DEVICE); if (opstate == FCPOP_STATE_ABORTED) - status = cpu_to_le16(NVME_SC_ABORT_REQ << 1); - else if (freq->status) - status = cpu_to_le16(NVME_SC_INTERNAL << 1); + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); + else if (freq->status) { + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); + dev_info(ctrl->ctrl.device, + "NVME-FC{%d}: io failed due to lldd error %d\n", + ctrl->cnum, freq->status); + } /* * For the linux implementation, if we have an unsuccesful @@ -1637,8 +1641,13 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) * no payload in the CQE by the transport. */ if (freq->transferred_length != - be32_to_cpu(op->cmd_iu.data_len)) { - status = cpu_to_le16(NVME_SC_INTERNAL << 1); + be32_to_cpu(op->cmd_iu.data_len)) { + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); + dev_info(ctrl->ctrl.device, + "NVME-FC{%d}: io failed due to bad transfer " + "length: %d vs expected %d\n", + ctrl->cnum, freq->transferred_length, + be32_to_cpu(op->cmd_iu.data_len)); goto done; } result.u64 = 0; @@ -1655,7 +1664,17 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) freq->transferred_length || op->rsp_iu.status_code || sqe->common.command_id != cqe->command_id)) { - status = cpu_to_le16(NVME_SC_INTERNAL << 1); + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); + dev_info(ctrl->ctrl.device, + "NVME-FC{%d}: io failed due to bad NVMe_ERSP: " + "iu len %d, xfr len %d vs %d, status code " + "%d, cmdid %d vs %d\n", + ctrl->cnum, be16_to_cpu(op->rsp_iu.iu_len), + be32_to_cpu(op->rsp_iu.xfrd_len), + freq->transferred_length, + op->rsp_iu.status_code, + sqe->common.command_id, + cqe->command_id); goto done; } result = cqe->result; @@ -1663,7 +1682,11 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) break; default: - status = cpu_to_le16(NVME_SC_INTERNAL << 1); + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); + dev_info(ctrl->ctrl.device, + "NVME-FC{%d}: io failed due to odd NVMe_xRSP iu " + "len %d\n", + ctrl->cnum, freq->rcv_rsplen); goto done; } @@ -2006,6 +2029,7 @@ nvme_fc_ctrl_free(struct kref *ref) blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); blk_cleanup_queue(ctrl->ctrl.admin_q); + blk_cleanup_queue(ctrl->ctrl.fabrics_q); blk_mq_free_tag_set(&ctrl->admin_tag_set); kfree(ctrl->queues); @@ -2107,7 +2131,6 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq, struct nvme_fc_fcp_op *op) { struct nvmefc_fcp_req *freq = &op->fcp_req; - enum dma_data_direction dir; int ret; freq->sg_cnt = 0; @@ -2124,9 +2147,8 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq, op->nents = blk_rq_map_sg(rq->q, rq, freq->sg_table.sgl); WARN_ON(op->nents > blk_rq_nr_phys_segments(rq)); - dir = (rq_data_dir(rq) == WRITE) ? DMA_TO_DEVICE : DMA_FROM_DEVICE; freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl, - op->nents, dir); + op->nents, rq_dma_dir(rq)); if (unlikely(freq->sg_cnt <= 0)) { sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE); freq->sg_cnt = 0; @@ -2149,8 +2171,7 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl, struct request *rq, return; fc_dma_unmap_sg(ctrl->lport->dev, freq->sg_table.sgl, op->nents, - ((rq_data_dir(rq) == WRITE) ? - DMA_TO_DEVICE : DMA_FROM_DEVICE)); + rq_dma_dir(rq)); nvme_cleanup_cmd(rq); @@ -2633,8 +2654,6 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl) if (ret) goto out_delete_hw_queue; - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); - ret = nvmf_connect_admin_queue(&ctrl->ctrl); if (ret) goto out_disconnect_admin_queue; @@ -2648,23 +2667,15 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl) * prior connection values */ - ret = nvmf_reg_read64(&ctrl->ctrl, NVME_REG_CAP, &ctrl->ctrl.cap); - if (ret) { - dev_err(ctrl->ctrl.device, - "prop_get NVME_REG_CAP failed\n"); - goto out_disconnect_admin_queue; - } - - ctrl->ctrl.sqsize = - min_t(int, NVME_CAP_MQES(ctrl->ctrl.cap), ctrl->ctrl.sqsize); - - ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); + ret = nvme_enable_ctrl(&ctrl->ctrl); if (ret) goto out_disconnect_admin_queue; ctrl->ctrl.max_hw_sectors = (ctrl->lport->ops->max_sgl_segments - 1) << (PAGE_SHIFT - 9); + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + ret = nvme_init_identify(&ctrl->ctrl); if (ret) goto out_disconnect_admin_queue; @@ -2774,6 +2785,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl) nvme_stop_queues(&ctrl->ctrl); blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_fc_terminate_exchange, &ctrl->ctrl); + blk_mq_tagset_wait_completed_request(&ctrl->tag_set); } /* @@ -2796,6 +2808,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl) blk_mq_quiesce_queue(ctrl->ctrl.admin_q); blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_fc_terminate_exchange, &ctrl->ctrl); + blk_mq_tagset_wait_completed_request(&ctrl->admin_tag_set); /* kill the aens as they are a separate path */ nvme_fc_abort_aen_ops(ctrl); @@ -3109,10 +3122,16 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts, goto out_free_queues; ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set; + ctrl->ctrl.fabrics_q = blk_mq_init_queue(&ctrl->admin_tag_set); + if (IS_ERR(ctrl->ctrl.fabrics_q)) { + ret = PTR_ERR(ctrl->ctrl.fabrics_q); + goto out_free_admin_tag_set; + } + ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set); if (IS_ERR(ctrl->ctrl.admin_q)) { ret = PTR_ERR(ctrl->ctrl.admin_q); - goto out_free_admin_tag_set; + goto out_cleanup_fabrics_q; } /* @@ -3184,6 +3203,8 @@ fail_ctrl: out_cleanup_admin_q: blk_cleanup_queue(ctrl->ctrl.admin_q); +out_cleanup_fabrics_q: + blk_cleanup_queue(ctrl->ctrl.fabrics_q); out_free_admin_tag_set: blk_mq_free_tag_set(&ctrl->admin_tag_set); out_free_queues: diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c index ba009d4c9dfa..ec46693f6b64 100644 --- a/drivers/nvme/host/lightnvm.c +++ b/drivers/nvme/host/lightnvm.c @@ -667,11 +667,14 @@ static struct request *nvme_nvm_alloc_request(struct request_queue *q, return rq; } -static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd) +static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd, + void *buf) { + struct nvm_geo *geo = &dev->geo; struct request_queue *q = dev->q; struct nvme_nvm_command *cmd; struct request *rq; + int ret; cmd = kzalloc(sizeof(struct nvme_nvm_command), GFP_KERNEL); if (!cmd) @@ -679,8 +682,15 @@ static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd) rq = nvme_nvm_alloc_request(q, rqd, cmd); if (IS_ERR(rq)) { - kfree(cmd); - return PTR_ERR(rq); + ret = PTR_ERR(rq); + goto err_free_cmd; + } + + if (buf) { + ret = blk_rq_map_kern(q, rq, buf, geo->csecs * rqd->nr_ppas, + GFP_KERNEL); + if (ret) + goto err_free_cmd; } rq->end_io_data = rqd; @@ -688,33 +698,9 @@ static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd) blk_execute_rq_nowait(q, NULL, rq, 0, nvme_nvm_end_io); return 0; -} - -static int nvme_nvm_submit_io_sync(struct nvm_dev *dev, struct nvm_rq *rqd) -{ - struct request_queue *q = dev->q; - struct request *rq; - struct nvme_nvm_command cmd; - int ret = 0; - - memset(&cmd, 0, sizeof(struct nvme_nvm_command)); - - rq = nvme_nvm_alloc_request(q, rqd, &cmd); - if (IS_ERR(rq)) - return PTR_ERR(rq); - - /* I/Os can fail and the error is signaled through rqd. Callers must - * handle the error accordingly. - */ - blk_execute_rq(q, NULL, rq, 0); - if (nvme_req(rq)->flags & NVME_REQ_CANCELLED) - ret = -EINTR; - - rqd->ppa_status = le64_to_cpu(nvme_req(rq)->result.u64); - rqd->error = nvme_req(rq)->status; - - blk_mq_free_request(rq); +err_free_cmd: + kfree(cmd); return ret; } @@ -754,7 +740,6 @@ static struct nvm_dev_ops nvme_nvm_dev_ops = { .get_chk_meta = nvme_nvm_get_chk_meta, .submit_io = nvme_nvm_submit_io, - .submit_io_sync = nvme_nvm_submit_io_sync, .create_dma_pool = nvme_nvm_create_dma_pool, .destroy_dma_pool = nvme_nvm_destroy_dma_pool, diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index af831d3d15d0..30de7efef003 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -509,14 +509,16 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl, down_write(&ctrl->namespaces_rwsem); list_for_each_entry(ns, &ctrl->namespaces, list) { - if (ns->head->ns_id != le32_to_cpu(desc->nsids[n])) + unsigned nsid = le32_to_cpu(desc->nsids[n]); + + if (ns->head->ns_id < nsid) continue; - nvme_update_ns_ana_state(desc, ns); + if (ns->head->ns_id == nsid) + nvme_update_ns_ana_state(desc, ns); if (++n == nr_nsids) break; } up_write(&ctrl->namespaces_rwsem); - WARN_ON_ONCE(n < nr_nsids); return 0; } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 2d678fb968c7..b5013c101b35 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -16,6 +16,8 @@ #include <linux/fault-inject.h> #include <linux/rcupdate.h> +#include <trace/events/block.h> + extern unsigned int nvme_io_timeout; #define NVME_IO_TIMEOUT (nvme_io_timeout * HZ) @@ -97,6 +99,21 @@ enum nvme_quirks { * Force simple suspend/resume path. */ NVME_QUIRK_SIMPLE_SUSPEND = (1 << 10), + + /* + * Use only one interrupt vector for all queues + */ + NVME_QUIRK_SINGLE_VECTOR = (1 << 11), + + /* + * Use non-standard 128 bytes SQEs. + */ + NVME_QUIRK_128_BYTES_SQES = (1 << 12), + + /* + * Prevent tag overlap between queues + */ + NVME_QUIRK_SHARED_TAGS = (1 << 13), }; /* @@ -169,6 +186,7 @@ struct nvme_ctrl { const struct nvme_ctrl_ops *ops; struct request_queue *admin_q; struct request_queue *connect_q; + struct request_queue *fabrics_q; struct device *dev; int instance; int numa_node; @@ -431,8 +449,8 @@ void nvme_complete_rq(struct request *req); bool nvme_cancel_request(struct request *req, void *data, bool reserved); bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, enum nvme_ctrl_state new_state); -int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap); -int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap); +int nvme_disable_ctrl(struct nvme_ctrl *ctrl); +int nvme_enable_ctrl(struct nvme_ctrl *ctrl); int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl); int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, const struct nvme_ctrl_ops *ops, unsigned long quirks); @@ -520,6 +538,16 @@ static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) kblockd_schedule_work(&head->requeue_work); } +static inline void nvme_trace_bio_complete(struct request *req, + blk_status_t status) +{ + struct nvme_ns *ns = req->q->queuedata; + + if (req->cmd_flags & REQ_NVME_MPATH) + trace_block_bio_complete(ns->head->disk->queue, + req->bio, status); +} + extern struct device_attribute dev_attr_ana_grpid; extern struct device_attribute dev_attr_ana_state; extern struct device_attribute subsys_attr_iopolicy; @@ -567,6 +595,10 @@ static inline void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl) static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) { } +static inline void nvme_trace_bio_complete(struct request *req, + blk_status_t status) +{ +} static inline int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) { diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 732d5b63ec05..6b4d7b064b38 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -28,8 +28,8 @@ #include "trace.h" #include "nvme.h" -#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command)) -#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion)) +#define SQ_SIZE(q) ((q)->q_depth << (q)->sqes) +#define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion)) #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc)) @@ -100,6 +100,7 @@ struct nvme_dev { unsigned io_queues[HCTX_MAX_TYPES]; unsigned int num_vecs; int q_depth; + int io_sqes; u32 db_stride; void __iomem *bar; unsigned long bar_mapped_size; @@ -162,7 +163,7 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl) struct nvme_queue { struct nvme_dev *dev; spinlock_t sq_lock; - struct nvme_command *sq_cmds; + void *sq_cmds; /* only used for poll queues: */ spinlock_t cq_poll_lock ____cacheline_aligned_in_smp; volatile struct nvme_completion *cqes; @@ -178,6 +179,7 @@ struct nvme_queue { u16 last_cq_head; u16 qid; u8 cq_phase; + u8 sqes; unsigned long flags; #define NVMEQ_ENABLED 0 #define NVMEQ_SQ_CMB 1 @@ -488,7 +490,8 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd, bool write_sq) { spin_lock(&nvmeq->sq_lock); - memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd)); + memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes), + cmd, sizeof(*cmd)); if (++nvmeq->sq_tail == nvmeq->q_depth) nvmeq->sq_tail = 0; nvme_write_sq_db(nvmeq, write_sq); @@ -534,14 +537,13 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req) static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); - enum dma_data_direction dma_dir = rq_data_dir(req) ? - DMA_TO_DEVICE : DMA_FROM_DEVICE; const int last_prp = dev->ctrl.page_size / sizeof(__le64) - 1; dma_addr_t dma_addr = iod->first_dma, next_dma_addr; int i; if (iod->dma_len) { - dma_unmap_page(dev->dev, dma_addr, iod->dma_len, dma_dir); + dma_unmap_page(dev->dev, dma_addr, iod->dma_len, + rq_dma_dir(req)); return; } @@ -1344,16 +1346,16 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) static void nvme_free_queue(struct nvme_queue *nvmeq) { - dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth), + dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq), (void *)nvmeq->cqes, nvmeq->cq_dma_addr); if (!nvmeq->sq_cmds) return; if (test_and_clear_bit(NVMEQ_SQ_CMB, &nvmeq->flags)) { pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev), - nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth)); + nvmeq->sq_cmds, SQ_SIZE(nvmeq)); } else { - dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth), + dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq), nvmeq->sq_cmds, nvmeq->sq_dma_addr); } } @@ -1403,7 +1405,7 @@ static void nvme_disable_admin_queue(struct nvme_dev *dev, bool shutdown) if (shutdown) nvme_shutdown_ctrl(&dev->ctrl); else - nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap); + nvme_disable_ctrl(&dev->ctrl); nvme_poll_irqdisable(nvmeq, -1); } @@ -1433,12 +1435,12 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues, } static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq, - int qid, int depth) + int qid) { struct pci_dev *pdev = to_pci_dev(dev->dev); if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) { - nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth)); + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(nvmeq)); if (nvmeq->sq_cmds) { nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev, nvmeq->sq_cmds); @@ -1447,11 +1449,11 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq, return 0; } - pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(depth)); + pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(nvmeq)); } } - nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth), + nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(nvmeq), &nvmeq->sq_dma_addr, GFP_KERNEL); if (!nvmeq->sq_cmds) return -ENOMEM; @@ -1465,12 +1467,14 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth) if (dev->ctrl.queue_count > qid) return 0; - nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(depth), + nvmeq->sqes = qid ? dev->io_sqes : NVME_ADM_SQES; + nvmeq->q_depth = depth; + nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq), &nvmeq->cq_dma_addr, GFP_KERNEL); if (!nvmeq->cqes) goto free_nvmeq; - if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth)) + if (nvme_alloc_sq_cmds(dev, nvmeq, qid)) goto free_cqdma; nvmeq->dev = dev; @@ -1479,15 +1483,14 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth) nvmeq->cq_head = 0; nvmeq->cq_phase = 1; nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride]; - nvmeq->q_depth = depth; nvmeq->qid = qid; dev->ctrl.queue_count++; return 0; free_cqdma: - dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes, - nvmeq->cq_dma_addr); + dma_free_coherent(dev->dev, CQ_SIZE(nvmeq), (void *)nvmeq->cqes, + nvmeq->cq_dma_addr); free_nvmeq: return -ENOMEM; } @@ -1515,7 +1518,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid) nvmeq->cq_head = 0; nvmeq->cq_phase = 1; nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride]; - memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth)); + memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq)); nvme_dbbuf_init(dev, nvmeq, qid); dev->online_queues++; wmb(); /* ensure the first interrupt sees the initialization */ @@ -1552,7 +1555,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled) nvme_init_queue(nvmeq, qid); if (!polled) { - nvmeq->cq_vector = vector; result = queue_request_irq(nvmeq); if (result < 0) goto release_sq; @@ -1679,7 +1681,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev) (readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_NSSRO)) writel(NVME_CSTS_NSSRO, dev->bar + NVME_REG_CSTS); - result = nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap); + result = nvme_disable_ctrl(&dev->ctrl); if (result < 0) return result; @@ -1695,7 +1697,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev) lo_hi_writeq(nvmeq->sq_dma_addr, dev->bar + NVME_REG_ASQ); lo_hi_writeq(nvmeq->cq_dma_addr, dev->bar + NVME_REG_ACQ); - result = nvme_enable_ctrl(&dev->ctrl, dev->ctrl.cap); + result = nvme_enable_ctrl(&dev->ctrl); if (result) return result; @@ -2077,6 +2079,13 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues) dev->io_queues[HCTX_TYPE_DEFAULT] = 1; dev->io_queues[HCTX_TYPE_READ] = 0; + /* + * Some Apple controllers require all queues to use the + * first vector. + */ + if (dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR) + irq_queues = 1; + return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd); } @@ -2095,6 +2104,14 @@ static int nvme_setup_io_queues(struct nvme_dev *dev) unsigned long size; nr_io_queues = max_io_queues(); + + /* + * If tags are shared with admin queue (Apple bug), then + * make sure we only use one IO queue. + */ + if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) + nr_io_queues = 1; + result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues); if (result < 0) return result; @@ -2265,6 +2282,14 @@ static int nvme_dev_add(struct nvme_dev *dev) dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE; dev->tagset.driver_data = dev; + /* + * Some Apple controllers requires tags to be unique + * across admin and IO queue, so reserve the first 32 + * tags of the IO queue. + */ + if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) + dev->tagset.reserved_tags = NVME_AQ_DEPTH; + ret = blk_mq_alloc_tag_set(&dev->tagset); if (ret) { dev_warn(dev->ctrl.device, @@ -2314,10 +2339,21 @@ static int nvme_pci_enable(struct nvme_dev *dev) dev->q_depth = min_t(int, NVME_CAP_MQES(dev->ctrl.cap) + 1, io_queue_depth); + dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */ dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap); dev->dbs = dev->bar + 4096; /* + * Some Apple controllers require a non-standard SQE size. + * Interestingly they also seem to ignore the CC:IOSQES register + * so we don't bother updating it here. + */ + if (dev->ctrl.quirks & NVME_QUIRK_128_BYTES_SQES) + dev->io_sqes = 7; + else + dev->io_sqes = NVME_NVM_IOSQES; + + /* * Temporary fix for the Apple controller found in the MacBook8,1 and * some MacBook7,1 to avoid controller resets and data loss. */ @@ -2334,6 +2370,18 @@ static int nvme_pci_enable(struct nvme_dev *dev) "set queue depth=%u\n", dev->q_depth); } + /* + * Controllers with the shared tags quirk need the IO queue to be + * big enough so that we get 32 tags for the admin queue + */ + if ((dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) && + (dev->q_depth < (NVME_AQ_DEPTH + 2))) { + dev->q_depth = NVME_AQ_DEPTH + 2; + dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n", + dev->q_depth); + } + + nvme_map_cmb(dev); pci_enable_pcie_error_reporting(pdev); @@ -2401,6 +2449,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl); blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl); + blk_mq_tagset_wait_completed_request(&dev->tagset); + blk_mq_tagset_wait_completed_request(&dev->admin_tagset); /* * The driver will not be starting up queues again if shutting down so @@ -3041,6 +3091,10 @@ static const struct pci_device_id nvme_id_table[] = { { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) }, { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) }, { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) }, + { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005), + .driver_data = NVME_QUIRK_SINGLE_VECTOR | + NVME_QUIRK_128_BYTES_SQES | + NVME_QUIRK_SHARED_TAGS }, { 0, } }; MODULE_DEVICE_TABLE(pci, nvme_id_table); diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 1a6449bc547b..dfa07bb9dfeb 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -757,6 +757,7 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl, { if (remove) { blk_cleanup_queue(ctrl->ctrl.admin_q); + blk_cleanup_queue(ctrl->ctrl.fabrics_q); blk_mq_free_tag_set(ctrl->ctrl.admin_tagset); } if (ctrl->async_event_sqe.data) { @@ -798,10 +799,16 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, goto out_free_async_qe; } + ctrl->ctrl.fabrics_q = blk_mq_init_queue(&ctrl->admin_tag_set); + if (IS_ERR(ctrl->ctrl.fabrics_q)) { + error = PTR_ERR(ctrl->ctrl.fabrics_q); + goto out_free_tagset; + } + ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set); if (IS_ERR(ctrl->ctrl.admin_q)) { error = PTR_ERR(ctrl->ctrl.admin_q); - goto out_free_tagset; + goto out_cleanup_fabrics_q; } } @@ -809,24 +816,15 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, if (error) goto out_cleanup_queue; - error = ctrl->ctrl.ops->reg_read64(&ctrl->ctrl, NVME_REG_CAP, - &ctrl->ctrl.cap); - if (error) { - dev_err(ctrl->ctrl.device, - "prop_get NVME_REG_CAP failed\n"); - goto out_stop_queue; - } - - ctrl->ctrl.sqsize = - min_t(int, NVME_CAP_MQES(ctrl->ctrl.cap), ctrl->ctrl.sqsize); - - error = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); + error = nvme_enable_ctrl(&ctrl->ctrl); if (error) goto out_stop_queue; ctrl->ctrl.max_hw_sectors = (ctrl->max_fr_pages - 1) << (ilog2(SZ_4K) - 9); + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + error = nvme_init_identify(&ctrl->ctrl); if (error) goto out_stop_queue; @@ -838,6 +836,9 @@ out_stop_queue: out_cleanup_queue: if (new) blk_cleanup_queue(ctrl->ctrl.admin_q); +out_cleanup_fabrics_q: + if (new) + blk_cleanup_queue(ctrl->ctrl.fabrics_q); out_free_tagset: if (new) blk_mq_free_tag_set(ctrl->ctrl.admin_tagset); @@ -907,10 +908,13 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl, { blk_mq_quiesce_queue(ctrl->ctrl.admin_q); nvme_rdma_stop_queue(&ctrl->queues[0]); - if (ctrl->ctrl.admin_tagset) + if (ctrl->ctrl.admin_tagset) { blk_mq_tagset_busy_iter(ctrl->ctrl.admin_tagset, nvme_cancel_request, &ctrl->ctrl); - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + blk_mq_tagset_wait_completed_request(ctrl->ctrl.admin_tagset); + } + if (remove) + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); nvme_rdma_destroy_admin_queue(ctrl, remove); } @@ -920,9 +924,11 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl, if (ctrl->ctrl.queue_count > 1) { nvme_stop_queues(&ctrl->ctrl); nvme_rdma_stop_io_queues(ctrl); - if (ctrl->ctrl.tagset) + if (ctrl->ctrl.tagset) { blk_mq_tagset_busy_iter(ctrl->ctrl.tagset, nvme_cancel_request, &ctrl->ctrl); + blk_mq_tagset_wait_completed_request(ctrl->ctrl.tagset); + } if (remove) nvme_start_queues(&ctrl->ctrl); nvme_rdma_destroy_io_queues(ctrl, remove); @@ -1059,6 +1065,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work) nvme_rdma_teardown_io_queues(ctrl, false); nvme_start_queues(&ctrl->ctrl); nvme_rdma_teardown_admin_queue(ctrl, false); + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) { /* state change failure is ok if we're in DELETING state */ @@ -1145,9 +1152,7 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue, req->mr = NULL; } - ib_dma_unmap_sg(ibdev, req->sg_table.sgl, - req->nents, rq_data_dir(rq) == - WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE); + ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq)); nvme_cleanup_cmd(rq); sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE); @@ -1273,7 +1278,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue, req->nents = blk_rq_map_sg(rq->q, rq, req->sg_table.sgl); count = ib_dma_map_sg(ibdev, req->sg_table.sgl, req->nents, - rq_data_dir(rq) == WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE); + rq_dma_dir(rq)); if (unlikely(count <= 0)) { ret = -EIO; goto out_free_table; @@ -1302,9 +1307,7 @@ out: return 0; out_unmap_sg: - ib_dma_unmap_sg(ibdev, req->sg_table.sgl, - req->nents, rq_data_dir(rq) == - WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE); + ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq)); out_free_table: sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE); return ret; @@ -1547,16 +1550,18 @@ static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue, static int nvme_rdma_addr_resolved(struct nvme_rdma_queue *queue) { + struct nvme_ctrl *ctrl = &queue->ctrl->ctrl; int ret; ret = nvme_rdma_create_queue_ib(queue); if (ret) return ret; + if (ctrl->opts->tos >= 0) + rdma_set_service_type(queue->cm_id, ctrl->opts->tos); ret = rdma_resolve_route(queue->cm_id, NVME_RDMA_CONNECT_TIMEOUT_MS); if (ret) { - dev_err(queue->ctrl->ctrl.device, - "rdma_resolve_route failed (%d).\n", + dev_err(ctrl->device, "rdma_resolve_route failed (%d).\n", queue->cm_error); goto out_destroy_queue; } @@ -1869,10 +1874,11 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown) cancel_delayed_work_sync(&ctrl->reconnect_work); nvme_rdma_teardown_io_queues(ctrl, shutdown); + blk_mq_quiesce_queue(ctrl->ctrl.admin_q); if (shutdown) nvme_shutdown_ctrl(&ctrl->ctrl); else - nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); + nvme_disable_ctrl(&ctrl->ctrl); nvme_rdma_teardown_admin_queue(ctrl, shutdown); } @@ -2051,7 +2057,8 @@ static struct nvmf_transport_ops nvme_rdma_transport = { .required_opts = NVMF_OPT_TRADDR, .allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY | NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO | - NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES, + NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES | + NVMF_OPT_TOS, .create_ctrl = nvme_rdma_create_ctrl, }; diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 606b13d35d16..4ffd5957637a 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -13,6 +13,7 @@ #include <net/tcp.h> #include <linux/blk-mq.h> #include <crypto/hash.h> +#include <net/busy_poll.h> #include "nvme.h" #include "fabrics.h" @@ -72,6 +73,7 @@ struct nvme_tcp_queue { int pdu_offset; size_t data_remaining; size_t ddgst_remaining; + unsigned int nr_cqe; /* send state */ struct nvme_tcp_request *request; @@ -438,6 +440,7 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue, } nvme_end_request(rq, cqe->status, cqe->result); + queue->nr_cqe++; return 0; } @@ -608,23 +611,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb, switch (hdr->type) { case nvme_tcp_c2h_data: - ret = nvme_tcp_handle_c2h_data(queue, (void *)queue->pdu); - break; + return nvme_tcp_handle_c2h_data(queue, (void *)queue->pdu); case nvme_tcp_rsp: nvme_tcp_init_recv_ctx(queue); - ret = nvme_tcp_handle_comp(queue, (void *)queue->pdu); - break; + return nvme_tcp_handle_comp(queue, (void *)queue->pdu); case nvme_tcp_r2t: nvme_tcp_init_recv_ctx(queue); - ret = nvme_tcp_handle_r2t(queue, (void *)queue->pdu); - break; + return nvme_tcp_handle_r2t(queue, (void *)queue->pdu); default: dev_err(queue->ctrl->ctrl.device, "unsupported pdu type (%d)\n", hdr->type); return -EINVAL; } - - return ret; } static inline void nvme_tcp_end_request(struct request *rq, u16 status) @@ -701,8 +699,10 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; } else { - if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) + if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { nvme_tcp_end_request(rq, NVME_SC_SUCCESS); + queue->nr_cqe++; + } nvme_tcp_init_recv_ctx(queue); } } @@ -742,6 +742,7 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue, pdu->command_id); nvme_tcp_end_request(rq, NVME_SC_SUCCESS); + queue->nr_cqe++; } nvme_tcp_init_recv_ctx(queue); @@ -841,7 +842,7 @@ static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue) static void nvme_tcp_fail_request(struct nvme_tcp_request *req) { - nvme_tcp_end_request(blk_mq_rq_from_pdu(req), NVME_SC_DATA_XFER_ERROR); + nvme_tcp_end_request(blk_mq_rq_from_pdu(req), NVME_SC_HOST_PATH_ERROR); } static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) @@ -1023,14 +1024,16 @@ done: static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue) { - struct sock *sk = queue->sock->sk; + struct socket *sock = queue->sock; + struct sock *sk = sock->sk; read_descriptor_t rd_desc; int consumed; rd_desc.arg.data = queue; rd_desc.count = 1; lock_sock(sk); - consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb); + queue->nr_cqe = 0; + consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb); release_sock(sk); return consumed; } @@ -1255,7 +1258,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, queue->queue_size = queue_size; if (qid > 0) - queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16; + queue->cmnd_capsule_len = nctrl->ioccsz * 16; else queue->cmnd_capsule_len = sizeof(struct nvme_command) + NVME_TCP_ADMIN_CCSZ; @@ -1263,7 +1266,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM, IPPROTO_TCP, &queue->sock); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to create socket: %d\n", ret); return ret; } @@ -1273,7 +1276,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT, (char *)&opt, sizeof(opt)); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to set TCP_SYNCNT sock opt %d\n", ret); goto err_sock; } @@ -1283,7 +1286,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_NODELAY, (char *)&opt, sizeof(opt)); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to set TCP_NODELAY sock opt %d\n", ret); goto err_sock; } @@ -1296,11 +1299,23 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER, (char *)&sol, sizeof(sol)); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to set SO_LINGER sock opt %d\n", ret); goto err_sock; } + /* Set socket type of service */ + if (nctrl->opts->tos >= 0) { + opt = nctrl->opts->tos; + ret = kernel_setsockopt(queue->sock, SOL_IP, IP_TOS, + (char *)&opt, sizeof(opt)); + if (ret) { + dev_err(nctrl->device, + "failed to set IP_TOS sock opt %d\n", ret); + goto err_sock; + } + } + queue->sock->sk->sk_allocation = GFP_ATOMIC; if (!qid) n = 0; @@ -1314,11 +1329,11 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, queue->pdu_offset = 0; sk_set_memalloc(queue->sock->sk); - if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) { + if (nctrl->opts->mask & NVMF_OPT_HOST_TRADDR) { ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr, sizeof(ctrl->src_addr)); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to bind queue %d socket %d\n", qid, ret); goto err_sock; @@ -1330,7 +1345,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, if (queue->hdr_digest || queue->data_digest) { ret = nvme_tcp_alloc_crypto(queue); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to allocate queue %d crypto\n", qid); goto err_sock; } @@ -1344,13 +1359,13 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, goto err_crypto; } - dev_dbg(ctrl->ctrl.device, "connecting queue %d\n", + dev_dbg(nctrl->device, "connecting queue %d\n", nvme_tcp_queue_id(queue)); ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr, sizeof(ctrl->addr), 0); if (ret) { - dev_err(ctrl->ctrl.device, + dev_err(nctrl->device, "failed to connect socket: %d\n", ret); goto err_rcv_pdu; } @@ -1371,6 +1386,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, queue->sock->sk->sk_data_ready = nvme_tcp_data_ready; queue->sock->sk->sk_state_change = nvme_tcp_state_change; queue->sock->sk->sk_write_space = nvme_tcp_write_space; + queue->sock->sk->sk_ll_usec = 1; write_unlock_bh(&queue->sock->sk->sk_callback_lock); return 0; @@ -1469,7 +1485,7 @@ static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl, set->driver_data = ctrl; set->nr_hw_queues = nctrl->queue_count - 1; set->timeout = NVME_IO_TIMEOUT; - set->nr_maps = 2 /* default + read */; + set->nr_maps = nctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2; } ret = blk_mq_alloc_tag_set(set); @@ -1568,6 +1584,7 @@ static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl) nr_io_queues = min(ctrl->opts->nr_io_queues, num_online_cpus()); nr_io_queues += min(ctrl->opts->nr_write_queues, num_online_cpus()); + nr_io_queues += min(ctrl->opts->nr_poll_queues, num_online_cpus()); return nr_io_queues; } @@ -1599,6 +1616,12 @@ static void nvme_tcp_set_io_queues(struct nvme_ctrl *nctrl, min(opts->nr_io_queues, nr_io_queues); nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT]; } + + if (opts->nr_poll_queues && nr_io_queues) { + /* map dedicated poll queues only if we have queues left */ + ctrl->io_queues[HCTX_TYPE_POLL] = + min(opts->nr_poll_queues, nr_io_queues); + } } static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl) @@ -1680,6 +1703,7 @@ static void nvme_tcp_destroy_admin_queue(struct nvme_ctrl *ctrl, bool remove) nvme_tcp_stop_queue(ctrl, 0); if (remove) { blk_cleanup_queue(ctrl->admin_q); + blk_cleanup_queue(ctrl->fabrics_q); blk_mq_free_tag_set(ctrl->admin_tagset); } nvme_tcp_free_admin_queue(ctrl); @@ -1700,10 +1724,16 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new) goto out_free_queue; } + ctrl->fabrics_q = blk_mq_init_queue(ctrl->admin_tagset); + if (IS_ERR(ctrl->fabrics_q)) { + error = PTR_ERR(ctrl->fabrics_q); + goto out_free_tagset; + } + ctrl->admin_q = blk_mq_init_queue(ctrl->admin_tagset); if (IS_ERR(ctrl->admin_q)) { error = PTR_ERR(ctrl->admin_q); - goto out_free_tagset; + goto out_cleanup_fabrics_q; } } @@ -1711,19 +1741,12 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new) if (error) goto out_cleanup_queue; - error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap); - if (error) { - dev_err(ctrl->device, - "prop_get NVME_REG_CAP failed\n"); - goto out_stop_queue; - } - - ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize); - - error = nvme_enable_ctrl(ctrl, ctrl->cap); + error = nvme_enable_ctrl(ctrl); if (error) goto out_stop_queue; + blk_mq_unquiesce_queue(ctrl->admin_q); + error = nvme_init_identify(ctrl); if (error) goto out_stop_queue; @@ -1735,6 +1758,9 @@ out_stop_queue: out_cleanup_queue: if (new) blk_cleanup_queue(ctrl->admin_q); +out_cleanup_fabrics_q: + if (new) + blk_cleanup_queue(ctrl->fabrics_q); out_free_tagset: if (new) blk_mq_free_tag_set(ctrl->admin_tagset); @@ -1748,10 +1774,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl, { blk_mq_quiesce_queue(ctrl->admin_q); nvme_tcp_stop_queue(ctrl, 0); - if (ctrl->admin_tagset) + if (ctrl->admin_tagset) { blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl); - blk_mq_unquiesce_queue(ctrl->admin_q); + blk_mq_tagset_wait_completed_request(ctrl->admin_tagset); + } + if (remove) + blk_mq_unquiesce_queue(ctrl->admin_q); nvme_tcp_destroy_admin_queue(ctrl, remove); } @@ -1762,9 +1791,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl, return; nvme_stop_queues(ctrl); nvme_tcp_stop_io_queues(ctrl); - if (ctrl->tagset) + if (ctrl->tagset) { blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl); + blk_mq_tagset_wait_completed_request(ctrl->tagset); + } if (remove) nvme_start_queues(ctrl); nvme_tcp_destroy_io_queues(ctrl, remove); @@ -1793,7 +1824,7 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl) static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new) { struct nvmf_ctrl_options *opts = ctrl->opts; - int ret = -EINVAL; + int ret; ret = nvme_tcp_configure_admin_queue(ctrl, new); if (ret) @@ -1876,6 +1907,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work) /* unquiesce to fail fast pending requests */ nvme_start_queues(ctrl); nvme_tcp_teardown_admin_queue(ctrl, false); + blk_mq_unquiesce_queue(ctrl->admin_q); if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) { /* state change failure is ok if we're in DELETING state */ @@ -1892,10 +1924,11 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown) cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work); nvme_tcp_teardown_io_queues(ctrl, shutdown); + blk_mq_quiesce_queue(ctrl->admin_q); if (shutdown) nvme_shutdown_ctrl(ctrl); else - nvme_disable_ctrl(ctrl, ctrl->cap); + nvme_disable_ctrl(ctrl); nvme_tcp_teardown_admin_queue(ctrl, shutdown); } @@ -2151,14 +2184,36 @@ static int nvme_tcp_map_queues(struct blk_mq_tag_set *set) blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]); blk_mq_map_queues(&set->map[HCTX_TYPE_READ]); + if (opts->nr_poll_queues && ctrl->io_queues[HCTX_TYPE_POLL]) { + /* map dedicated poll queues only if we have queues left */ + set->map[HCTX_TYPE_POLL].nr_queues = + ctrl->io_queues[HCTX_TYPE_POLL]; + set->map[HCTX_TYPE_POLL].queue_offset = + ctrl->io_queues[HCTX_TYPE_DEFAULT] + + ctrl->io_queues[HCTX_TYPE_READ]; + blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]); + } + dev_info(ctrl->ctrl.device, - "mapped %d/%d default/read queues.\n", + "mapped %d/%d/%d default/read/poll queues.\n", ctrl->io_queues[HCTX_TYPE_DEFAULT], - ctrl->io_queues[HCTX_TYPE_READ]); + ctrl->io_queues[HCTX_TYPE_READ], + ctrl->io_queues[HCTX_TYPE_POLL]); return 0; } +static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx) +{ + struct nvme_tcp_queue *queue = hctx->driver_data; + struct sock *sk = queue->sock->sk; + + if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue)) + sk_busy_loop(sk, true); + nvme_tcp_try_recv(queue); + return queue->nr_cqe; +} + static struct blk_mq_ops nvme_tcp_mq_ops = { .queue_rq = nvme_tcp_queue_rq, .complete = nvme_complete_rq, @@ -2167,6 +2222,7 @@ static struct blk_mq_ops nvme_tcp_mq_ops = { .init_hctx = nvme_tcp_init_hctx, .timeout = nvme_tcp_timeout, .map_queues = nvme_tcp_map_queues, + .poll = nvme_tcp_poll, }; static struct blk_mq_ops nvme_tcp_admin_mq_ops = { @@ -2220,7 +2276,8 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev, INIT_LIST_HEAD(&ctrl->list); ctrl->ctrl.opts = opts; - ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues + 1; + ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues + + opts->nr_poll_queues + 1; ctrl->ctrl.sqsize = opts->queue_size - 1; ctrl->ctrl.kato = opts->kato; @@ -2314,7 +2371,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = { .allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY | NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO | NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST | - NVMF_OPT_NR_WRITE_QUEUES, + NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES | + NVMF_OPT_TOS, .create_ctrl = nvme_tcp_create_ctrl, }; diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c index 9778eb0406b3..5c3cb6928f3c 100644 --- a/drivers/nvme/host/trace.c +++ b/drivers/nvme/host/trace.c @@ -86,6 +86,22 @@ static const char *nvme_trace_admin_get_features(struct trace_seq *p, return ret; } +static const char *nvme_trace_get_lba_status(struct trace_seq *p, + u8 *cdw10) +{ + const char *ret = trace_seq_buffer_ptr(p); + u64 slba = get_unaligned_le64(cdw10); + u32 mndw = get_unaligned_le32(cdw10 + 8); + u16 rl = get_unaligned_le16(cdw10 + 12); + u8 atype = cdw10[15]; + + trace_seq_printf(p, "slba=0x%llx, mndw=0x%x, rl=0x%x, atype=%u", + slba, mndw, rl, atype); + trace_seq_putc(p, 0); + + return ret; +} + static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10) { const char *ret = trace_seq_buffer_ptr(p); @@ -141,6 +157,8 @@ const char *nvme_trace_parse_admin_cmd(struct trace_seq *p, return nvme_trace_admin_identify(p, cdw10); case nvme_admin_get_features: return nvme_trace_admin_get_features(p, cdw10); + case nvme_admin_get_lba_status: + return nvme_trace_get_lba_status(p, cdw10); default: return nvme_trace_common(p, cdw10); } diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 4dc12ea52f23..831a062d27cb 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -37,7 +37,6 @@ static void nvmet_execute_get_log_page_noop(struct nvmet_req *req) static void nvmet_execute_get_log_page_error(struct nvmet_req *req) { struct nvmet_ctrl *ctrl = req->sq->ctrl; - u16 status = NVME_SC_SUCCESS; unsigned long flags; off_t offset = 0; u64 slot; @@ -47,9 +46,8 @@ static void nvmet_execute_get_log_page_error(struct nvmet_req *req) slot = ctrl->err_counter % NVMET_ERROR_LOG_SLOTS; for (i = 0; i < NVMET_ERROR_LOG_SLOTS; i++) { - status = nvmet_copy_to_sgl(req, offset, &ctrl->slots[slot], - sizeof(struct nvme_error_slot)); - if (status) + if (nvmet_copy_to_sgl(req, offset, &ctrl->slots[slot], + sizeof(struct nvme_error_slot))) break; if (slot == 0) @@ -59,7 +57,7 @@ static void nvmet_execute_get_log_page_error(struct nvmet_req *req) offset += sizeof(struct nvme_error_slot); } spin_unlock_irqrestore(&ctrl->error_lock, flags); - nvmet_req_complete(req, status); + nvmet_req_complete(req, 0); } static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req, @@ -81,9 +79,11 @@ static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req, goto out; host_reads = part_stat_read(ns->bdev->bd_part, ios[READ]); - data_units_read = part_stat_read(ns->bdev->bd_part, sectors[READ]); + data_units_read = DIV_ROUND_UP(part_stat_read(ns->bdev->bd_part, + sectors[READ]), 1000); host_writes = part_stat_read(ns->bdev->bd_part, ios[WRITE]); - data_units_written = part_stat_read(ns->bdev->bd_part, sectors[WRITE]); + data_units_written = DIV_ROUND_UP(part_stat_read(ns->bdev->bd_part, + sectors[WRITE]), 1000); put_unaligned_le64(host_reads, &slog->host_reads[0]); put_unaligned_le64(data_units_read, &slog->data_units_read[0]); @@ -111,11 +111,11 @@ static u16 nvmet_get_smart_log_all(struct nvmet_req *req, if (!ns->bdev) continue; host_reads += part_stat_read(ns->bdev->bd_part, ios[READ]); - data_units_read += - part_stat_read(ns->bdev->bd_part, sectors[READ]); + data_units_read += DIV_ROUND_UP( + part_stat_read(ns->bdev->bd_part, sectors[READ]), 1000); host_writes += part_stat_read(ns->bdev->bd_part, ios[WRITE]); - data_units_written += - part_stat_read(ns->bdev->bd_part, sectors[WRITE]); + data_units_written += DIV_ROUND_UP( + part_stat_read(ns->bdev->bd_part, sectors[WRITE]), 1000); } rcu_read_unlock(); diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index 8efca26b4776..3764a8900850 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -381,9 +381,7 @@ int __init nvmet_init_discovery(void) { nvmet_disc_subsys = nvmet_subsys_alloc(NVME_DISC_SUBSYS_NAME, NVME_NQN_DISC); - if (IS_ERR(nvmet_disc_subsys)) - return PTR_ERR(nvmet_disc_subsys); - return 0; + return PTR_ERR_OR_ZERO(nvmet_disc_subsys); } void nvmet_exit_discovery(void) diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index 0940c5024a34..748a39fca771 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -253,6 +253,7 @@ static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl) clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); nvmet_sq_destroy(&ctrl->queues[0].nvme_sq); blk_cleanup_queue(ctrl->ctrl.admin_q); + blk_cleanup_queue(ctrl->ctrl.fabrics_q); blk_mq_free_tag_set(&ctrl->admin_tag_set); } @@ -357,10 +358,16 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl) goto out_free_sq; ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set; + ctrl->ctrl.fabrics_q = blk_mq_init_queue(&ctrl->admin_tag_set); + if (IS_ERR(ctrl->ctrl.fabrics_q)) { + error = PTR_ERR(ctrl->ctrl.fabrics_q); + goto out_free_tagset; + } + ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set); if (IS_ERR(ctrl->ctrl.admin_q)) { error = PTR_ERR(ctrl->ctrl.admin_q); - goto out_free_tagset; + goto out_cleanup_fabrics_q; } error = nvmf_connect_admin_queue(&ctrl->ctrl); @@ -369,23 +376,15 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl) set_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); - error = nvmf_reg_read64(&ctrl->ctrl, NVME_REG_CAP, &ctrl->ctrl.cap); - if (error) { - dev_err(ctrl->ctrl.device, - "prop_get NVME_REG_CAP failed\n"); - goto out_cleanup_queue; - } - - ctrl->ctrl.sqsize = - min_t(int, NVME_CAP_MQES(ctrl->ctrl.cap), ctrl->ctrl.sqsize); - - error = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); + error = nvme_enable_ctrl(&ctrl->ctrl); if (error) goto out_cleanup_queue; ctrl->ctrl.max_hw_sectors = (NVME_LOOP_MAX_SEGMENTS - 1) << (PAGE_SHIFT - 9); + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + error = nvme_init_identify(&ctrl->ctrl); if (error) goto out_cleanup_queue; @@ -394,6 +393,8 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl) out_cleanup_queue: blk_cleanup_queue(ctrl->ctrl.admin_q); +out_cleanup_fabrics_q: + blk_cleanup_queue(ctrl->ctrl.fabrics_q); out_free_tagset: blk_mq_free_tag_set(&ctrl->admin_tag_set); out_free_sq: @@ -407,16 +408,17 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl) nvme_stop_queues(&ctrl->ctrl); blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request, &ctrl->ctrl); + blk_mq_tagset_wait_completed_request(&ctrl->tag_set); nvme_loop_destroy_io_queues(ctrl); } + blk_mq_quiesce_queue(ctrl->ctrl.admin_q); if (ctrl->ctrl.state == NVME_CTRL_LIVE) nvme_shutdown_ctrl(&ctrl->ctrl); - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_cancel_request, &ctrl->ctrl); - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + blk_mq_tagset_wait_completed_request(&ctrl->admin_tag_set); nvme_loop_destroy_admin_queue(ctrl); } diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c index 69b83fa0c76c..bf4f03474e89 100644 --- a/drivers/nvme/target/tcp.c +++ b/drivers/nvme/target/tcp.c @@ -348,7 +348,8 @@ static int nvmet_tcp_map_data(struct nvmet_tcp_cmd *cmd) return 0; err: - sgl_free(cmd->req.sg); + if (cmd->req.sg_cnt) + sgl_free(cmd->req.sg); return NVME_SC_INTERNAL; } @@ -553,7 +554,8 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd) if (queue->nvme_sq.sqhd_disabled) { kfree(cmd->iov); - sgl_free(cmd->req.sg); + if (cmd->req.sg_cnt) + sgl_free(cmd->req.sg); } return 1; @@ -584,7 +586,8 @@ static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd, return -EAGAIN; kfree(cmd->iov); - sgl_free(cmd->req.sg); + if (cmd->req.sg_cnt) + sgl_free(cmd->req.sg); cmd->queue->snd_cmd = NULL; nvmet_tcp_put_cmd(cmd); return 1; @@ -1306,7 +1309,9 @@ static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd) { nvmet_req_uninit(&cmd->req); nvmet_tcp_unmap_pdu_iovec(cmd); - sgl_free(cmd->req.sg); + kfree(cmd->iov); + if (cmd->req.sg_cnt) + sgl_free(cmd->req.sg); } static void nvmet_tcp_uninit_data_in_cmds(struct nvmet_tcp_queue *queue) @@ -1410,6 +1415,7 @@ done: static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue) { struct socket *sock = queue->sock; + struct inet_sock *inet = inet_sk(sock->sk); struct linger sol = { .l_onoff = 1, .l_linger = 0 }; int ret; @@ -1433,6 +1439,16 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue) if (ret) return ret; + /* Set socket type of service */ + if (inet->rcv_tos > 0) { + int tos = inet->rcv_tos; + + ret = kernel_setsockopt(sock, SOL_IP, IP_TOS, + (char *)&tos, sizeof(tos)); + if (ret) + return ret; + } + write_lock_bh(&sock->sk->sk_callback_lock); sock->sk->sk_user_data = queue; queue->data_ready = sock->sk->sk_data_ready; diff --git a/drivers/nvme/target/trace.c b/drivers/nvme/target/trace.c index 6af11d493271..1373a3c67962 100644 --- a/drivers/nvme/target/trace.c +++ b/drivers/nvme/target/trace.c @@ -33,6 +33,22 @@ static const char *nvmet_trace_admin_get_features(struct trace_seq *p, return ret; } +static const char *nvmet_trace_get_lba_status(struct trace_seq *p, + u8 *cdw10) +{ + const char *ret = trace_seq_buffer_ptr(p); + u64 slba = get_unaligned_le64(cdw10); + u32 mndw = get_unaligned_le32(cdw10 + 8); + u16 rl = get_unaligned_le16(cdw10 + 12); + u8 atype = cdw10[15]; + + trace_seq_printf(p, "slba=0x%llx, mndw=0x%x, rl=0x%x, atype=%u", + slba, mndw, rl, atype); + trace_seq_putc(p, 0); + + return ret; +} + static const char *nvmet_trace_read_write(struct trace_seq *p, u8 *cdw10) { const char *ret = trace_seq_buffer_ptr(p); @@ -80,6 +96,8 @@ const char *nvmet_trace_parse_admin_cmd(struct trace_seq *p, return nvmet_trace_admin_identify(p, cdw10); case nvme_admin_get_features: return nvmet_trace_admin_get_features(p, cdw10); + case nvme_admin_get_lba_status: + return nvmet_trace_get_lba_status(p, cdw10); default: return nvmet_trace_common(p, cdw10); } diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 11e64b50497f..4e88d7e9cf9a 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1089,6 +1089,18 @@ static void scsi_initialize_rq(struct request *rq) cmd->retries = 0; } +/* + * Only called when the request isn't completed by SCSI, and not freed by + * SCSI + */ +static void scsi_cleanup_rq(struct request *rq) +{ + if (rq->rq_flags & RQF_DONTPREP) { + scsi_mq_uninit_cmd(blk_mq_rq_to_pdu(rq)); + rq->rq_flags &= ~RQF_DONTPREP; + } +} + /* Add a command to the list used by the aacraid and dpt_i2o drivers */ void scsi_add_cmd_to_list(struct scsi_cmnd *cmd) { @@ -1821,6 +1833,7 @@ static const struct blk_mq_ops scsi_mq_ops = { .init_request = scsi_mq_init_request, .exit_request = scsi_mq_exit_request, .initialize_rq_fn = scsi_initialize_rq, + .cleanup_rq = scsi_cleanup_rq, .busy = scsi_mq_lld_busy, .map_queues = scsi_map_queues, }; diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c index 74ded5f3c236..3717eea37ecb 100644 --- a/drivers/scsi/scsi_pm.c +++ b/drivers/scsi/scsi_pm.c @@ -94,8 +94,7 @@ static int scsi_dev_type_resume(struct device *dev, if (!err && scsi_is_sdev_device(dev)) { struct scsi_device *sdev = to_scsi_device(dev); - if (sdev->request_queue->dev) - blk_set_runtime_active(sdev->request_queue); + blk_set_runtime_active(sdev->request_queue); } } diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 149d406aacc9..4b925552458f 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1293,7 +1293,9 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd) case REQ_OP_WRITE: return sd_setup_read_write_cmnd(cmd); case REQ_OP_ZONE_RESET: - return sd_zbc_setup_reset_cmnd(cmd); + return sd_zbc_setup_reset_cmnd(cmd, false); + case REQ_OP_ZONE_RESET_ALL: + return sd_zbc_setup_reset_cmnd(cmd, true); default: WARN_ON_ONCE(1); return BLK_STS_NOTSUPP; @@ -1959,6 +1961,7 @@ static int sd_done(struct scsi_cmnd *SCpnt) case REQ_OP_WRITE_ZEROES: case REQ_OP_WRITE_SAME: case REQ_OP_ZONE_RESET: + case REQ_OP_ZONE_RESET_ALL: if (!result) { good_bytes = blk_rq_bytes(req); scsi_set_resid(SCpnt, 0); diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 38c50946fc42..1eab779f812b 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -209,7 +209,7 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp) extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer); extern void sd_zbc_print_zones(struct scsi_disk *sdkp); -extern blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd); +extern blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd, bool all); extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, struct scsi_sense_hdr *sshdr); extern int sd_zbc_report_zones(struct gendisk *disk, sector_t sector, @@ -225,7 +225,8 @@ static inline int sd_zbc_read_zones(struct scsi_disk *sdkp, static inline void sd_zbc_print_zones(struct scsi_disk *sdkp) {} -static inline blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) +static inline blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd, + bool all) { return BLK_STS_TARGET; } diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 5d6ff3931632..de4019dc0f0b 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -209,10 +209,11 @@ static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp) /** * sd_zbc_setup_reset_cmnd - Prepare a RESET WRITE POINTER scsi command. * @cmd: the command to setup + * @all: Reset all zones control. * * Called from sd_init_command() for a REQ_OP_ZONE_RESET request. */ -blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) +blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd, bool all) { struct request *rq = cmd->request; struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); @@ -234,7 +235,10 @@ blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) memset(cmd->cmnd, 0, cmd->cmd_len); cmd->cmnd[0] = ZBC_OUT; cmd->cmnd[1] = ZO_RESET_WRITE_POINTER; - put_unaligned_be64(block, &cmd->cmnd[2]); + if (all) + cmd->cmnd[14] = 0x1; + else + put_unaligned_be64(block, &cmd->cmnd[2]); rq->timeout = SD_TIMEOUT; cmd->sc_data_direction = DMA_NONE; @@ -261,6 +265,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, switch (req_op(rq)) { case REQ_OP_ZONE_RESET: + case REQ_OP_ZONE_RESET_ALL: if (result && sshdr->sense_key == ILLEGAL_REQUEST && @@ -487,6 +492,9 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf) /* The drive satisfies the kernel restrictions: set it up */ blk_queue_chunk_sectors(sdkp->disk->queue, logical_to_sectors(sdkp->device, zone_blocks)); + blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue); + blk_queue_required_elevator_features(sdkp->disk->queue, + ELEVATOR_F_ZBD_SEQ_WRITE); nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks); /* READ16/WRITE16 is mandatory for ZBC disks */ diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 542b02d170f8..8aaa7eec7b74 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -36,10 +36,6 @@ */ #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_SHIFT - 10)) -struct wb_completion { - atomic_t cnt; -}; - /* * Passed into wb_writeback(), essentially a subset of writeback_control */ @@ -61,19 +57,6 @@ struct wb_writeback_work { }; /* - * If one wants to wait for one or more wb_writeback_works, each work's - * ->done should be set to a wb_completion defined using the following - * macro. Once all work items are issued with wb_queue_work(), the caller - * can wait for the completion of all using wb_wait_for_completion(). Work - * items which are waited upon aren't freed automatically on completion. - */ -#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \ - struct wb_completion cmpl = { \ - .cnt = ATOMIC_INIT(1), \ - } - - -/* * If an inode is constantly having its pages dirtied, but then the * updates stop dirtytime_expire_interval seconds in the past, it's * possible for the worst case time between when an inode has its @@ -182,7 +165,7 @@ static void finish_writeback_work(struct bdi_writeback *wb, if (work->auto_free) kfree(work); if (done && atomic_dec_and_test(&done->cnt)) - wake_up_all(&wb->bdi->wb_waitq); + wake_up_all(done->waitq); } static void wb_queue_work(struct bdi_writeback *wb, @@ -206,28 +189,44 @@ static void wb_queue_work(struct bdi_writeback *wb, /** * wb_wait_for_completion - wait for completion of bdi_writeback_works - * @bdi: bdi work items were issued to * @done: target wb_completion * * Wait for one or more work items issued to @bdi with their ->done field - * set to @done, which should have been defined with - * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such - * work items are completed. Work items which are waited upon aren't freed + * set to @done, which should have been initialized with + * DEFINE_WB_COMPLETION(). This function returns after all such work items + * are completed. Work items which are waited upon aren't freed * automatically on completion. */ -static void wb_wait_for_completion(struct backing_dev_info *bdi, - struct wb_completion *done) +void wb_wait_for_completion(struct wb_completion *done) { atomic_dec(&done->cnt); /* put down the initial count */ - wait_event(bdi->wb_waitq, !atomic_read(&done->cnt)); + wait_event(*done->waitq, !atomic_read(&done->cnt)); } #ifdef CONFIG_CGROUP_WRITEBACK -/* parameters for foreign inode detection, see wb_detach_inode() */ +/* + * Parameters for foreign inode detection, see wbc_detach_inode() to see + * how they're used. + * + * These paramters are inherently heuristical as the detection target + * itself is fuzzy. All we want to do is detaching an inode from the + * current owner if it's being written to by some other cgroups too much. + * + * The current cgroup writeback is built on the assumption that multiple + * cgroups writing to the same inode concurrently is very rare and a mode + * of operation which isn't well supported. As such, the goal is not + * taking too long when a different cgroup takes over an inode while + * avoiding too aggressive flip-flops from occasional foreign writes. + * + * We record, very roughly, 2s worth of IO time history and if more than + * half of that is foreign, trigger the switch. The recording is quantized + * to 16 slots. To avoid tiny writes from swinging the decision too much, + * writes smaller than 1/8 of avg size are ignored. + */ #define WB_FRN_TIME_SHIFT 13 /* 1s = 2^13, upto 8 secs w/ 16bit */ #define WB_FRN_TIME_AVG_SHIFT 3 /* avg = avg * 7/8 + new * 1/8 */ -#define WB_FRN_TIME_CUT_DIV 2 /* ignore rounds < avg / 2 */ +#define WB_FRN_TIME_CUT_DIV 8 /* ignore rounds < avg / 8 */ #define WB_FRN_TIME_PERIOD (2 * (1 << WB_FRN_TIME_SHIFT)) /* 2s */ #define WB_FRN_HIST_SLOTS 16 /* inode->i_wb_frn_history is 16bit */ @@ -237,6 +236,7 @@ static void wb_wait_for_completion(struct backing_dev_info *bdi, /* if foreign slots >= 8, switch */ #define WB_FRN_HIST_MAX_SLOTS (WB_FRN_HIST_THR_SLOTS / 2 + 1) /* one round can affect upto 5 slots */ +#define WB_FRN_MAX_IN_FLIGHT 1024 /* don't queue too many concurrently */ static atomic_t isw_nr_in_flight = ATOMIC_INIT(0); static struct workqueue_struct *isw_wq; @@ -389,6 +389,8 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) if (unlikely(inode->i_state & I_FREEING)) goto skip_switch; + trace_inode_switch_wbs(inode, old_wb, new_wb); + /* * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to @@ -489,18 +491,13 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) if (inode->i_state & I_WB_SWITCH) return; - /* - * Avoid starting new switches while sync_inodes_sb() is in - * progress. Otherwise, if the down_write protected issue path - * blocks heavily, we might end up starting a large number of - * switches which will block on the rwsem. - */ - if (!down_read_trylock(&bdi->wb_switch_rwsem)) + /* avoid queueing a new switch if too many are already in flight */ + if (atomic_read(&isw_nr_in_flight) > WB_FRN_MAX_IN_FLIGHT) return; isw = kzalloc(sizeof(*isw), GFP_ATOMIC); if (!isw) - goto out_unlock; + return; /* find and pin the new wb */ rcu_read_lock(); @@ -534,15 +531,12 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn); atomic_inc(&isw_nr_in_flight); - - goto out_unlock; + return; out_free: if (isw->new_wb) wb_put(isw->new_wb); kfree(isw); -out_unlock: - up_read(&bdi->wb_switch_rwsem); } /** @@ -681,6 +675,9 @@ void wbc_detach_inode(struct writeback_control *wbc) if (wbc->wb_id != max_id) history |= (1U << slots) - 1; + if (history) + trace_inode_foreign_history(inode, wbc, history); + /* * Switch if the current wb isn't the consistent winner. * If there are multiple closely competing dirtiers, the @@ -843,7 +840,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, restart: rcu_read_lock(); list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) { - DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done); + DEFINE_WB_COMPLETION(fallback_work_done, bdi); struct wb_writeback_work fallback_work; struct wb_writeback_work *work; long nr_pages; @@ -890,7 +887,7 @@ restart: last_wb = wb; rcu_read_unlock(); - wb_wait_for_completion(bdi, &fallback_work_done); + wb_wait_for_completion(&fallback_work_done); goto restart; } rcu_read_unlock(); @@ -900,6 +897,89 @@ restart: } /** + * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs + * @bdi_id: target bdi id + * @memcg_id: target memcg css id + * @nr_pages: number of pages to write, 0 for best-effort dirty flushing + * @reason: reason why some writeback work initiated + * @done: target wb_completion + * + * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id + * with the specified parameters. + */ +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr, + enum wb_reason reason, struct wb_completion *done) +{ + struct backing_dev_info *bdi; + struct cgroup_subsys_state *memcg_css; + struct bdi_writeback *wb; + struct wb_writeback_work *work; + int ret; + + /* lookup bdi and memcg */ + bdi = bdi_get_by_id(bdi_id); + if (!bdi) + return -ENOENT; + + rcu_read_lock(); + memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys); + if (memcg_css && !css_tryget(memcg_css)) + memcg_css = NULL; + rcu_read_unlock(); + if (!memcg_css) { + ret = -ENOENT; + goto out_bdi_put; + } + + /* + * And find the associated wb. If the wb isn't there already + * there's nothing to flush, don't create one. + */ + wb = wb_get_lookup(bdi, memcg_css); + if (!wb) { + ret = -ENOENT; + goto out_css_put; + } + + /* + * If @nr is zero, the caller is attempting to write out most of + * the currently dirty pages. Let's take the current dirty page + * count and inflate it by 25% which should be large enough to + * flush out most dirty pages while avoiding getting livelocked by + * concurrent dirtiers. + */ + if (!nr) { + unsigned long filepages, headroom, dirty, writeback; + + mem_cgroup_wb_stats(wb, &filepages, &headroom, &dirty, + &writeback); + nr = dirty * 10 / 8; + } + + /* issue the writeback work */ + work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN); + if (work) { + work->nr_pages = nr; + work->sync_mode = WB_SYNC_NONE; + work->range_cyclic = 1; + work->reason = reason; + work->done = done; + work->auto_free = 1; + wb_queue_work(wb, work); + ret = 0; + } else { + ret = -ENOMEM; + } + + wb_put(wb); +out_css_put: + css_put(memcg_css); +out_bdi_put: + bdi_put(bdi); + return ret; +} + +/** * cgroup_writeback_umount - flush inode wb switches for umount * * This function is called when a super_block is about to be destroyed and @@ -2362,7 +2442,8 @@ static void wait_sb_inodes(struct super_block *sb) static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, enum wb_reason reason, bool skip_if_busy) { - DEFINE_WB_COMPLETION_ONSTACK(done); + struct backing_dev_info *bdi = sb->s_bdi; + DEFINE_WB_COMPLETION(done, bdi); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -2371,14 +2452,13 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, .nr_pages = nr, .reason = reason, }; - struct backing_dev_info *bdi = sb->s_bdi; if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); - wb_wait_for_completion(bdi, &done); + wb_wait_for_completion(&done); } /** @@ -2440,7 +2520,8 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb); */ void sync_inodes_sb(struct super_block *sb) { - DEFINE_WB_COMPLETION_ONSTACK(done); + struct backing_dev_info *bdi = sb->s_bdi; + DEFINE_WB_COMPLETION(done, bdi); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -2450,7 +2531,6 @@ void sync_inodes_sb(struct super_block *sb) .reason = WB_REASON_SYNC, .for_sync = 1, }; - struct backing_dev_info *bdi = sb->s_bdi; /* * Can't skip on !bdi_has_dirty() because we should wait for !dirty @@ -2464,7 +2544,7 @@ void sync_inodes_sb(struct super_block *sb) /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ bdi_down_write_wb_switch_rwsem(bdi); bdi_split_work_to_wbs(bdi, &work, false); - wb_wait_for_completion(bdi, &done); + wb_wait_for_completion(&done); bdi_up_write_wb_switch_rwsem(bdi); wait_sb_inodes(sb); diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 6a1a8a314d85..4fc87dee005a 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -63,10 +63,31 @@ enum wb_reason { * so it has a mismatch name. */ WB_REASON_FORKER_THREAD, + WB_REASON_FOREIGN_FLUSH, WB_REASON_MAX, }; +struct wb_completion { + atomic_t cnt; + wait_queue_head_t *waitq; +}; + +#define __WB_COMPLETION_INIT(_waitq) \ + (struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) } + +/* + * If one wants to wait for one or more wb_writeback_works, each work's + * ->done should be set to a wb_completion defined using the following + * macro. Once all work items are issued with wb_queue_work(), the caller + * can wait for the completion of all using wb_wait_for_completion(). Work + * items which are waited upon aren't freed automatically on completion. + */ +#define WB_COMPLETION_INIT(bdi) __WB_COMPLETION_INIT(&(bdi)->wb_waitq) + +#define DEFINE_WB_COMPLETION(cmpl, bdi) \ + struct wb_completion cmpl = WB_COMPLETION_INIT(bdi) + /* * For cgroup writeback, multiple wb's may map to the same blkcg. Those * wb's can operate mostly independently but should share the congested @@ -165,6 +186,8 @@ struct bdi_writeback { }; struct backing_dev_info { + u64 id; + struct rb_node rb_node; /* keyed by ->id */ struct list_head bdi_list; unsigned long ra_pages; /* max readahead in PAGE_SIZE units */ unsigned long io_pages; /* max allowed IO size */ diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 35b31d176f74..97967ce06de3 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi) return bdi; } +struct backing_dev_info *bdi_get_by_id(u64 id); void bdi_put(struct backing_dev_info *bdi); __printf(2, 3) @@ -44,6 +45,8 @@ void wb_start_background_writeback(struct bdi_writeback *wb); void wb_workfn(struct work_struct *work); void wb_wakeup_delayed(struct bdi_writeback *wb); +void wb_wait_for_completion(struct wb_completion *done); + extern spinlock_t bdi_lock; extern struct list_head bdi_list; @@ -227,6 +230,8 @@ static inline int bdi_sched_wait(void *word) struct bdi_writeback_congested * wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp); void wb_congested_put(struct bdi_writeback_congested *congested); +struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css); struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp); diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 12811091fd50..bed9e43f9426 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -149,7 +149,8 @@ typedef struct blkcg_policy_data *(blkcg_pol_alloc_cpd_fn)(gfp_t gfp); typedef void (blkcg_pol_init_cpd_fn)(struct blkcg_policy_data *cpd); typedef void (blkcg_pol_free_cpd_fn)(struct blkcg_policy_data *cpd); typedef void (blkcg_pol_bind_cpd_fn)(struct blkcg_policy_data *cpd); -typedef struct blkg_policy_data *(blkcg_pol_alloc_pd_fn)(gfp_t gfp, int node); +typedef struct blkg_policy_data *(blkcg_pol_alloc_pd_fn)(gfp_t gfp, + struct request_queue *q, struct blkcg *blkcg); typedef void (blkcg_pol_init_pd_fn)(struct blkg_policy_data *pd); typedef void (blkcg_pol_online_pd_fn)(struct blkg_policy_data *pd); typedef void (blkcg_pol_offline_pd_fn)(struct blkg_policy_data *pd); @@ -233,6 +234,7 @@ struct blkg_conf_ctx { char *body; }; +struct gendisk *blkcg_conf_get_disk(char **inputp); int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, char *input, struct blkg_conf_ctx *ctx); void blkg_conf_finish(struct blkg_conf_ctx *ctx); @@ -375,7 +377,7 @@ static inline struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, * @q: request_queue of interest * * Lookup blkg for the @blkcg - @q pair. This function should be called - * under RCU read loc. + * under RCU read lock. */ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q) diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 3fa1fa59f9b2..0bf056de5cc3 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -140,6 +140,7 @@ typedef int (poll_fn)(struct blk_mq_hw_ctx *); typedef int (map_queues_fn)(struct blk_mq_tag_set *set); typedef bool (busy_fn)(struct request_queue *); typedef void (complete_fn)(struct request *); +typedef void (cleanup_rq_fn)(struct request *); struct blk_mq_ops { @@ -201,6 +202,12 @@ struct blk_mq_ops { void (*initialize_rq_fn)(struct request *rq); /* + * Called before freeing one request which isn't completed yet, + * and usually for freeing the driver private data + */ + cleanup_rq_fn *cleanup_rq; + + /* * If set, returns whether or not this queue currently is busy */ busy_fn *busy; @@ -241,12 +248,12 @@ enum { struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *); struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, - struct request_queue *q); + struct request_queue *q, + bool elevator_init); struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set, const struct blk_mq_ops *ops, unsigned int queue_depth, unsigned int set_flags); -int blk_mq_register_dev(struct device *, struct request_queue *); void blk_mq_unregister_dev(struct device *, struct request_queue *); int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set); @@ -296,6 +303,7 @@ static inline u16 blk_mq_unique_tag_to_tag(u32 unique_tag) int blk_mq_request_started(struct request *rq); +int blk_mq_request_completed(struct request *rq); void blk_mq_start_request(struct request *rq); void blk_mq_end_request(struct request *rq, blk_status_t error); void __blk_mq_end_request(struct request *rq, blk_status_t error); @@ -304,7 +312,6 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list); void blk_mq_kick_requeue_list(struct request_queue *q); void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs); bool blk_mq_complete_request(struct request *rq); -void blk_mq_complete_request_sync(struct request *rq); bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list, struct bio *bio, unsigned int nr_segs); bool blk_mq_queue_stopped(struct request_queue *q); @@ -321,6 +328,7 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); void blk_mq_run_hw_queues(struct request_queue *q, bool async); void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, busy_tag_iter_fn *fn, void *priv); +void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset); void blk_mq_freeze_queue(struct request_queue *q); void blk_mq_unfreeze_queue(struct request_queue *q); void blk_freeze_queue_start(struct request_queue *q); @@ -366,4 +374,10 @@ static inline blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, BLK_QC_T_INTERNAL; } +static inline void blk_mq_cleanup_rq(struct request *rq) +{ + if (rq->q->mq_ops->cleanup_rq) + rq->q->mq_ops->cleanup_rq(rq); +} + #endif diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 1b1fa1557e68..d688b96d1d63 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -169,6 +169,9 @@ struct bio { */ struct blkcg_gq *bi_blkg; struct bio_issue bi_issue; +#ifdef CONFIG_BLK_CGROUP_IOCOST + u64 bi_iocost_cost; +#endif #endif union { #if defined(CONFIG_BLK_DEV_INTEGRITY) @@ -209,6 +212,7 @@ enum { BIO_BOUNCED, /* bio is a bounce bio */ BIO_USER_MAPPED, /* contains user pages */ BIO_NULL_MAPPED, /* contains invalid user pages */ + BIO_WORKINGSET, /* contains userspace workingset pages */ BIO_QUIET, /* Make BIO Quiet */ BIO_CHAIN, /* chained bio, ->bi_remaining in effect */ BIO_REFFED, /* bio has elevated ->bi_cnt */ @@ -282,6 +286,8 @@ enum req_opf { REQ_OP_ZONE_RESET = 6, /* write the same sector many times */ REQ_OP_WRITE_SAME = 7, + /* reset all the zone present on the device */ + REQ_OP_ZONE_RESET_ALL = 8, /* write the zero filled sector many times */ REQ_OP_WRITE_ZEROES = 9, diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 1ef375dafb1c..3094f2d513b2 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -194,7 +194,11 @@ struct request { struct gendisk *rq_disk; struct hd_struct *part; - /* Time that I/O was submitted to the kernel. */ +#ifdef CONFIG_BLK_RQ_ALLOC_TIME + /* Time that the first bio started allocating this request. */ + u64 alloc_time_ns; +#endif + /* Time that this request was allocated for this IO. */ u64 start_time_ns; /* Time that I/O was submitted to the device. */ u64 io_start_time_ns; @@ -202,9 +206,12 @@ struct request { #ifdef CONFIG_BLK_WBT unsigned short wbt_flags; #endif -#ifdef CONFIG_BLK_DEV_THROTTLING_LOW - unsigned short throtl_size; -#endif + /* + * rq sectors used for blk stats. It has the same value + * with blk_rq_sectors(rq), except that it never be zeroed + * by completion. + */ + unsigned short stats_sectors; /* * Number of scatter-gather DMA addr+len pairs after @@ -391,10 +398,6 @@ static inline int blkdev_reset_zones_ioctl(struct block_device *bdev, #endif /* CONFIG_BLK_DEV_ZONED */ struct request_queue { - /* - * Together with queue_head for cacheline sharing - */ - struct list_head queue_head; struct request *last_merge; struct elevator_queue *elevator; @@ -496,6 +499,8 @@ struct request_queue { struct queue_limits limits; + unsigned int required_elevator_features; + #ifdef CONFIG_BLK_DEV_ZONED /* * Zoned block device information for request dispatch control. @@ -539,6 +544,7 @@ struct request_queue { struct delayed_work requeue_work; struct mutex sysfs_lock; + struct mutex sysfs_dir_lock; /* * for reusing dead hctx instance in case of updating @@ -611,6 +617,8 @@ struct request_queue { #define QUEUE_FLAG_SCSI_PASSTHROUGH 23 /* queue supports SCSI commands */ #define QUEUE_FLAG_QUIESCED 24 /* queue has been quiesced */ #define QUEUE_FLAG_PCI_P2PDMA 25 /* device supports PCI p2p requests */ +#define QUEUE_FLAG_ZONE_RESETALL 26 /* supports Zone Reset All */ +#define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */ #define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_SAME_COMP)) @@ -630,6 +638,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q); #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags) #define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags) #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags) +#define blk_queue_zone_resetall(q) \ + test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags) #define blk_queue_secure_erase(q) \ (test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags)) #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags) @@ -637,6 +647,12 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q); test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags) #define blk_queue_pci_p2pdma(q) \ test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags) +#ifdef CONFIG_BLK_RQ_ALLOC_TIME +#define blk_queue_rq_alloc_time(q) \ + test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags) +#else +#define blk_queue_rq_alloc_time(q) false +#endif #define blk_noretry_request(rq) \ ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ @@ -644,6 +660,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q); #define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags) #define blk_queue_pm_only(q) atomic_read(&(q)->pm_only) #define blk_queue_fua(q) test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags) +#define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags) extern void blk_set_pm_only(struct request_queue *q); extern void blk_clear_pm_only(struct request_queue *q); @@ -903,6 +920,7 @@ static inline struct request_queue *bdev_get_queue(struct block_device *bdev) * blk_rq_err_bytes() : bytes left till the next error boundary * blk_rq_sectors() : sectors left in the entire request * blk_rq_cur_sectors() : sectors left in the current segment + * blk_rq_stats_sectors() : sectors of the entire request used for stats */ static inline sector_t blk_rq_pos(const struct request *rq) { @@ -931,6 +949,11 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq) return blk_rq_cur_bytes(rq) >> SECTOR_SHIFT; } +static inline unsigned int blk_rq_stats_sectors(const struct request *rq) +{ + return rq->stats_sectors; +} + #ifdef CONFIG_BLK_DEV_ZONED static inline unsigned int blk_rq_zone_no(struct request *rq) { @@ -1085,6 +1108,8 @@ extern void blk_queue_dma_alignment(struct request_queue *, int); extern void blk_queue_update_dma_alignment(struct request_queue *, int); extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua); +extern void blk_queue_required_elevator_features(struct request_queue *q, + unsigned int features); /* * Number of physical segments as sent to the device. @@ -1232,42 +1257,42 @@ enum blk_default_limits { BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL, }; -static inline unsigned long queue_segment_boundary(struct request_queue *q) +static inline unsigned long queue_segment_boundary(const struct request_queue *q) { return q->limits.seg_boundary_mask; } -static inline unsigned long queue_virt_boundary(struct request_queue *q) +static inline unsigned long queue_virt_boundary(const struct request_queue *q) { return q->limits.virt_boundary_mask; } -static inline unsigned int queue_max_sectors(struct request_queue *q) +static inline unsigned int queue_max_sectors(const struct request_queue *q) { return q->limits.max_sectors; } -static inline unsigned int queue_max_hw_sectors(struct request_queue *q) +static inline unsigned int queue_max_hw_sectors(const struct request_queue *q) { return q->limits.max_hw_sectors; } -static inline unsigned short queue_max_segments(struct request_queue *q) +static inline unsigned short queue_max_segments(const struct request_queue *q) { return q->limits.max_segments; } -static inline unsigned short queue_max_discard_segments(struct request_queue *q) +static inline unsigned short queue_max_discard_segments(const struct request_queue *q) { return q->limits.max_discard_segments; } -static inline unsigned int queue_max_segment_size(struct request_queue *q) +static inline unsigned int queue_max_segment_size(const struct request_queue *q) { return q->limits.max_segment_size; } -static inline unsigned short queue_logical_block_size(struct request_queue *q) +static inline unsigned short queue_logical_block_size(const struct request_queue *q) { int retval = 512; @@ -1282,7 +1307,7 @@ static inline unsigned short bdev_logical_block_size(struct block_device *bdev) return queue_logical_block_size(bdev_get_queue(bdev)); } -static inline unsigned int queue_physical_block_size(struct request_queue *q) +static inline unsigned int queue_physical_block_size(const struct request_queue *q) { return q->limits.physical_block_size; } @@ -1292,7 +1317,7 @@ static inline unsigned int bdev_physical_block_size(struct block_device *bdev) return queue_physical_block_size(bdev_get_queue(bdev)); } -static inline unsigned int queue_io_min(struct request_queue *q) +static inline unsigned int queue_io_min(const struct request_queue *q) { return q->limits.io_min; } @@ -1302,7 +1327,7 @@ static inline int bdev_io_min(struct block_device *bdev) return queue_io_min(bdev_get_queue(bdev)); } -static inline unsigned int queue_io_opt(struct request_queue *q) +static inline unsigned int queue_io_opt(const struct request_queue *q) { return q->limits.io_opt; } @@ -1312,7 +1337,7 @@ static inline int bdev_io_opt(struct block_device *bdev) return queue_io_opt(bdev_get_queue(bdev)); } -static inline int queue_alignment_offset(struct request_queue *q) +static inline int queue_alignment_offset(const struct request_queue *q) { if (q->limits.misaligned) return -1; @@ -1342,7 +1367,7 @@ static inline int bdev_alignment_offset(struct block_device *bdev) return q->limits.alignment_offset; } -static inline int queue_discard_alignment(struct request_queue *q) +static inline int queue_discard_alignment(const struct request_queue *q) { if (q->limits.discard_misaligned) return -1; @@ -1432,7 +1457,7 @@ static inline sector_t bdev_zone_sectors(struct block_device *bdev) return 0; } -static inline int queue_dma_alignment(struct request_queue *q) +static inline int queue_dma_alignment(const struct request_queue *q) { return q ? q->dma_alignment : 511; } @@ -1543,7 +1568,7 @@ static inline void blk_queue_max_integrity_segments(struct request_queue *q, } static inline unsigned short -queue_max_integrity_segments(struct request_queue *q) +queue_max_integrity_segments(const struct request_queue *q) { return q->limits.max_integrity_segments; } @@ -1626,7 +1651,7 @@ static inline void blk_queue_max_integrity_segments(struct request_queue *q, unsigned int segs) { } -static inline unsigned short queue_max_integrity_segments(struct request_queue *q) +static inline unsigned short queue_max_integrity_segments(const struct request_queue *q) { return 0; } diff --git a/include/linux/elevator.h b/include/linux/elevator.h index 1dd014c9c87b..901bda352dcb 100644 --- a/include/linux/elevator.h +++ b/include/linux/elevator.h @@ -76,6 +76,7 @@ struct elevator_type struct elv_fs_entry *elevator_attrs; const char *elevator_name; const char *elevator_alias; + const unsigned int elevator_features; struct module *elevator_owner; #ifdef CONFIG_BLK_DEBUG_FS const struct blk_mq_debugfs_attr *queue_debugfs_attrs; @@ -165,5 +166,12 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t); #define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist) #define rq_fifo_clear(rq) list_del_init(&(rq)->queuelist) +/* + * Elevator features. + */ + +/* Supports zoned block devices sequential write constraint */ +#define ELEVATOR_F_ZBD_SEQ_WRITE (1U << 0) + #endif /* CONFIG_BLOCK */ #endif diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h index 4d0d5655c7b2..ee8ec2e68055 100644 --- a/include/linux/lightnvm.h +++ b/include/linux/lightnvm.h @@ -88,8 +88,7 @@ typedef int (nvm_op_bb_tbl_fn)(struct nvm_dev *, struct ppa_addr, u8 *); typedef int (nvm_op_set_bb_fn)(struct nvm_dev *, struct ppa_addr *, int, int); typedef int (nvm_get_chk_meta_fn)(struct nvm_dev *, sector_t, int, struct nvm_chk_meta *); -typedef int (nvm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *); -typedef int (nvm_submit_io_sync_fn)(struct nvm_dev *, struct nvm_rq *); +typedef int (nvm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *, void *); typedef void *(nvm_create_dma_pool_fn)(struct nvm_dev *, char *, int); typedef void (nvm_destroy_dma_pool_fn)(void *); typedef void *(nvm_dev_dma_alloc_fn)(struct nvm_dev *, void *, gfp_t, @@ -104,7 +103,6 @@ struct nvm_dev_ops { nvm_get_chk_meta_fn *get_chk_meta; nvm_submit_io_fn *submit_io; - nvm_submit_io_sync_fn *submit_io_sync; nvm_create_dma_pool_fn *create_dma_pool; nvm_destroy_dma_pool_fn *destroy_dma_pool; @@ -682,8 +680,8 @@ extern int nvm_get_chunk_meta(struct nvm_tgt_dev *, struct ppa_addr, int, struct nvm_chk_meta *); extern int nvm_set_chunk_meta(struct nvm_tgt_dev *, struct ppa_addr *, int, int); -extern int nvm_submit_io(struct nvm_tgt_dev *, struct nvm_rq *); -extern int nvm_submit_io_sync(struct nvm_tgt_dev *, struct nvm_rq *); +extern int nvm_submit_io(struct nvm_tgt_dev *, struct nvm_rq *, void *); +extern int nvm_submit_io_sync(struct nvm_tgt_dev *, struct nvm_rq *, void *); extern void nvm_end_io(struct nvm_rq *); #else /* CONFIG_NVM */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2cd4359cb38c..ad8f1a397ae4 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -184,6 +184,23 @@ struct memcg_padding { #endif /* + * Remember four most recent foreign writebacks with dirty pages in this + * cgroup. Inode sharing is expected to be uncommon and, even if we miss + * one in a given round, we're likely to catch it later if it keeps + * foreign-dirtying, so a fairly low count should be enough. + * + * See mem_cgroup_track_foreign_dirty_slowpath() for details. + */ +#define MEMCG_CGWB_FRN_CNT 4 + +struct memcg_cgwb_frn { + u64 bdi_id; /* bdi->id of the foreign inode */ + int memcg_id; /* memcg->css.id of foreign inode */ + u64 at; /* jiffies_64 at the time of dirtying */ + struct wb_completion done; /* tracks in-flight foreign writebacks */ +}; + +/* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. We would eventually like to provide * statistics based on the statistics developed by Rik Van Riel for clock-pro, @@ -307,6 +324,7 @@ struct mem_cgroup { #ifdef CONFIG_CGROUP_WRITEBACK struct list_head cgwb_list; struct wb_domain cgwb_domain; + struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT]; #endif /* List of events which userspace want to receive */ @@ -1237,6 +1255,18 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, unsigned long *pheadroom, unsigned long *pdirty, unsigned long *pwriteback); +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, + struct bdi_writeback *wb); + +static inline void mem_cgroup_track_foreign_dirty(struct page *page, + struct bdi_writeback *wb) +{ + if (unlikely(&page->mem_cgroup->css != wb->memcg_css)) + mem_cgroup_track_foreign_dirty_slowpath(page, wb); +} + +void mem_cgroup_flush_foreign(struct bdi_writeback *wb); + #else /* CONFIG_CGROUP_WRITEBACK */ static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) @@ -1252,6 +1282,15 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, { } +static inline void mem_cgroup_track_foreign_dirty(struct page *page, + struct bdi_writeback *wb) +{ +} + +static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb) +{ +} + #endif /* CONFIG_CGROUP_WRITEBACK */ struct sock; diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 01aa6a6c241d..f61d6906e59d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -140,6 +140,7 @@ enum { * Submission and Completion Queue Entry Sizes for the NVM command set. * (In bytes and specified as a power of two (2^n)). */ +#define NVME_ADM_SQES 6 #define NVME_NVM_IOSQES 6 #define NVME_NVM_IOCQES 4 @@ -814,6 +815,7 @@ enum nvme_admin_opcode { nvme_admin_security_send = 0x81, nvme_admin_security_recv = 0x82, nvme_admin_sanitize_nvm = 0x84, + nvme_admin_get_lba_status = 0x86, }; #define nvme_admin_opcode_name(opcode) { opcode, #opcode } @@ -840,7 +842,8 @@ enum nvme_admin_opcode { nvme_admin_opcode_name(nvme_admin_format_nvm), \ nvme_admin_opcode_name(nvme_admin_security_send), \ nvme_admin_opcode_name(nvme_admin_security_recv), \ - nvme_admin_opcode_name(nvme_admin_sanitize_nvm)) + nvme_admin_opcode_name(nvme_admin_sanitize_nvm), \ + nvme_admin_opcode_name(nvme_admin_get_lba_status)) enum { NVME_QUEUE_PHYS_CONTIG = (1 << 0), diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 8945aac31392..a19d845dd7eb 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -217,6 +217,8 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc, void wbc_detach_inode(struct writeback_control *wbc); void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page, size_t bytes); +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages, + enum wb_reason reason, struct wb_completion *done); void cgroup_writeback_umount(void); /** diff --git a/include/trace/events/iocost.h b/include/trace/events/iocost.h new file mode 100644 index 000000000000..7ecaa65b7106 --- /dev/null +++ b/include/trace/events/iocost.h @@ -0,0 +1,178 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM iocost + +struct ioc; +struct ioc_now; +struct ioc_gq; + +#if !defined(_TRACE_BLK_IOCOST_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_BLK_IOCOST_H + +#include <linux/tracepoint.h> + +TRACE_EVENT(iocost_iocg_activate, + + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, + u64 last_period, u64 cur_period, u64 vtime), + + TP_ARGS(iocg, path, now, last_period, cur_period, vtime), + + TP_STRUCT__entry ( + __string(devname, ioc_name(iocg->ioc)) + __string(cgroup, path) + __field(u64, now) + __field(u64, vnow) + __field(u64, vrate) + __field(u64, last_period) + __field(u64, cur_period) + __field(u64, last_vtime) + __field(u64, vtime) + __field(u32, weight) + __field(u32, inuse) + __field(u64, hweight_active) + __field(u64, hweight_inuse) + ), + + TP_fast_assign( + __assign_str(devname, ioc_name(iocg->ioc)); + __assign_str(cgroup, path); + __entry->now = now->now; + __entry->vnow = now->vnow; + __entry->vrate = now->vrate; + __entry->last_period = last_period; + __entry->cur_period = cur_period; + __entry->last_vtime = iocg->last_vtime; + __entry->vtime = vtime; + __entry->weight = iocg->weight; + __entry->inuse = iocg->inuse; + __entry->hweight_active = iocg->hweight_active; + __entry->hweight_inuse = iocg->hweight_inuse; + ), + + TP_printk("[%s:%s] now=%llu:%llu vrate=%llu " + "period=%llu->%llu vtime=%llu->%llu " + "weight=%u/%u hweight=%llu/%llu", + __get_str(devname), __get_str(cgroup), + __entry->now, __entry->vnow, __entry->vrate, + __entry->last_period, __entry->cur_period, + __entry->last_vtime, __entry->vtime, + __entry->inuse, __entry->weight, + __entry->hweight_inuse, __entry->hweight_active + ) +); + +DECLARE_EVENT_CLASS(iocg_inuse_update, + + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, + u32 old_inuse, u32 new_inuse, + u64 old_hw_inuse, u64 new_hw_inuse), + + TP_ARGS(iocg, path, now, old_inuse, new_inuse, + old_hw_inuse, new_hw_inuse), + + TP_STRUCT__entry ( + __string(devname, ioc_name(iocg->ioc)) + __string(cgroup, path) + __field(u64, now) + __field(u32, old_inuse) + __field(u32, new_inuse) + __field(u64, old_hweight_inuse) + __field(u64, new_hweight_inuse) + ), + + TP_fast_assign( + __assign_str(devname, ioc_name(iocg->ioc)); + __assign_str(cgroup, path); + __entry->now = now->now; + __entry->old_inuse = old_inuse; + __entry->new_inuse = new_inuse; + __entry->old_hweight_inuse = old_hw_inuse; + __entry->new_hweight_inuse = new_hw_inuse; + ), + + TP_printk("[%s:%s] now=%llu inuse=%u->%u hw_inuse=%llu->%llu", + __get_str(devname), __get_str(cgroup), __entry->now, + __entry->old_inuse, __entry->new_inuse, + __entry->old_hweight_inuse, __entry->new_hweight_inuse + ) +); + +DEFINE_EVENT(iocg_inuse_update, iocost_inuse_takeback, + + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, + u32 old_inuse, u32 new_inuse, + u64 old_hw_inuse, u64 new_hw_inuse), + + TP_ARGS(iocg, path, now, old_inuse, new_inuse, + old_hw_inuse, new_hw_inuse) +); + +DEFINE_EVENT(iocg_inuse_update, iocost_inuse_giveaway, + + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, + u32 old_inuse, u32 new_inuse, + u64 old_hw_inuse, u64 new_hw_inuse), + + TP_ARGS(iocg, path, now, old_inuse, new_inuse, + old_hw_inuse, new_hw_inuse) +); + +DEFINE_EVENT(iocg_inuse_update, iocost_inuse_reset, + + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, + u32 old_inuse, u32 new_inuse, + u64 old_hw_inuse, u64 new_hw_inuse), + + TP_ARGS(iocg, path, now, old_inuse, new_inuse, + old_hw_inuse, new_hw_inuse) +); + +TRACE_EVENT(iocost_ioc_vrate_adj, + + TP_PROTO(struct ioc *ioc, u64 new_vrate, u32 (*missed_ppm)[2], + u32 rq_wait_pct, int nr_lagging, int nr_shortages, + int nr_surpluses), + + TP_ARGS(ioc, new_vrate, missed_ppm, rq_wait_pct, nr_lagging, nr_shortages, + nr_surpluses), + + TP_STRUCT__entry ( + __string(devname, ioc_name(ioc)) + __field(u64, old_vrate) + __field(u64, new_vrate) + __field(int, busy_level) + __field(u32, read_missed_ppm) + __field(u32, write_missed_ppm) + __field(u32, rq_wait_pct) + __field(int, nr_lagging) + __field(int, nr_shortages) + __field(int, nr_surpluses) + ), + + TP_fast_assign( + __assign_str(devname, ioc_name(ioc)); + __entry->old_vrate = atomic64_read(&ioc->vtime_rate);; + __entry->new_vrate = new_vrate; + __entry->busy_level = ioc->busy_level; + __entry->read_missed_ppm = (*missed_ppm)[READ]; + __entry->write_missed_ppm = (*missed_ppm)[WRITE]; + __entry->rq_wait_pct = rq_wait_pct; + __entry->nr_lagging = nr_lagging; + __entry->nr_shortages = nr_shortages; + __entry->nr_surpluses = nr_surpluses; + ), + + TP_printk("[%s] vrate=%llu->%llu busy=%d missed_ppm=%u:%u rq_wait_pct=%u lagging=%d shortages=%d surpluses=%d", + __get_str(devname), __entry->old_vrate, __entry->new_vrate, + __entry->busy_level, + __entry->read_missed_ppm, __entry->write_missed_ppm, + __entry->rq_wait_pct, __entry->nr_lagging, __entry->nr_shortages, + __entry->nr_surpluses + ) +); + +#endif /* _TRACE_BLK_IOCOST_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index aa7f3aeac740..3a27335fce2c 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -176,6 +176,132 @@ static inline unsigned int __trace_wbc_assign_cgroup(struct writeback_control *w #endif /* CONFIG_CGROUP_WRITEBACK */ #endif /* CREATE_TRACE_POINTS */ +#ifdef CONFIG_CGROUP_WRITEBACK +TRACE_EVENT(inode_foreign_history, + + TP_PROTO(struct inode *inode, struct writeback_control *wbc, + unsigned int history), + + TP_ARGS(inode, wbc, history), + + TP_STRUCT__entry( + __array(char, name, 32) + __field(unsigned long, ino) + __field(unsigned int, cgroup_ino) + __field(unsigned int, history) + ), + + TP_fast_assign( + strncpy(__entry->name, dev_name(inode_to_bdi(inode)->dev), 32); + __entry->ino = inode->i_ino; + __entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc); + __entry->history = history; + ), + + TP_printk("bdi %s: ino=%lu cgroup_ino=%u history=0x%x", + __entry->name, + __entry->ino, + __entry->cgroup_ino, + __entry->history + ) +); + +TRACE_EVENT(inode_switch_wbs, + + TP_PROTO(struct inode *inode, struct bdi_writeback *old_wb, + struct bdi_writeback *new_wb), + + TP_ARGS(inode, old_wb, new_wb), + + TP_STRUCT__entry( + __array(char, name, 32) + __field(unsigned long, ino) + __field(unsigned int, old_cgroup_ino) + __field(unsigned int, new_cgroup_ino) + ), + + TP_fast_assign( + strncpy(__entry->name, dev_name(old_wb->bdi->dev), 32); + __entry->ino = inode->i_ino; + __entry->old_cgroup_ino = __trace_wb_assign_cgroup(old_wb); + __entry->new_cgroup_ino = __trace_wb_assign_cgroup(new_wb); + ), + + TP_printk("bdi %s: ino=%lu old_cgroup_ino=%u new_cgroup_ino=%u", + __entry->name, + __entry->ino, + __entry->old_cgroup_ino, + __entry->new_cgroup_ino + ) +); + +TRACE_EVENT(track_foreign_dirty, + + TP_PROTO(struct page *page, struct bdi_writeback *wb), + + TP_ARGS(page, wb), + + TP_STRUCT__entry( + __array(char, name, 32) + __field(u64, bdi_id) + __field(unsigned long, ino) + __field(unsigned int, memcg_id) + __field(unsigned int, cgroup_ino) + __field(unsigned int, page_cgroup_ino) + ), + + TP_fast_assign( + struct address_space *mapping = page_mapping(page); + struct inode *inode = mapping ? mapping->host : NULL; + + strncpy(__entry->name, dev_name(wb->bdi->dev), 32); + __entry->bdi_id = wb->bdi->id; + __entry->ino = inode ? inode->i_ino : 0; + __entry->memcg_id = wb->memcg_css->id; + __entry->cgroup_ino = __trace_wb_assign_cgroup(wb); + __entry->page_cgroup_ino = page->mem_cgroup->css.cgroup->kn->id.ino; + ), + + TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%u page_cgroup_ino=%u", + __entry->name, + __entry->bdi_id, + __entry->ino, + __entry->memcg_id, + __entry->cgroup_ino, + __entry->page_cgroup_ino + ) +); + +TRACE_EVENT(flush_foreign, + + TP_PROTO(struct bdi_writeback *wb, unsigned int frn_bdi_id, + unsigned int frn_memcg_id), + + TP_ARGS(wb, frn_bdi_id, frn_memcg_id), + + TP_STRUCT__entry( + __array(char, name, 32) + __field(unsigned int, cgroup_ino) + __field(unsigned int, frn_bdi_id) + __field(unsigned int, frn_memcg_id) + ), + + TP_fast_assign( + strncpy(__entry->name, dev_name(wb->bdi->dev), 32); + __entry->cgroup_ino = __trace_wb_assign_cgroup(wb); + __entry->frn_bdi_id = frn_bdi_id; + __entry->frn_memcg_id = frn_memcg_id; + ), + + TP_printk("bdi %s: cgroup_ino=%u frn_bdi_id=%u frn_memcg_id=%u", + __entry->name, + __entry->cgroup_ino, + __entry->frn_bdi_id, + __entry->frn_memcg_id + ) +); +#endif + DECLARE_EVENT_CLASS(writeback_write_inode_template, TP_PROTO(struct inode *inode, struct writeback_control *wbc), diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h index b0d15c73f6d7..1f2d8c81f0e0 100644 --- a/include/uapi/linux/raid/md_p.h +++ b/include/uapi/linux/raid/md_p.h @@ -329,6 +329,7 @@ struct mdp_superblock_1 { #define MD_FEATURE_JOURNAL 512 /* support write cache */ #define MD_FEATURE_PPL 1024 /* support PPL */ #define MD_FEATURE_MULTIPLE_PPLS 2048 /* support for multiple PPLs */ +#define MD_FEATURE_RAID0_LAYOUT 4096 /* layout is meaningful for RAID0 */ #define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \ |MD_FEATURE_RECOVERY_OFFSET \ |MD_FEATURE_RESHAPE_ACTIVE \ @@ -341,6 +342,7 @@ struct mdp_superblock_1 { |MD_FEATURE_JOURNAL \ |MD_FEATURE_PPL \ |MD_FEATURE_MULTIPLE_PPLS \ + |MD_FEATURE_RAID0_LAYOUT \ ) struct r5l_payload_header { diff --git a/lib/sg_split.c b/lib/sg_split.c index 9982c63d1063..60a0babebf2e 100644 --- a/lib/sg_split.c +++ b/lib/sg_split.c @@ -176,11 +176,13 @@ int sg_split(struct scatterlist *in, const int in_mapped_nents, * The order of these 3 calls is important and should be kept. */ sg_split_phys(splitters, nb_splits); - ret = sg_calculate_split(in, in_mapped_nents, nb_splits, skip, - split_sizes, splitters, true); - if (ret < 0) - goto err; - sg_split_mapped(splitters, nb_splits); + if (in_mapped_nents) { + ret = sg_calculate_split(in, in_mapped_nents, nb_splits, skip, + split_sizes, splitters, true); + if (ret < 0) + goto err; + sg_split_mapped(splitters, nb_splits); + } for (i = 0; i < nb_splits; i++) { out[i] = splitters[i].out_sg; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index e8e89158adec..d9daa3e422d0 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only #include <linux/wait.h> +#include <linux/rbtree.h> #include <linux/backing-dev.h> #include <linux/kthread.h> #include <linux/freezer.h> @@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; /* - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side - * locking. + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU + * reader side locking. */ DEFINE_SPINLOCK(bdi_lock); +static u64 bdi_id_cursor; +static struct rb_root bdi_tree = RB_ROOT; LIST_HEAD(bdi_list); /* bdi_wq serves all asynchronous writeback tasks */ @@ -615,13 +618,12 @@ out_put: } /** - * wb_get_create - get wb for a given memcg, create if necessary + * wb_get_lookup - get wb for a given memcg * @bdi: target bdi * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) - * @gfp: allocation mask to use * - * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to - * create one. The returned wb has its refcount incremented. + * Try to get the wb for @memcg_css on @bdi. The returned wb has its + * refcount incremented. * * This function uses css_get() on @memcg_css and thus expects its refcnt * to be positive on invocation. IOW, rcu_read_lock() protection on @@ -638,6 +640,39 @@ out_put: * each lookup. On mismatch, the existing wb is discarded and a new one is * created. */ +struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css) +{ + struct bdi_writeback *wb; + + if (!memcg_css->parent) + return &bdi->wb; + + rcu_read_lock(); + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + if (wb) { + struct cgroup_subsys_state *blkcg_css; + + /* see whether the blkcg association has changed */ + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys); + if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb))) + wb = NULL; + css_put(blkcg_css); + } + rcu_read_unlock(); + + return wb; +} + +/** + * wb_get_create - get wb for a given memcg, create if necessary + * @bdi: target bdi + * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) + * @gfp: allocation mask to use + * + * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to + * create one. See wb_get_lookup() for more details. + */ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp) @@ -650,20 +685,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, return &bdi->wb; do { - rcu_read_lock(); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); - if (wb) { - struct cgroup_subsys_state *blkcg_css; - - /* see whether the blkcg association has changed */ - blkcg_css = cgroup_get_e_css(memcg_css->cgroup, - &io_cgrp_subsys); - if (unlikely(wb->blkcg_css != blkcg_css || - !wb_tryget(wb))) - wb = NULL; - css_put(blkcg_css); - } - rcu_read_unlock(); + wb = wb_get_lookup(bdi, memcg_css); } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); return wb; @@ -859,9 +881,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id) } EXPORT_SYMBOL(bdi_alloc_node); +static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp) +{ + struct rb_node **p = &bdi_tree.rb_node; + struct rb_node *parent = NULL; + struct backing_dev_info *bdi; + + lockdep_assert_held(&bdi_lock); + + while (*p) { + parent = *p; + bdi = rb_entry(parent, struct backing_dev_info, rb_node); + + if (bdi->id > id) + p = &(*p)->rb_left; + else if (bdi->id < id) + p = &(*p)->rb_right; + else + break; + } + + if (parentp) + *parentp = parent; + return p; +} + +/** + * bdi_get_by_id - lookup and get bdi from its id + * @id: bdi id to lookup + * + * Find bdi matching @id and get it. Returns NULL if the matching bdi + * doesn't exist or is already unregistered. + */ +struct backing_dev_info *bdi_get_by_id(u64 id) +{ + struct backing_dev_info *bdi = NULL; + struct rb_node **p; + + spin_lock_bh(&bdi_lock); + p = bdi_lookup_rb_node(id, NULL); + if (*p) { + bdi = rb_entry(*p, struct backing_dev_info, rb_node); + bdi_get(bdi); + } + spin_unlock_bh(&bdi_lock); + + return bdi; +} + int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) { struct device *dev; + struct rb_node *parent, **p; if (bdi->dev) /* The driver needs to use separate queues per device */ return 0; @@ -877,7 +948,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); + + bdi->id = ++bdi_id_cursor; + + p = bdi_lookup_rb_node(bdi->id, &parent); + rb_link_node(&bdi->rb_node, parent, p); + rb_insert_color(&bdi->rb_node, &bdi_tree); + list_add_tail_rcu(&bdi->bdi_list, &bdi_list); + spin_unlock_bh(&bdi_lock); trace_writeback_bdi_register(bdi); @@ -918,6 +997,7 @@ EXPORT_SYMBOL(bdi_register_owner); static void bdi_remove_from_list(struct backing_dev_info *bdi) { spin_lock_bh(&bdi_lock); + rb_erase(&bdi->rb_node, &bdi_tree); list_del_rcu(&bdi->bdi_list); spin_unlock_bh(&bdi_lock); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9ec5e12486a7..597d58101872 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -87,6 +87,10 @@ int do_swap_account __read_mostly; #define do_swap_account 0 #endif +#ifdef CONFIG_CGROUP_WRITEBACK +static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); +#endif + /* Whether legacy memory+swap accounting is active */ static bool do_memsw_account(void) { @@ -4172,6 +4176,8 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, #ifdef CONFIG_CGROUP_WRITEBACK +#include <trace/events/writeback.h> + static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) { return wb_domain_init(&memcg->cgwb_domain, gfp); @@ -4255,6 +4261,130 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, } } +/* + * Foreign dirty flushing + * + * There's an inherent mismatch between memcg and writeback. The former + * trackes ownership per-page while the latter per-inode. This was a + * deliberate design decision because honoring per-page ownership in the + * writeback path is complicated, may lead to higher CPU and IO overheads + * and deemed unnecessary given that write-sharing an inode across + * different cgroups isn't a common use-case. + * + * Combined with inode majority-writer ownership switching, this works well + * enough in most cases but there are some pathological cases. For + * example, let's say there are two cgroups A and B which keep writing to + * different but confined parts of the same inode. B owns the inode and + * A's memory is limited far below B's. A's dirty ratio can rise enough to + * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid + * triggering background writeback. A will be slowed down without a way to + * make writeback of the dirty pages happen. + * + * Conditions like the above can lead to a cgroup getting repatedly and + * severely throttled after making some progress after each + * dirty_expire_interval while the underyling IO device is almost + * completely idle. + * + * Solving this problem completely requires matching the ownership tracking + * granularities between memcg and writeback in either direction. However, + * the more egregious behaviors can be avoided by simply remembering the + * most recent foreign dirtying events and initiating remote flushes on + * them when local writeback isn't enough to keep the memory clean enough. + * + * The following two functions implement such mechanism. When a foreign + * page - a page whose memcg and writeback ownerships don't match - is + * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning + * bdi_writeback on the page owning memcg. When balance_dirty_pages() + * decides that the memcg needs to sleep due to high dirty ratio, it calls + * mem_cgroup_flush_foreign() which queues writeback on the recorded + * foreign bdi_writebacks which haven't expired. Both the numbers of + * recorded bdi_writebacks and concurrent in-flight foreign writebacks are + * limited to MEMCG_CGWB_FRN_CNT. + * + * The mechanism only remembers IDs and doesn't hold any object references. + * As being wrong occasionally doesn't matter, updates and accesses to the + * records are lockless and racy. + */ +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, + struct bdi_writeback *wb) +{ + struct mem_cgroup *memcg = page->mem_cgroup; + struct memcg_cgwb_frn *frn; + u64 now = get_jiffies_64(); + u64 oldest_at = now; + int oldest = -1; + int i; + + trace_track_foreign_dirty(page, wb); + + /* + * Pick the slot to use. If there is already a slot for @wb, keep + * using it. If not replace the oldest one which isn't being + * written out. + */ + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { + frn = &memcg->cgwb_frn[i]; + if (frn->bdi_id == wb->bdi->id && + frn->memcg_id == wb->memcg_css->id) + break; + if (time_before64(frn->at, oldest_at) && + atomic_read(&frn->done.cnt) == 1) { + oldest = i; + oldest_at = frn->at; + } + } + + if (i < MEMCG_CGWB_FRN_CNT) { + /* + * Re-using an existing one. Update timestamp lazily to + * avoid making the cacheline hot. We want them to be + * reasonably up-to-date and significantly shorter than + * dirty_expire_interval as that's what expires the record. + * Use the shorter of 1s and dirty_expire_interval / 8. + */ + unsigned long update_intv = + min_t(unsigned long, HZ, + msecs_to_jiffies(dirty_expire_interval * 10) / 8); + + if (time_before64(frn->at, now - update_intv)) + frn->at = now; + } else if (oldest >= 0) { + /* replace the oldest free one */ + frn = &memcg->cgwb_frn[oldest]; + frn->bdi_id = wb->bdi->id; + frn->memcg_id = wb->memcg_css->id; + frn->at = now; + } +} + +/* issue foreign writeback flushes for recorded foreign dirtying events */ +void mem_cgroup_flush_foreign(struct bdi_writeback *wb) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10); + u64 now = jiffies_64; + int i; + + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { + struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; + + /* + * If the record is older than dirty_expire_interval, + * writeback on it has already started. No need to kick it + * off again. Also, don't start a new one if there's + * already one in flight. + */ + if (time_after64(frn->at, now - intv) && + atomic_read(&frn->done.cnt) == 1) { + frn->at = 0; + trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id); + cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, 0, + WB_REASON_FOREIGN_FLUSH, + &frn->done); + } + } +} + #else /* CONFIG_CGROUP_WRITEBACK */ static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) @@ -4777,6 +4907,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) struct mem_cgroup *memcg; unsigned int size; int node; + int __maybe_unused i; size = sizeof(struct mem_cgroup); size += nr_node_ids * sizeof(struct mem_cgroup_per_node *); @@ -4820,6 +4951,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) #endif #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) + memcg->cgwb_frn[i].done = + __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq); #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; @@ -4949,7 +5083,12 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css) static void mem_cgroup_css_free(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); + int __maybe_unused i; +#ifdef CONFIG_CGROUP_WRITEBACK + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) + wb_wait_for_completion(&memcg->cgwb_frn[i].done); +#endif if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) static_branch_dec(&memcg_sockets_enabled_key); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 1804f64ff43c..50055d2e4ea8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1667,6 +1667,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb, if (unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); + mem_cgroup_flush_foreign(wb); + /* * Calculate global domain's pos_ratio and select the * global dtc by default. @@ -2427,6 +2429,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) task_io_account_write(PAGE_SIZE); current->nr_dirtied++; this_cpu_inc(bdp_ratelimits); + + mem_cgroup_track_foreign_dirty(page, wb); } } diff --git a/tools/cgroup/iocost_coef_gen.py b/tools/cgroup/iocost_coef_gen.py new file mode 100644 index 000000000000..df17a2ae80e5 --- /dev/null +++ b/tools/cgroup/iocost_coef_gen.py @@ -0,0 +1,178 @@ +#!/usr/bin/env python3 +# +# Copyright (C) 2019 Tejun Heo <tj@kernel.org> +# Copyright (C) 2019 Andy Newell <newella@fb.com> +# Copyright (C) 2019 Facebook + +desc = """ +Generate linear IO cost model coefficients used by the blk-iocost +controller. If the target raw testdev is specified, destructive tests +are performed against the whole device; otherwise, on +./iocost-coef-fio.testfile. The result can be written directly to +/sys/fs/cgroup/io.cost.model. + +On high performance devices, --numjobs > 1 is needed to achieve +saturation. + +See Documentation/admin-guide/cgroup-v2.rst and block/blk-iocost.c +for more details. +""" + +import argparse +import re +import json +import glob +import os +import sys +import atexit +import shutil +import tempfile +import subprocess + +parser = argparse.ArgumentParser(description=desc, + formatter_class=argparse.RawTextHelpFormatter) +parser.add_argument('--testdev', metavar='DEV', + help='Raw block device to use for testing, ignores --testfile-size') +parser.add_argument('--testfile-size-gb', type=float, metavar='GIGABYTES', default=16, + help='Testfile size in gigabytes (default: %(default)s)') +parser.add_argument('--duration', type=int, metavar='SECONDS', default=120, + help='Individual test run duration in seconds (default: %(default)s)') +parser.add_argument('--seqio-block-mb', metavar='MEGABYTES', type=int, default=128, + help='Sequential test block size in megabytes (default: %(default)s)') +parser.add_argument('--seq-depth', type=int, metavar='DEPTH', default=64, + help='Sequential test queue depth (default: %(default)s)') +parser.add_argument('--rand-depth', type=int, metavar='DEPTH', default=64, + help='Random test queue depth (default: %(default)s)') +parser.add_argument('--numjobs', type=int, metavar='JOBS', default=1, + help='Number of parallel fio jobs to run (default: %(default)s)') +parser.add_argument('--quiet', action='store_true') +parser.add_argument('--verbose', action='store_true') + +def info(msg): + if not args.quiet: + print(msg) + +def dbg(msg): + if args.verbose and not args.quiet: + print(msg) + +# determine ('DEVNAME', 'MAJ:MIN') for @path +def dir_to_dev(path): + # find the block device the current directory is on + devname = subprocess.run(f'findmnt -nvo SOURCE -T{path}', + stdout=subprocess.PIPE, shell=True).stdout + devname = os.path.basename(devname).decode('utf-8').strip() + + # partition -> whole device + parents = glob.glob('/sys/block/*/' + devname) + if len(parents): + devname = os.path.basename(os.path.dirname(parents[0])) + rdev = os.stat(f'/dev/{devname}').st_rdev + return (devname, f'{os.major(rdev)}:{os.minor(rdev)}') + +def create_testfile(path, size): + global args + + if os.path.isfile(path) and os.stat(path).st_size == size: + return + + info(f'Creating testfile {path}') + subprocess.check_call(f'rm -f {path}', shell=True) + subprocess.check_call(f'touch {path}', shell=True) + subprocess.call(f'chattr +C {path}', shell=True) + subprocess.check_call( + f'pv -s {size} -pr /dev/urandom {"-q" if args.quiet else ""} | ' + f'dd of={path} count={size} ' + f'iflag=count_bytes,fullblock oflag=direct bs=16M status=none', + shell=True) + +def run_fio(testfile, duration, iotype, iodepth, blocksize, jobs): + global args + + eta = 'never' if args.quiet else 'always' + outfile = tempfile.NamedTemporaryFile() + cmd = (f'fio --direct=1 --ioengine=libaio --name=coef ' + f'--filename={testfile} --runtime={round(duration)} ' + f'--readwrite={iotype} --iodepth={iodepth} --blocksize={blocksize} ' + f'--eta={eta} --output-format json --output={outfile.name} ' + f'--time_based --numjobs={jobs}') + if args.verbose: + dbg(f'Running {cmd}') + subprocess.check_call(cmd, shell=True) + with open(outfile.name, 'r') as f: + d = json.loads(f.read()) + return sum(j['read']['bw_bytes'] + j['write']['bw_bytes'] for j in d['jobs']) + +def restore_elevator_nomerges(): + global elevator_path, nomerges_path, elevator, nomerges + + info(f'Restoring elevator to {elevator} and nomerges to {nomerges}') + with open(elevator_path, 'w') as f: + f.write(elevator) + with open(nomerges_path, 'w') as f: + f.write(nomerges) + + +args = parser.parse_args() + +missing = False +for cmd in [ 'findmnt', 'pv', 'dd', 'fio' ]: + if not shutil.which(cmd): + print(f'Required command "{cmd}" is missing', file=sys.stderr) + missing = True +if missing: + sys.exit(1) + +if args.testdev: + devname = os.path.basename(args.testdev) + rdev = os.stat(f'/dev/{devname}').st_rdev + devno = f'{os.major(rdev)}:{os.minor(rdev)}' + testfile = f'/dev/{devname}' + info(f'Test target: {devname}({devno})') +else: + devname, devno = dir_to_dev('.') + testfile = 'iocost-coef-fio.testfile' + testfile_size = int(args.testfile_size_gb * 2 ** 30) + create_testfile(testfile, testfile_size) + info(f'Test target: {testfile} on {devname}({devno})') + +elevator_path = f'/sys/block/{devname}/queue/scheduler' +nomerges_path = f'/sys/block/{devname}/queue/nomerges' + +with open(elevator_path, 'r') as f: + elevator = re.sub(r'.*\[(.*)\].*', r'\1', f.read().strip()) +with open(nomerges_path, 'r') as f: + nomerges = f.read().strip() + +info(f'Temporarily disabling elevator and merges') +atexit.register(restore_elevator_nomerges) +with open(elevator_path, 'w') as f: + f.write('none') +with open(nomerges_path, 'w') as f: + f.write('1') + +info('Determining rbps...') +rbps = run_fio(testfile, args.duration, 'read', + 1, args.seqio_block_mb * (2 ** 20), args.numjobs) +info(f'\nrbps={rbps}, determining rseqiops...') +rseqiops = round(run_fio(testfile, args.duration, 'read', + args.seq_depth, 4096, args.numjobs) / 4096) +info(f'\nrseqiops={rseqiops}, determining rrandiops...') +rrandiops = round(run_fio(testfile, args.duration, 'randread', + args.rand_depth, 4096, args.numjobs) / 4096) +info(f'\nrrandiops={rrandiops}, determining wbps...') +wbps = run_fio(testfile, args.duration, 'write', + 1, args.seqio_block_mb * (2 ** 20), args.numjobs) +info(f'\nwbps={wbps}, determining wseqiops...') +wseqiops = round(run_fio(testfile, args.duration, 'write', + args.seq_depth, 4096, args.numjobs) / 4096) +info(f'\nwseqiops={wseqiops}, determining wrandiops...') +wrandiops = round(run_fio(testfile, args.duration, 'randwrite', + args.rand_depth, 4096, args.numjobs) / 4096) +info(f'\nwrandiops={wrandiops}') +restore_elevator_nomerges() +atexit.unregister(restore_elevator_nomerges) +info('') + +print(f'{devno} rbps={rbps} rseqiops={rseqiops} rrandiops={rrandiops} ' + f'wbps={wbps} wseqiops={wseqiops} wrandiops={wrandiops}') diff --git a/tools/cgroup/iocost_monitor.py b/tools/cgroup/iocost_monitor.py new file mode 100644 index 000000000000..f79b23582a1d --- /dev/null +++ b/tools/cgroup/iocost_monitor.py @@ -0,0 +1,277 @@ +#!/usr/bin/env drgn +# +# Copyright (C) 2019 Tejun Heo <tj@kernel.org> +# Copyright (C) 2019 Facebook + +desc = """ +This is a drgn script to monitor the blk-iocost cgroup controller. +See the comment at the top of block/blk-iocost.c for more details. +For drgn, visit https://github.com/osandov/drgn. +""" + +import sys +import re +import time +import json +import math + +import drgn +from drgn import container_of +from drgn.helpers.linux.list import list_for_each_entry,list_empty +from drgn.helpers.linux.radixtree import radix_tree_for_each,radix_tree_lookup + +import argparse +parser = argparse.ArgumentParser(description=desc, + formatter_class=argparse.RawTextHelpFormatter) +parser.add_argument('devname', metavar='DEV', + help='Target block device name (e.g. sda)') +parser.add_argument('--cgroup', action='append', metavar='REGEX', + help='Regex for target cgroups, ') +parser.add_argument('--interval', '-i', metavar='SECONDS', type=float, default=1, + help='Monitoring interval in seconds') +parser.add_argument('--json', action='store_true', + help='Output in json') +args = parser.parse_args() + +def err(s): + print(s, file=sys.stderr, flush=True) + sys.exit(1) + +try: + blkcg_root = prog['blkcg_root'] + plid = prog['blkcg_policy_iocost'].plid.value_() +except: + err('The kernel does not have iocost enabled') + +IOC_RUNNING = prog['IOC_RUNNING'].value_() +NR_USAGE_SLOTS = prog['NR_USAGE_SLOTS'].value_() +HWEIGHT_WHOLE = prog['HWEIGHT_WHOLE'].value_() +VTIME_PER_SEC = prog['VTIME_PER_SEC'].value_() +VTIME_PER_USEC = prog['VTIME_PER_USEC'].value_() +AUTOP_SSD_FAST = prog['AUTOP_SSD_FAST'].value_() +AUTOP_SSD_DFL = prog['AUTOP_SSD_DFL'].value_() +AUTOP_SSD_QD1 = prog['AUTOP_SSD_QD1'].value_() +AUTOP_HDD = prog['AUTOP_HDD'].value_() + +autop_names = { + AUTOP_SSD_FAST: 'ssd_fast', + AUTOP_SSD_DFL: 'ssd_dfl', + AUTOP_SSD_QD1: 'ssd_qd1', + AUTOP_HDD: 'hdd', +} + +class BlkgIterator: + def blkcg_name(blkcg): + return blkcg.css.cgroup.kn.name.string_().decode('utf-8') + + def walk(self, blkcg, q_id, parent_path): + if not self.include_dying and \ + not (blkcg.css.flags.value_() & prog['CSS_ONLINE'].value_()): + return + + name = BlkgIterator.blkcg_name(blkcg) + path = parent_path + '/' + name if parent_path else name + blkg = drgn.Object(prog, 'struct blkcg_gq', + address=radix_tree_lookup(blkcg.blkg_tree, q_id)) + if not blkg.address_: + return + + self.blkgs.append((path if path else '/', blkg)) + + for c in list_for_each_entry('struct blkcg', + blkcg.css.children.address_of_(), 'css.sibling'): + self.walk(c, q_id, path) + + def __init__(self, root_blkcg, q_id, include_dying=False): + self.include_dying = include_dying + self.blkgs = [] + self.walk(root_blkcg, q_id, '') + + def __iter__(self): + return iter(self.blkgs) + +class IocStat: + def __init__(self, ioc): + global autop_names + + self.enabled = ioc.enabled.value_() + self.running = ioc.running.value_() == IOC_RUNNING + self.period_ms = ioc.period_us.value_() / 1_000 + self.period_at = ioc.period_at.value_() / 1_000_000 + self.vperiod_at = ioc.period_at_vtime.value_() / VTIME_PER_SEC + self.vrate_pct = ioc.vtime_rate.counter.value_() * 100 / VTIME_PER_USEC + self.busy_level = ioc.busy_level.value_() + self.autop_idx = ioc.autop_idx.value_() + self.user_cost_model = ioc.user_cost_model.value_() + self.user_qos_params = ioc.user_qos_params.value_() + + if self.autop_idx in autop_names: + self.autop_name = autop_names[self.autop_idx] + else: + self.autop_name = '?' + + def dict(self, now): + return { 'device' : devname, + 'timestamp' : str(now), + 'enabled' : str(int(self.enabled)), + 'running' : str(int(self.running)), + 'period_ms' : str(self.period_ms), + 'period_at' : str(self.period_at), + 'period_vtime_at' : str(self.vperiod_at), + 'busy_level' : str(self.busy_level), + 'vrate_pct' : str(self.vrate_pct), } + + def table_preamble_str(self): + state = ('RUN' if self.running else 'IDLE') if self.enabled else 'OFF' + output = f'{devname} {state:4} ' \ + f'per={self.period_ms}ms ' \ + f'cur_per={self.period_at:.3f}:v{self.vperiod_at:.3f} ' \ + f'busy={self.busy_level:+3} ' \ + f'vrate={self.vrate_pct:6.2f}% ' \ + f'params={self.autop_name}' + if self.user_cost_model or self.user_qos_params: + output += f'({"C" if self.user_cost_model else ""}{"Q" if self.user_qos_params else ""})' + return output + + def table_header_str(self): + return f'{"":25} active {"weight":>9} {"hweight%":>13} {"inflt%":>6} ' \ + f'{"dbt":>3} {"delay":>6} {"usages%"}' + +class IocgStat: + def __init__(self, iocg): + ioc = iocg.ioc + blkg = iocg.pd.blkg + + self.is_active = not list_empty(iocg.active_list.address_of_()) + self.weight = iocg.weight.value_() + self.active = iocg.active.value_() + self.inuse = iocg.inuse.value_() + self.hwa_pct = iocg.hweight_active.value_() * 100 / HWEIGHT_WHOLE + self.hwi_pct = iocg.hweight_inuse.value_() * 100 / HWEIGHT_WHOLE + self.address = iocg.value_() + + vdone = iocg.done_vtime.counter.value_() + vtime = iocg.vtime.counter.value_() + vrate = ioc.vtime_rate.counter.value_() + period_vtime = ioc.period_us.value_() * vrate + if period_vtime: + self.inflight_pct = (vtime - vdone) * 100 / period_vtime + else: + self.inflight_pct = 0 + + self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000 + self.use_delay = blkg.use_delay.counter.value_() + self.delay_ms = blkg.delay_nsec.counter.value_() / 1_000_000 + + usage_idx = iocg.usage_idx.value_() + self.usages = [] + self.usage = 0 + for i in range(NR_USAGE_SLOTS): + usage = iocg.usages[(usage_idx + i) % NR_USAGE_SLOTS].value_() + upct = usage * 100 / HWEIGHT_WHOLE + self.usages.append(upct) + self.usage = max(self.usage, upct) + + def dict(self, now, path): + out = { 'cgroup' : path, + 'timestamp' : str(now), + 'is_active' : str(int(self.is_active)), + 'weight' : str(self.weight), + 'weight_active' : str(self.active), + 'weight_inuse' : str(self.inuse), + 'hweight_active_pct' : str(self.hwa_pct), + 'hweight_inuse_pct' : str(self.hwi_pct), + 'inflight_pct' : str(self.inflight_pct), + 'debt_ms' : str(self.debt_ms), + 'use_delay' : str(self.use_delay), + 'delay_ms' : str(self.delay_ms), + 'usage_pct' : str(self.usage), + 'address' : str(hex(self.address)) } + for i in range(len(self.usages)): + out[f'usage_pct_{i}'] = str(self.usages[i]) + return out + + def table_row_str(self, path): + out = f'{path[-28:]:28} ' \ + f'{"*" if self.is_active else " "} ' \ + f'{self.inuse:5}/{self.active:5} ' \ + f'{self.hwi_pct:6.2f}/{self.hwa_pct:6.2f} ' \ + f'{self.inflight_pct:6.2f} ' \ + f'{min(math.ceil(self.debt_ms), 999):3} ' \ + f'{min(self.use_delay, 99):2}*'\ + f'{min(math.ceil(self.delay_ms), 999):03} ' + for u in self.usages: + out += f'{min(round(u), 999):03d}:' + out = out.rstrip(':') + return out + +# handle args +table_fmt = not args.json +interval = args.interval +devname = args.devname + +if args.json: + table_fmt = False + +re_str = None +if args.cgroup: + for r in args.cgroup: + if re_str is None: + re_str = r + else: + re_str += '|' + r + +filter_re = re.compile(re_str) if re_str else None + +# Locate the roots +q_id = None +root_iocg = None +ioc = None + +for i, ptr in radix_tree_for_each(blkcg_root.blkg_tree): + blkg = drgn.Object(prog, 'struct blkcg_gq', address=ptr) + try: + if devname == blkg.q.kobj.parent.name.string_().decode('utf-8'): + q_id = blkg.q.id.value_() + if blkg.pd[plid]: + root_iocg = container_of(blkg.pd[plid], 'struct ioc_gq', 'pd') + ioc = root_iocg.ioc + break + except: + pass + +if ioc is None: + err(f'Could not find ioc for {devname}'); + +# Keep printing +while True: + now = time.time() + iocstat = IocStat(ioc) + output = '' + + if table_fmt: + output += '\n' + iocstat.table_preamble_str() + output += '\n' + iocstat.table_header_str() + else: + output += json.dumps(iocstat.dict(now)) + + for path, blkg in BlkgIterator(blkcg_root, q_id): + if filter_re and not filter_re.match(path): + continue + if not blkg.pd[plid]: + continue + + iocg = container_of(blkg.pd[plid], 'struct ioc_gq', 'pd') + iocg_stat = IocgStat(iocg) + + if not filter_re and not iocg_stat.is_active: + continue + + if table_fmt: + output += '\n' + iocg_stat.table_row_str(path) + else: + output += '\n' + json.dumps(iocg_stat.dict(now, path)) + + print(output) + sys.stdout.flush() + time.sleep(interval) |