summaryrefslogtreecommitdiffstats
path: root/block (follow)
Commit message (Collapse)AuthorAgeFilesLines
* blk-mq: don't lose requests if a stopped queue restartsShaohua Li2015-05-041-0/+10
| | | | | | | | | | | | | | | | | | | Normally if driver is busy to dispatch a request the logic is like below: block layer: driver: __blk_mq_run_hw_queue a. blk_mq_stop_hw_queue b. rq add to ctx->dispatch later: 1. blk_mq_start_hw_queue 2. __blk_mq_run_hw_queue But it's possible step 1-2 runs between a and b. And since rq isn't in ctx->dispatch yet, step 2 will not run rq. The rq might get lost if there are no subsequent requests kick in. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: destroy bdi before blockdev is unregistered.NeilBrown2015-04-272-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit c4db59d31e39ea067c32163ac961e9c80198fd37 Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37 Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block:bounce: fix call inc_|dec_zone_page_state on different pages confuse ↵Wang YanQing2015-04-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | value of NR_BOUNCE Commit d2c5e30c9a1420902262aa923794d2ae4e0bc391 ("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter") convert statistic of nr_bounce to per zone and one global value in vm_stat, but it call inc_|dec_zone_page_state on different pages, then different zones, and cause us to get unexpected value of NR_BOUNCE. Below is the result on my machine: Mar 2 09:26:08 udknight kernel: [144766.778265] Mem-Info: Mar 2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu: Mar 2 09:26:08 udknight kernel: [144766.778268] CPU 0: hi: 0, btch: 1 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778269] CPU 1: hi: 0, btch: 1 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu: Mar 2 09:26:08 udknight kernel: [144766.778271] CPU 0: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778273] CPU 1: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu: Mar 2 09:26:08 udknight kernel: [144766.778275] CPU 0: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778276] CPU 1: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0 Mar 2 09:26:08 udknight kernel: [144766.778279] active_file:105085 inactive_file:139432 isolated_file:0 Mar 2 09:26:08 udknight kernel: [144766.778279] unevictable:653 dirty:0 writeback:0 unstable:0 Mar 2 09:26:08 udknight kernel: [144766.778279] free:178957 slab_reclaimable:6419 slab_unreclaimable:9966 Mar 2 09:26:08 udknight kernel: [144766.778279] mapped:4426 shmem:305277 pagetables:784 bounce:0 Mar 2 09:26:08 udknight kernel: [144766.778279] free_cma:0 Mar 2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Mar 2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754 Mar 2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes Mar 2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451 Mar 2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Mar 2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0 You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem and global. This patch fix it. Signed-off-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* elevator: fix double release of elevator moduleChao Yu2015-04-231-5/+1
| | | | | | | | | | | | | | | | | | Our issue is descripted in below call path: ->elevator_init ->elevator_init_fn ->{cfq,deadline,noop}_init_queue ->elevator_alloc ->kzalloc_node fail to call kzalloc_node and then put module in elevator_alloc; fail to call elevator_init_fn and then put module again in elevator_init. Remove elevator_put invoking in error path of elevator_alloc to avoid double release issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: fix CPU hotplug handlingMing Lei2015-04-231-21/+13
| | | | | | | | | | | | | | | | | | hctx->tags has to be set as NULL in case that it is to be unmapped no matter if set->tags[hctx->queue_num] is NULL or not in blk_mq_map_swqueue() because shared tags can be freed already from another request queue. The same situation has to be considered during handling CPU online too. Unmapped hw queue can be remapped after CPU topo is changed, so we need to allocate tags for the hw queue in blk_mq_map_swqueue(). Then tags allocation for hw queue can be removed in hctx cpu online notifier, and it is reasonable to do that after mapping is updated. Cc: <stable@vger.kernel.org> Reported-by: Dongsu Park <dongsu.park@profitbricks.com> Tested-by: Dongsu Park <dongsu.park@profitbricks.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: fix race between timeout and CPU hotplugMing Lei2015-04-231-3/+13
| | | | | | | | | | | | | | | | | Firstly during CPU hotplug, even queue is freezed, timeout handler still may come and access hctx->tags, which may cause use after free, so this patch deactivates timeout handler inside CPU hotplug notifier. Secondly, tags can be shared by more than one queues, so we have to check if the hctx has been unmapped, otherwise still use-after-free on tags can be triggered. Cc: <stable@vger.kernel.org> Reported-by: Dongsu Park <dongsu.park@profitbricks.com> Tested-by: Dongsu Park <dongsu.park@profitbricks.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: fix iteration of busy bitmapJens Axboe2015-04-171-3/+3
| | | | | | | | | | Commit 889fa31f00b2 was a bit too eager in reducing the loop count, so we ended up missing queues in some configurations. Ensure that our division rounds up, so that's not the case. Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: 889fa31f00b2 ("blk-mq: reduce unnecessary software queue looping") Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds2015-04-173-29/+58
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer core bits from Jens Axboe: "This is the core pull request for 4.1. Not a lot of stuff in here for this round, mostly little fixes or optimizations. This pull request contains: - An optimization that speeds up queue runs on blk-mq, especially for the case where there's a large difference between nr_cpu_ids and the actual mapped software queues on a hardware queue. From Chong Yuan. - Honor node local allocations for requests on legacy devices. From David Rientjes. - Cleanup of blk_mq_rq_to_pdu() from me. - exit_aio() fixup from me, greatly speeding up exiting multiple IO contexts off exit_group(). For my particular test case, fio exit took ~6 seconds. A typical case of both exposing RCU grace periods to user space, and serializing exit of them. - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only wait if __GFP_WAIT is set. From Keith Busch. - blk-mq exports and two added helpers from Mike Snitzer, which will be used by the dm-mq code. - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang" * 'for-4.1/core' of git://git.kernel.dk/linux-block: blk-mq: reduce unnecessary software queue looping aio: fix serial draining in exit_aio() blk-mq: cleanup blk_mq_rq_to_pdu() blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue() block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set() block: allocate request memory local to request queue blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set blk-mq: export blk_mq_run_hw_queues blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
| * blk-mq: reduce unnecessary software queue loopingChong Yuan2015-04-151-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many ctxs assigned to one hctx, they will all loop hctx->ctx_map.map_size times. Here hctx->ctx_map.map_size is a const ALIGN(nr_cpu_ids, 8) / 8. Especially, flush_busy_ctxs() is in hot code path. And it's unnecessary. Change ->map_size to contain the actually mapped software queues, so we only loop for as many iterations as we have to. And remove cpumask setting and nr_ctx count in blk_mq_init_cpu_queues() since they are all re-done in blk_mq_map_swqueue(). blk_mq_map_swqueue(). Signed-off-by: Chong Yuan <chong.yuan@memblaze.com> Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com> Updated by me for formatting and commenting. Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()Wei Fang2015-03-301-4/+1
| | | | | | | | | | | | | | Don't assign ->rq_timeout twice. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: remove redundant check about 'set->nr_hw_queues' in ↵Xiaoguang Wang2015-03-301-1/+1
| | | | | | | | | | | | | | | | | | | | blk_mq_alloc_tag_set() At the beginning of blk_mq_alloc_tag_set(), we have already checked whether 'set->nr_hw_queues' is zero, so here remove this redundant check. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: allocate request memory local to request queueDavid Rientjes2015-03-251-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_init_rl() allocates a mempool using mempool_create_node() with node local memory. This only allocates the mempool and element list locally to the requeue queue node. What we really want to do is allocate the request itself local to the queue. To do this, we need our own alloc and free functions that will allocate from request_cachep and pass the request queue node in to prefer node local memory. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't setKeith Busch2015-03-131-3/+6
| | | | | | | | | | | | | | | | | | Return -EBUSY if we're unable to enter a queue immediately when allocating a blk-mq request without __GFP_WAIT. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: export blk_mq_run_hw_queuesMike Snitzer2015-03-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument, and export it. DM's suspend support must be able to run the queue without starting stopped hw queues. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_diskMike Snitzer2015-03-132-10/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a variant of blk_mq_init_queue that allows a previously allocated queue to be initialized. blk_mq_init_allocated_queue models blk_init_allocated_queue -- which was also created for DM's use. DM's approach to device creation requires a placeholder request_queue be allocated for use with alloc_dev() but the decision about what type of request_queue will be ultimately created is deferred until all component devices referenced in the DM table are processed to determine the table type (request-based, blk-mq request-based, or bio-based). Also, because of DM's late finalization of the request_queue type the call to blk_mq_register_disk() doesn't happen during alloc_dev(). Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir once the blk-mq queue is fully allocated. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-linus-1' of ↵Linus Torvalds2015-04-152-10/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: "Part one: - struct filename-related cleanups - saner iov_iter_init() replacements (and switching the syscalls to use of those) - ntfs switch to ->write_iter() (Anton) - aio cleanups and splitting iocb into common and async parts (Christoph) - assorted fixes (me, bfields, Andrew Elble) There's a lot more, including the completion of switchover to ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags race fixes, etc, but that goes after #for-davem merge. David has pulled it, and once it's in I'll send the next vfs pull request" * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits) sg_start_req(): use import_iovec() sg_start_req(): make sure that there's not too many elements in iovec blk_rq_map_user(): use import_single_range() sg_io(): use import_iovec() process_vm_access: switch to {compat_,}import_iovec() switch keyctl_instantiate_key_common() to iov_iter switch {compat_,}do_readv_writev() to {compat_,}import_iovec() aio_setup_vectored_rw(): switch to {compat_,}import_iovec() vmsplice_to_user(): switch to import_iovec() kill aio_setup_single_vector() aio: simplify arguments of aio_setup_..._rw() aio: lift iov_iter_init() into aio_setup_..._rw() lift iov_iter into {compat_,}do_readv_writev() NFS: fix BUG() crash in notify_change() with patch to chown_common() dcache: return -ESTALE not -EBUSY on distributed fs race NTFS: Version 2.1.32 - Update file write from aio_write to write_iter. VFS: Add iov_iter_fault_in_multipages_readable() drop bogus check in file_open_root() switch security_inode_getattr() to struct path * constify tomoyo_realpath_from_path() ...
| * | blk_rq_map_user(): use import_single_range()Al Viro2015-04-121-3/+3
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | sg_io(): use import_iovec()Al Viro2015-04-121-7/+5
| | | | | | | | | | | | | | | | | | ... and don't skip access_ok() validation. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | blk-mq: initialize 'struct request' and associated data to zeroLinus Torvalds2015-04-111-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jan Engelhardt reports a strange oops with an invalid ->sense_buffer pointer in scsi_init_cmd_errh() with the blk-mq code. The sense_buffer pointer should have been initialized by the call to scsi_init_request() from blk_mq_init_rq_map(), but there seems to be some non-repeatable memory corruptor. This patch makes sure we initialize the whole struct request allocation (and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by using __GFP_ZERO in the allocation. The old code initialized a couple of individual fields, leaving the rest undefined (although many of them are then initialized in later phases, like blk_mq_rq_ctx_init() etc. It's not entirely clear why this matters, but it's the rigth thing to do regardless, and with 4.0 imminent this is the defensive "let's just make sure everything is initialized properly" patch. Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | block: fix blk_stack_limits() regression due to lcm() changeMike Snitzer2015-03-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n") caused blk_stack_limits() to not properly stack queue_limits for stacked devices (e.g. DM). Fix this regression by establishing lcm_not_zero() and switching blk_stack_limits() over to using it. DM uses blk_set_stacking_limits() to establish the initial top-level queue_limits that are then built up based on underlying devices' limits using blk_stack_limits(). In the case of optimal_io_size (io_opt) blk_set_stacking_limits() establishes a default value of 0. With commit 69c953c, lcm(0, n) is no longer n, which compromises proper stacking of the underlying devices' io_opt. Test: $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536 $ cat /sys/block/sde/queue/optimal_io_size 786432 $ dmsetup create node --table "0 100 linear /dev/sde 0" Before this fix: $ cat /sys/block/dm-5/queue/optimal_io_size 0 After this fix: $ cat /sys/block/dm-5/queue/optimal_io_size 786432 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+ Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Fix bug in blk_rq_merge_okWenbo Wang2015-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the right array index to reference the last element of rq->biotail->bi_io_vec[] Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com> Reviewed-by: Chong Yuan <chong.yuan@memblaze.com> Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists") Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blkmq: Fix NULL pointer deref when all reserved tags inSam Bradshaw2015-03-191-2/+4
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When allocating from the reserved tags pool, bt_get() is called with a NULL hctx. If all tags are in use, the hw queue is kicked to push out any pending IO, potentially freeing tags, and tag allocation is retried. The problem is that blk_mq_run_hw_queue() doesn't check for a NULL hctx. So we avoid it with a simple NULL hctx test. Tested by hammering mtip32xx with concurrent smartctl/hdparm. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Selvan Mani <smani@micron.com> Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()") Cc: stable@kernel.org Added appropriate comment. Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: fix use of incorrect goto label in blk_mq_init_queue error pathMike Snitzer2015-03-131-3/+3
|/ | | | | | | | | | If percpu_ref_init() fails the allocated q and hctxs must get cleaned up; using 'err_map' doesn't allow that to happen. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Ming Lei <ming.lei@canonical.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-throttle: check stats_cpu before reading it from sysfsThadeu Lima de Souza Cascardo2015-02-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reading blkio.throttle.io_serviced in a recently created blkio cgroup, it's possible to race against the creation of a throttle policy, which delays the allocation of stats_cpu. Like other functions in the throttle code, just checking for a NULL stats_cpu prevents the following oops caused by that race. [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020 [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1] [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4 [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5 [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000 [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980 [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0) [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 42008884 XER: 20000000 [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0 GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000 GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000 GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808 GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000 GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80 GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850 [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180 [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180 [ 1137.734943] Call Trace: [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable) [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0 [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70 [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150 [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50 [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510 [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200 [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80 [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0 [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100 [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98 [ 1137.735383] Instruction dump: [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24 [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018 And here is one code that allows to easily reproduce this, although this has first been found by running docker. void run(pid_t pid) { int n; int status; int fd; char *buffer; buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE); n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid); fd = open(CGPATH "/test/tasks", O_WRONLY); write(fd, buffer, n); close(fd); if (fork() > 0) { fd = open("/dev/sda", O_RDONLY | O_DIRECT); read(fd, buffer, 512); close(fd); wait(&status); } else { fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY); n = read(fd, buffer, BUFFER_SIZE); close(fd); } free(buffer); exit(0); } void test(void) { int status; mkdir(CGPATH "/test", 0666); if (fork() > 0) wait(&status); else run(getpid()); rmdir(CGPATH "/test"); } int main(int argc, char **argv) { int i; for (i = 0; i < NR_TESTS; i++) test(); return 0; } Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-3.20/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2015-02-121-47/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver changes from Jens Axboe: "This contains: - The 4k/partition fixes for brd from Boaz/Matthew. - A few xen front/back block fixes from David Vrabel and Roger Pau Monne. - Floppy changes from Takashi, cleaning the device file creation. - Switching libata to use the new blk-mq tagging policy, removing code (and a suboptimal implementation) from libata. This will throw you a merge conflict, since a bug in the original libata tagging code was fixed since this code was branched. Trivial. From Shaohua. - Conversion of loop to blk-mq, from Ming Lei. - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra. He claims it improves on unreadable code, which will cost him a beer. - Maintainer update or NDB, now handled by Markus Pargmann. - NVMe: - Optimization from me that avoids a kmalloc/kfree per IO for smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU overhead. - Removal of (now) dead RCU code, a relic from before NVMe was converted to blk-mq" * 'for-3.20/drivers' of git://git.kernel.dk/linux-block: xen-blkback: default to X86_32 ABI on x86 xen-blkfront: fix accounting of reqs when migrating xen-blkback,xen-blkfront: add myself as maintainer block: Simplify bsg complete all floppy: Avoid manual call of device_create_file() NVMe: avoid kmalloc/kfree for smaller IO MAINTAINERS: Update NBD maintainer libata: make sata_sil24 use fifo tag allocator libata: move sas ata tag allocation to libata-scsi.c libata: use blk taging NVMe: within nvme_free_queues(), delete RCU sychro/deferred free null_blk: suppress invalid partition info brd: Request from fdisk 4k alignment brd: Fix all partitions BUGs axonram: Fix bug in direct_access loop: add blk-mq.h include block: loop: don't handle REQ_FUA explicitly block: loop: introduce lo_discard() and lo_req_flush() block: loop: say goodby to bio block: loop: improve performance via blk-mq
| * block: Simplify bsg complete allPeter Zijlstra2015-02-041-47/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It took me a few tries to figure out what this code did; lets rewrite it into a more regular form. The thing that makes this one 'special' is the BSG_F_BLOCK flag, if that is not set we're not supposed/allowed to block and should spin wait for completion. The (new) io_wait_event() will never see a false condition in case of the spinning and we will therefore not block. Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-blockLinus Torvalds2015-02-1213-486/+372
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block IO changes from Jens Axboe: "This contains: - A series from Christoph that cleans up and refactors various parts of the REQ_BLOCK_PC handling. Contributions in that series from Dongsu Park and Kent Overstreet as well. - CFQ: - A bug fix for cfq for realtime IO scheduling from Jeff Moyer. - A stable patch fixing a potential crash in CFQ in OOM situations. From Konstantin Khlebnikov. - blk-mq: - Add support for tag allocation policies, from Shaohua. This is a prep patch enabling libata (and other SCSI parts) to use the blk-mq tagging, instead of rolling their own. - Various little tweaks from Keith and Mike, in preparation for DM blk-mq support. - Minor little fixes or tweaks from me. - A double free error fix from Tony Battersby. - The partition 4k issue fixes from Matthew and Boaz. - Add support for zero+unprovision for blkdev_issue_zeroout() from Martin" * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits) block: remove unused function blk_bio_map_sg block: handle the null_mapped flag correctly in blk_rq_map_user_iov blk-mq: fix double-free in error path block: prevent request-to-request merging with gaps if not allowed blk-mq: make blk_mq_run_queues() static dm: fix multipath regression due to initializing wrong request cfq-iosched: handle failure of cfq group allocation block: Quiesce zeroout wrapper block: rewrite and split __bio_copy_iov() block: merge __bio_map_user_iov into bio_map_user_iov block: merge __bio_map_kern into bio_map_kern block: pass iov_iter to the BLOCK_PC mapping functions block: add a helper to free bio bounce buffer pages block: use blk_rq_map_user_iov to implement blk_rq_map_user block: simplify bio_map_kern block: mark blk-mq devices as stackable block: keep established cmd_flags when cloning into a blk-mq request block: add blk-mq support to blk_insert_cloned_request() block: require blk_rq_prep_clone() be given an initialized clone request blk-mq: add tag allocation policy ...
| * | block: remove unused function blk_bio_map_sgChristoph Hellwig2015-02-111-29/+0
| | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: handle the null_mapped flag correctly in blk_rq_map_user_iovChristoph Hellwig2015-02-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tape drivers (and the sg driver in a special case that doesn't matter here) use the null_mapped flag to tell blk_rq_map_user to not copy around any data into or out of the bounce buffers. blk_rq_map_user_iov never got that treatment, which didn't matter until I refactored blk_rq_map_user to be implemented in terms of blk_rq_map_user_iov. Signed-off-by: Christoph Hellwig <hch@lst.de> Fixes: ddad8dd0a162 ("block: use blk_rq_map_user_iov to implement blk_rq_map_user") Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: fix double-free in error pathTony Battersby2015-02-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the allocation of bt->bs fails, then bt->map can be freed twice, once in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in blk_mq_init_bitmap_tags() -> bt_free(). Fix by setting the pointer to NULL after the first free. Cc: <stable@vger.kernel.org> Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: prevent request-to-request merging with gaps if not allowedKeith Busch2015-02-111-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the queue has SG_GAPS set, we must not merge across an sg gap. This is caught for the bio case, but currently not for the more rare case of merging two requests directly. Signed-off-by: Keith Busch <keith.busch@intel.com> Cut the dm bits, those will go through the dm tree, and fixed the test_bit() test. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: make blk_mq_run_queues() staticJens Axboe2015-02-101-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | We no longer use it outside of blk-mq.c, so we can make it static and stop exporting it. Additionally, kill the 'async' argument, as there's only one used of it. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | cfq-iosched: handle failure of cfq group allocationKonstantin Khlebnikov2015-02-091-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC. In cfq_find_alloc_queue() possible allocation failure is not handled. As a result kernel oopses on NULL pointer dereference when cfq_link_cfqq_cfqg() calls cfqg_get() for NULL pointer. Bug was introduced in v3.5 in commit cd1604fab4f9 ("blkcg: factor out blkio_group creation"). Prior to that commit cfq group lookup had returned pointer to root group as fallback. This patch handles this error using existing fallback oom_cfqq. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Fixes: cd1604fab4f9 ("blkcg: factor out blkio_group creation") Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: Quiesce zeroout wrapperMartin K. Petersen2015-02-051-19/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkdev_issue_zeroout() printed a warning if a device failed a discard or write same request despite advertising support for these. That's fine for SCSI since we'll disable these commands if we get an error back from the disk saying that they are not supported. And consequently the warning only gets printed once. There are other types of block devices that support discard, however, and these may return -EOPNOTSUPP for each command but leave discard enabled in the queue limits. This will cause a warning message for every blkdev_issue_zeroout() invocation. Remove the offending warning messages. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: rewrite and split __bio_copy_iov()Dongsu Park2015-02-051-34/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rewrite __bio_copy_iov using the copy_page_{from,to}_iter helpers, and split it into two simpler functions. This commit should contain only literal replacements, without functional changes. Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com> [hch: removed the __bio_copy_iov wrapper] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: merge __bio_map_user_iov into bio_map_user_iovChristoph Hellwig2015-02-052-37/+21
| | | | | | | | | | | | | | | | | | | | | | | | And also remove the unused bdev argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: merge __bio_map_kern into bio_map_kernChristoph Hellwig2015-02-051-33/+17
| | | | | | | | | | | | | | | | | | | | | | | | This saves a little code, and allow to simplify the error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: pass iov_iter to the BLOCK_PC mapping functionsKent Overstreet2015-02-053-111/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make use of a new interface provided by iov_iter, backed by scatter-gather list of iovec, instead of the old interface based on sg_iovec. Also use iov_iter_advance() instead of manual iteration. This commit should contain only literal replacements, without functional changes. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com> [hch: fixed to do a deep clone of the iov_iter, and to properly use the iov_iter direction] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: add a helper to free bio bounce buffer pagesChristoph Hellwig2015-02-051-32/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code sniplet to walk all bio_vecs and free their pages is opencoded in way to many places, so factor it into a helper. Also convert the slightly more complex cases in bio_kern_endio and __bio_copy_iov where we break the freeing from an existing loop into a separate one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: use blk_rq_map_user_iov to implement blk_rq_map_userChristoph Hellwig2015-02-052-176/+14
| | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: simplify bio_map_kernChristoph Hellwig2015-02-051-16/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just open code the trivial mapping from a kernel virtual address to a bio instead of going through the complex user address mapping machinery. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: keep established cmd_flags when cloning into a blk-mq requestKeith Busch2015-01-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_alloc_request() may establish REQ_MQ_INFLIGHT in addition to incrementing the hctx->nr_active count. Any cmd_flags that are established in the newly allocated clone request must be preserved in addition to the cmd_flags that are later copied over from the original request as part of blk_rq_prep_clone(). Otherwise, if REQ_MQ_INFLIGHT isn't set in the clone request the hctx->nr_active count won't get decremented via blk_mq_free_request(). The only consumer of blk_rq_prep_clone() is request-based DM, which uses blk_rq_init() prior to calling blk_rq_prep_clone() for the non-blk-mq case. Given the cloned request's cmd_flags will be 0 it is safe to OR them with the original request's cmd_flags for both the non-blk-mq and blk-mq cases. Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: add blk-mq support to blk_insert_cloned_request()Keith Busch2015-01-281-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | If the request passed to blk_insert_cloned_request() was allocated by a blk-mq device it must be submitted using blk_mq_insert_request(). Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: require blk_rq_prep_clone() be given an initialized clone requestKeith Busch2015-01-281-2/+0
| |/ | | | | | | | | | | | | | | | | | | Prepare to allow blk_rq_prep_clone() to accept clone requests that were allocated from blk-mq request queues. As such the blk_rq_prep_clone() caller must first initialize the clone request. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: add tag allocation policyShaohua Li2015-01-233-17/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the blk-mq part to support tag allocation policy. The default allocation policy isn't changed (though it's not a strict FIFO). The new policy is round-robin for libata. But it's a try-best implementation. If multiple tasks are competing, the tags returned will be mixed (which is unavoidable even with !mq, as requests from different tasks can be mixed in queue) Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: support different tag allocation policyShaohua Li2015-01-231-8/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The libata tag allocation is using a round-robin policy. Next patch will make libata use block generic tag allocation, so let's add a policy to tag allocation. Currently two policies: FIFO (default) and round-robin. Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Remove annoying "unknown partition table" messageBoaz Harrosh2015-01-221-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As Christoph put it: Can we just get rid of the warnings? It's fairly annoying as devices without partitions are perfectly fine and very useful. Me too I see this message every VM boot for ages on all my devices. Would love to just remove it. For me a partition-table is only needed for a booting BIOS, grub, and stuff. CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Add discard flag to blkdev_issue_zeroout() functionMartin K. Petersen2015-01-212-5/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkdev_issue_discard() will zero a given block range. This is done by way of explicit writing, thus provisioning or allocating the blocks on disk. There are use cases where the desired behavior is to zero the blocks but unprovision them if possible. The blocks must deterministically contain zeroes when they are subsequently read back. This patch adds a flag to blkdev_issue_zeroout() that provides this variant. If the discard flag is set and a block device guarantees discard_zeroes_data we will use REQ_DISCARD to clear the block range. If the device does not support discard_zeroes_data or if the discard request fails we will fall back to first REQ_WRITE_SAME and then a regular REQ_WRITE. Also update the callers of blkdev_issue_zero() to reflect the new flag and make sb_issue_zeroout() prefer the discard approach. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * cfq-iosched: fix incorrect filing of rt async cfqqJeff Moyer2015-01-211-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi, If you can manage to submit an async write as the first async I/O from the context of a process with realtime scheduling priority, then a cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It ends up in the best effort array, but actually has realtime I/O scheduling priority set in cfqq->ioprio. The reason is that cfq_get_queue assumes the default scheduling class and priority when there is no information present (i.e. when the async cfqq is created): static struct cfq_queue * cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic, struct bio *bio, gfp_t gfp_mask) { const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio); const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio); cic->ioprio starts out as 0, which is "invalid". So, class of 0 (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so: async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio); static struct cfq_queue ** cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio) { switch (ioprio_class) { case IOPRIO_CLASS_RT: return &cfqd->async_cfqq[0][ioprio]; case IOPRIO_CLASS_NONE: ioprio = IOPRIO_NORM; /* fall through */ case IOPRIO_CLASS_BE: return &cfqd->async_cfqq[1][ioprio]; case IOPRIO_CLASS_IDLE: return &cfqd->async_idle_cfqq; default: BUG(); } } Here, instead of returning a class mapped from the process' scheduling priority, we get back the bucket associated with IOPRIO_CLASS_BE. Now, there is no queue allocated there yet, so we create it: cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask); That function ends up doing this: cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync); cfq_init_prio_data(cfqq, cic); cfq_init_cfqq marks the priority as having changed. Then, cfq_init_prio data does this: ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio); switch (ioprio_class) { default: printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class); case IOPRIO_CLASS_NONE: /* * no prio set, inherit CPU scheduling settings */ cfqq->ioprio = task_nice_ioprio(tsk); cfqq->ioprio_class = task_nice_ioclass(tsk); break; So we basically have two code paths that treat IOPRIO_CLASS_NONE differently, which results in an RT async cfqq filed into a best effort bucket. Attached is a patch which fixes the problem. I'm not sure how to make it cleaner. Suggestions would be welcome. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: fix false negative out-of-tags conditionJens Axboe2015-01-141-17/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The blk-mq tagging tries to maintain some locality between CPUs and the tags issued. The tags are split into groups of words, and the words may not be fully populated. When searching for a new free tag, blk-mq may look at partial words, hence it passes in an offset/size to find_next_zero_bit(). However, it does that wrong, the size must always be the full length of the number of tags in that word, otherwise we'll potentially miss some near the end. Another issue is when __bt_get() goes from one word set to the next. It bumps the index, but not the last_tag associated with the previous index. Bump that to be in the range of the new word. Finally, clean up __bt_get() and __bt_get_word() a bit and get rid of the goto in there, and the unnecessary 'wrap' variable. Signed-off-by: Jens Axboe <axboe@fb.com>