summaryrefslogtreecommitdiffstats
path: root/drivers/block/nbd.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-blockLinus Torvalds2021-11-091-23/+21
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block driver updates from Jens Axboe: - Last series adding error handling support for add_disk() in drivers. After this one, and once the SCSI side has been merged, we can finally annotate add_disk() as must_check. (Luis) - bcache fixes (Coly) - zram fixes (Ming) - ataflop locking fix (Tetsuo) - nbd fixes (Ye, Yu) - MD merge via Song - Cleanup (Yang) - sysfs fix (Guoqing) - Misc fixes (Geert, Wu, luo) * tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits) bcache: Revert "bcache: use bvec_virt" ataflop: Add missing semicolon to return statement floppy: address add_disk() error handling on probe ataflop: address add_disk() error handling on probe block: update __register_blkdev() probe documentation ataflop: remove ataflop_probe_lock mutex mtd/ubi/block: add error handling support for add_disk() block/sunvdc: add error handling support for add_disk() z2ram: add error handling support for add_disk() nvdimm/pmem: use add_disk() error handling nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned nvdimm/blk: add error handling support for add_disk() nvdimm/blk: avoid calling del_gendisk() on early failures nvdimm/btt: add error handling support for add_disk() nvdimm/btt: use goto error labels on btt_blk_init() loop: Remove duplicate assignments drbd: Fix double free problem in drbd_create_device nvdimm/btt: do not call del_gendisk() if not needed bcache: fix use-after-free problem in bcache_device_free() zram: replace fsync_bdev with sync_blockdev ...
| * nbd: error out if socket index doesn't match in nbd_handle_reply()Yu Kuai2021-11-021-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | commit fcf3d633d8e1 ("nbd: check sock index in nbd_read_stat()") just add error message when socket index doesn't match. Since the request and reply must be transmitted over the same socket, it's ok to error out in such situation. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211101092538.1155842-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: Fix hungtask when nbd_config_putYe Bin2021-11-021-20/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got follow issue: [ 247.381177] INFO: task kworker/u10:0:47 blocked for more than 120 seconds. [ 247.382644] Not tainted 4.19.90-dirty #140 [ 247.383502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.385027] Call Trace: [ 247.388384] schedule+0xb8/0x3c0 [ 247.388966] schedule_timeout+0x2b4/0x380 [ 247.392815] wait_for_completion+0x367/0x510 [ 247.397713] flush_workqueue+0x32b/0x1340 [ 247.402700] drain_workqueue+0xda/0x3c0 [ 247.403442] destroy_workqueue+0x7b/0x690 [ 247.405014] nbd_config_put.cold+0x2f9/0x5b6 [ 247.405823] recv_work+0x1fd/0x2b0 [ 247.406485] process_one_work+0x70b/0x1610 [ 247.407262] worker_thread+0x5a9/0x1060 [ 247.408699] kthread+0x35e/0x430 [ 247.410918] ret_from_fork+0x1f/0x30 We can reproduce issue as follows: 1. Inject memory fault in nbd_start_device -1244,10 +1248,18 @@ static int nbd_start_device(struct nbd_device *nbd) nbd_dev_dbg_init(nbd); for (i = 0; i < num_connections; i++) { struct recv_thread_args *args; - - args = kzalloc(sizeof(*args), GFP_KERNEL); + + if (i == 1) { + args = NULL; + printk("%s: inject malloc error\n", __func__); + } + else + args = kzalloc(sizeof(*args), GFP_KERNEL); 2. Inject delay in recv_work -757,6 +760,8 @@ static void recv_work(struct work_struct *work) blk_mq_complete_request(blk_mq_rq_from_pdu(cmd)); } + printk("%s: comm=%s pid=%d\n", __func__, current->comm, current->pid); + mdelay(5 * 1000); nbd_config_put(nbd); atomic_dec(&config->recv_threads); wake_up(&config->recv_wq); 3. Create nbd server nbd-server 8000 /tmp/disk 4. Create nbd client nbd-client localhost 8000 /dev/nbd1 Then will trigger above issue. Reason is when add delay in recv_work, lead to release the last reference of 'nbd->config_refs'. nbd_config_put will call flush_workqueue to make all work finish. Obviously, it will lead to deadloop. To solve this issue, according to Josef's suggestion move 'recv_work' init from start device to nbd_dev_add, then destroy 'recv_work'when nbd device teardown. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211102015237.2309763-5-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: Fix incorrect error handle when first_minor is illegal in nbd_dev_addYe Bin2021-11-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | If first_minor is illegal will goto out_free_idr label, this will miss cleanup disk. Fixes: b1a811633f73 ("block: nbd: add sanity check for first_minor") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211102015237.2309763-4-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: fix possible overflow for 'first_minor' in nbd_dev_add()Yu Kuai2021-11-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | If 'part_shift' is not zero, then 'index << part_shift' might overflow to a value that is not greater than '0xfffff', then sysfs might complains about duplicate creation. Fixes: b0d9111a2d53 ("nbd: use an idr to keep track of nbd devices") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211102015237.2309763-3-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: fix max value for 'first_minor'Yu Kuai2021-11-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b1a811633f73 ("block: nbd: add sanity check for first_minor") checks that 'first_minor' should not be greater than 0xff, which is wrong. Whitout the commit, the details that when user pass 0x100000, it ends up create sysfs dir "/sys/block/43:0" are as follows: nbd_dev_add disk->first_minor = index << part_shift -> default part_shift is 5, first_minor is 0x2000000 device_add_disk ddev->devt = MKDEV(disk->major, disk->first_minor) -> (0x2b << 20) | (0x2000000) = 0x2b00000 device_add device_create_sys_dev_entry format_dev_t sprintf(buffer, "%u:%u", MAJOR(dev), MINOR(dev)); -> got 43:0 sysfs_create_link -> /sys/block/43:0 By the way, with the wrong fix, when part_shift is the default value, only 8 ndb devices can be created since 8 << 5 is greater than 0xff. Since the max bits for 'first_minor' should be the same as what MKDEV() does, which is 20. Change the upper bound of 'first_minor' from 0xff to 0xfffff. Fixes: b1a811633f73 ("block: nbd: add sanity check for first_minor") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211102015237.2309763-2-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds2021-11-011-49/+112
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: - paride driver cleanups (Christoph) - Remove cryptoloop support (Christoph) - null_blk poll support (me) - Now that add_disk() supports proper error handling, add it to various drivers (Luis) - Make ataflop actually work again (Michael) - s390 dasd fixes (Stefan, Heiko) - nbd fixes (Yu, Ye) - Remove redundant wq flush in mtip32xx (Christophe) - NVMe updates - fix a multipath partition scanning deadlock (Hannes Reinecke) - generate uevent once a multipath namespace is operational again (Hannes Reinecke) - support unique discovery controller NQNs (Hannes Reinecke) - fix use-after-free when a port is removed (Israel Rukshin) - clear shadow doorbell memory on resets (Keith Busch) - use struct_size (Len Baker) - add error handling support for add_disk (Luis Chamberlain) - limit the maximal queue size for RDMA controllers (Max Gurtovoy) - use a few more symbolic names (Max Gurtovoy) - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy) - add support for ->map_queues on FC (Saurav Kashyap) - support the current discovery subsystem entry (Hannes Reinecke) - use flex_array_size and struct_size (Len Baker) - bcache fixes (Christoph, Coly, Chao, Lin, Qing) - MD updates (Christoph, Guoqing, Xiao) - Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye) * tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits) null_blk: Fix handling of submit_queues and poll_queues attributes block: ataflop: Fix warning comparing pointer to 0 bcache: replace snprintf in show functions with sysfs_emit bcache: move uapi header bcache.h to bcache code directory nvmet: use flex_array_size and struct_size nvmet: register discovery subsystem as 'current' nvmet: switch check for subsystem type nvme: add new discovery log page entry definitions block: ataflop: more blk-mq refactoring fixes block: remove support for cryptoloop and the xor transfer mtd: add add_disk() error handling rnbd: add error handling support for add_disk() um/drivers/ubd_kern: add error handling support for add_disk() m68k/emu/nfblock: add error handling support for add_disk() xen-blkfront: add error handling support for add_disk() bcache: add error handling support for add_disk() dm: add add_disk() error handling block: aoe: fixup coccinelle warnings nvmet: use struct_size over open coded arithmetic nvme: drop scan_lock and always kick requeue list when removing namespaces ...
| * nbd: Fix use-after-free in pid_showYe Bin2021-10-201-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got issue as follows: [ 263.886511] BUG: KASAN: use-after-free in pid_show+0x11f/0x13f [ 263.888359] Read of size 4 at addr ffff8880bf0648c0 by task cat/746 [ 263.890479] CPU: 0 PID: 746 Comm: cat Not tainted 4.19.90-dirty #140 [ 263.893162] Call Trace: [ 263.893509] dump_stack+0x108/0x15f [ 263.893999] print_address_description+0xa5/0x372 [ 263.894641] kasan_report.cold+0x236/0x2a8 [ 263.895696] __asan_report_load4_noabort+0x25/0x30 [ 263.896365] pid_show+0x11f/0x13f [ 263.897422] dev_attr_show+0x48/0x90 [ 263.898361] sysfs_kf_seq_show+0x24d/0x4b0 [ 263.899479] kernfs_seq_show+0x14e/0x1b0 [ 263.900029] seq_read+0x43f/0x1150 [ 263.900499] kernfs_fop_read+0xc7/0x5a0 [ 263.903764] vfs_read+0x113/0x350 [ 263.904231] ksys_read+0x103/0x270 [ 263.905230] __x64_sys_read+0x77/0xc0 [ 263.906284] do_syscall_64+0x106/0x360 [ 263.906797] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproduce this issue as follows: 1. nbd-server 8000 /tmp/disk 2. nbd-client localhost 8000 /dev/nbd1 3. cat /sys/block/nbd1/pid Then trigger use-after-free in pid_show. Reason is after do step '2', nbd-client progress is already exit. So it's task_struct already freed. To solve this issue, revert part of 6521d39a64b3's modify and remove useless 'recv_task' member of nbd_device. Fixes: 6521d39a64b3 ("nbd: Remove variable 'pid'") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211020073959.2679255-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: fix uaf in nbd_handle_reply()Yu Kuai2021-10-181-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a problem that nbd_handle_reply() might access freed request: 1) At first, a normal io is submitted and completed with scheduler: internel_tag = blk_mq_get_tag -> get tag from sched_tags blk_mq_rq_ctx_init sched_tags->rq[internel_tag] = sched_tag->static_rq[internel_tag] ... blk_mq_get_driver_tag __blk_mq_get_driver_tag -> get tag from tags tags->rq[tag] = sched_tag->static_rq[internel_tag] So, both tags->rq[tag] and sched_tags->rq[internel_tag] are pointing to the request: sched_tags->static_rq[internal_tag]. Even if the io is finished. 2) nbd server send a reply with random tag directly: recv_work nbd_handle_reply blk_mq_tag_to_rq(tags, tag) rq = tags->rq[tag] 3) if the sched_tags->static_rq is freed: blk_mq_sched_free_requests blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i) -> step 2) access rq before clearing rq mapping blk_mq_clear_rq_mapping(set, tags, hctx_idx); __free_pages() -> rq is freed here 4) Then, nbd continue to use the freed request in nbd_handle_reply Fix the problem by get 'q_usage_counter' before blk_mq_tag_to_rq(), thus request is ensured not to be freed because 'q_usage_counter' is not zero. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210916141810.2325276-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply()Yu Kuai2021-10-181-30/+44
| | | | | | | | | | | | | | | | | | | | Prepare to fix uaf in nbd_read_stat(), no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-7-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: clean up return value checking of sock_xmit()Yu Kuai2021-10-181-6/+7
| | | | | | | | | | | | | | | | | | | | | | Check if sock_xmit() return 0 is useless because it'll never return 0, comment it and remove such checkings. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-6-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: don't start request if nbd_queue_rq() failedYu Kuai2021-10-181-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6a468d5990ec ("nbd: don't start req until after the dead connection logic") move blk_mq_start_request() from nbd_queue_rq() to nbd_handle_cmd() to skip starting request if the connection is dead. However, request is still started in other error paths. Currently, blk_mq_end_request() will be called immediately if nbd_queue_rq() failed, thus start request in such situation is useless. So remove blk_mq_start_request() from error paths in nbd_handle_cmd(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-5-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: check sock index in nbd_read_stat()Yu Kuai2021-10-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | The sock that clent send request in nbd_send_cmd() and receive reply in nbd_read_stat() should be the same. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-4-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: make sure request completion won't concurrentYu Kuai2021-10-181-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cddce0116058 ("nbd: Aovid double completion of a request") try to fix that nbd_clear_que() and recv_work() can complete a request concurrently. However, the problem still exists: t1 t2 t3 nbd_disconnect_and_put flush_workqueue recv_work blk_mq_complete_request blk_mq_complete_request_remote -> this is true WRITE_ONCE(rq->state, MQ_RQ_COMPLETE) blk_mq_raise_softirq blk_done_softirq blk_complete_reqs nbd_complete_rq blk_mq_end_request blk_mq_free_request WRITE_ONCE(rq->state, MQ_RQ_IDLE) nbd_clear_que blk_mq_tagset_busy_iter nbd_clear_req __blk_mq_free_request blk_mq_put_tag blk_mq_complete_request -> complete again There are three places where request can be completed in nbd: recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they all hold cmd->lock before completing the request, it's easy to avoid the problem by setting and checking a cmd flag. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: don't handle response without a corresponding request messageYu Kuai2021-10-181-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While handling a response message from server, nbd_read_stat() will try to get request by tag, and then complete the request. However, this is problematic if nbd haven't sent a corresponding request message: t1 t2 submit_bio nbd_queue_rq blk_mq_start_request recv_work nbd_read_stat blk_mq_tag_to_rq blk_mq_complete_request nbd_send_cmd Thus add a new cmd flag 'NBD_CMD_INFLIGHT', it will be set in nbd_send_cmd() and checked in nbd_read_stat(). Noted that this patch can't fix that blk_mq_tag_to_rq() might return a freed request, and this will be fixed in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: add error handling support for add_disk()Luis Chamberlain2021-10-181-1/+5
| | | | | | | | | | | | | | | | | | | | | | We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | nbd: Use blk_validate_block_size() to validate block sizeXie Yongji2021-10-271-1/+2
| | | | | | | | | | | | | | | | | | Use the block layer helper to validate block size instead of open coding it. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20211026144015.188-3-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | nbd: Use invalidate_disk() helper on disconnectXie Yongji2021-10-211-9/+3
|/ | | | | | | | | | | | | | | When a nbd device encounters a writeback error, that error will get propagated to the bd_inode's wb_err field. Then if this nbd device's backend is disconnected and another is attached, we will get back the previous writeback error on fsync, which is unexpected. To fix it, let's use invalidate_disk() helper to invalidate the disk on disconnect instead of just setting disk's capacity to zero. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-5-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use shifts rather than multipliesNick Desaulniers2021-09-301-12/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fad7cd3310db ("nbd: add the check to prevent overflow in __nbd_ioctl()") raised an issue from the fallback helpers added in commit f0907827a8a9 ("compiler.h: enable builtin overflow checkers and add fallback code") ERROR: modpost: "__divdi3" [drivers/block/nbd.ko] undefined! As Stephen Rothwell notes: The added check_mul_overflow() call is being passed 64 bit values. COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW is not set for this build (see include/linux/overflow.h). Specifically, the helpers for checking whether the results of a multiplication overflowed (__unsigned_mul_overflow, __signed_add_overflow) use the division operator when !COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW. This is problematic for 64b operands on 32b hosts. This was fixed upstream by commit 76ae847497bc ("Documentation: raise minimum supported version of GCC to 5.1") which is not suitable to be backported to stable. Further, __builtin_mul_overflow() would emit a libcall to a compiler-rt-only symbol when compiling with clang < 14 for 32b targets. ld.lld: error: undefined symbol: __mulodi4 In order to keep stable buildable with GCC 4.9 and clang < 14, modify struct nbd_config to instead track the number of bits of the block size; reconstructing the block size using runtime checked shifts that are not problematic for those compilers and in a ways that can be backported to stable. In nbd_set_size, we do validate that the value of blksize must be a power of two (POT) and is in the range of [512, PAGE_SIZE] (both inclusive). This does modify the debugfs interface. Cc: stable@vger.kernel.org Cc: Arnd Bergmann <arnd@kernel.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Link: https://github.com/ClangBuiltLinux/linux/issues/1438 Link: https://lore.kernel.org/all/20210909182525.372ee687@canb.auug.org.au/ Link: https://lore.kernel.org/stable/CAHk-=whiQBofgis_rkniz8GBP9wZtSZdcDEffgSLO62BUGV3gg@mail.gmail.com/ Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Suggested-by: Kees Cook <keescook@chromium.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210920232533.4092046-1-ndesaulniers@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-311-79/+99
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "Sitting on top of the core block changes, here are the driver changes for the 5.15 merge window: - NVMe updates via Christoph: - suspend improvements for devices with an HMB (Keith Busch) - handle double completions more gacefull (Sagi Grimberg) - cleanup the selects for the nvme core code a bit (Sagi Grimberg) - don't update queue count when failing to set io queues (Ruozhu Li) - various nvmet connect fixes (Amit Engel) - cleanup lightnvm leftovers (Keith Busch, me) - small cleanups (Colin Ian King, Hou Pu) - add tracing for the Set Features command (Hou Pu) - CMB sysfs cleanups (Keith Busch) - add a mutex_destroy call (Keith Busch) - remove lightnvm subsystem. It's served its purpose and ultimately led to zoned nvme support, we no longer need it (Christoph) - revert floppy O_NDELAY fix (Denis) - nbd fixes (Hou, Pavel, Baokun) - nbd locking fixes (Tetsuo) - nbd device removal fixes (Christoph) - raid10 rcu warning fix (Xiao) - raid1 write behind fix (Guoqing) - rnbd fixes (Gioh, Md Haris) - misc fixes (Colin)" * tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits) Revert "floppy: reintroduce O_NDELAY fix" raid1: ensure write behind bio has less than BIO_MAX_VECS sectors md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard nbd: remove nbd->destroy_complete nbd: only return usable devices from nbd_find_unused nbd: set nbd->index before releasing nbd_index_mutex nbd: prevent IDR lookups from finding partially initialized devices nbd: reset NBD to NULL when restarting in nbd_genl_connect nbd: add missing locking to the nbd_dev_add error path nvme: remove the unused NVME_NS_* enum nvme: remove nvm_ndev from ns nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers block: nbd: add sanity check for first_minor nvmet: check that host sqsize does not exceed ctrl MQES nvmet: avoid duplicate qid in connect cmd nvmet: pass back cntlid on successful completion nvme-rdma: don't update queue count when failing to set io queues nvme-tcp: don't update queue count when failing to set io queues nvme-tcp: pair send_mutex init with destroy nvme: allow user toggling hmb usage ...
| * nbd: remove nbd->destroy_completeChristoph Hellwig2021-08-251-38/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nbd->destroy_complete pointer is not really needed. For creating a device without a specific index we now simplify skip devices marked NBD_DESTROY_ON_DISCONNECT as there is not much point to reuse them. For device creation with a specific index there is no real need to treat the case of a requested but not finished disconnect different than any other device that is being shutdown, i.e. we can just return an error, as a slightly different race window would anyway. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot+2c98885bcd769f56b6d6@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-7-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: only return usable devices from nbd_find_unusedChristoph Hellwig2021-08-251-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | Device marked as NBD_DESTROY_ON_DISCONNECT can and should be skipped given that they won't survive the disconnect. So skip them and try to grab a reference directly and just continue if the the devices is being torn down or created and thus has a zero refcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-6-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: set nbd->index before releasing nbd_index_mutexTetsuo Handa2021-08-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Set nbd->index before releasing nbd_index_mutex, as populate_nbd_status() might access nbd->index as soon as nbd_index_mutex is released. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-5-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: prevent IDR lookups from finding partially initialized devicesTetsuo Handa2021-08-251-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously nbd_index_mutex was held during whole add/remove/lookup operations in order to guarantee that partially initialized devices are not reachable via idr_find() or idr_for_each(). But now that partially initialized devices become reachable as soon as idr_alloc() succeeds, we need to skip partially initialized devices. Since it seems that all functions use refcount_inc_not_zero(&nbd->refs) in order to skip destroying devices, update nbd->refs from zero to non-zero as the last step of device initialization in order to also skip partially initialized devices. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [hch: split from a larger patch, added comments] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-4-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: reset NBD to NULL when restarting in nbd_genl_connectChristoph Hellwig2021-08-251-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When nbd_genl_connect restarts to wait for a disconnecting device, nbd needs to be reset to NULL. Do that by facoring out a helper to find an unused device. Fixes: 6177b56c96ff ("nbd: refactor device search and allocation in nbd_genl_connect") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-3-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: add missing locking to the nbd_dev_add error pathTetsuo Handa2021-08-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | idr_remove needs external synchronization. Fixes: 6e4df4c64881 ("nbd: reduce the nbd_index_mutex scope") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210825163108.50713-2-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: nbd: add sanity check for first_minorPavel Skripkin2021-08-161-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot hit WARNING in internal_create_group(). The problem was in too big disk->first_minor. disk->first_minor is initialized by value, which comes from userspace and there wasn't any sanity checks about value correctness. It can cause duplicate creation of sysfs files/links, because disk->first_minor will be passed to MKDEV() which causes truncation to byte. Since maximum minor value is 0xff, let's check if first_minor is correct minor number. NOTE: the root case of the reported warning was in wrong error handling in register_disk(), but we can avoid passing knowingly wrong values to sysfs API, because sysfs error messages can confuse users. For example: user passed 1048576 as index, but sysfs complains about duplicate creation of /dev/block/43:0. It's not obvious how 1048576 becomes 0. Log and reproducer for above example can be found on syzkaller bug report page. Link: https://syzkaller.appspot.com/bug?id=03c2ae9146416edf811958d5fd7acfab75b143d1 Fixes: b0d9111a2d53 ("nbd: use an idr to keep track of nbd devices") Reported-by: syzbot+9937dc42271cd87d4b98@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: reduce the nbd_index_mutex scopeChristoph Hellwig2021-08-131-27/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | nbd_index_mutex is currently held over add_disk and inside ->open, which leads to lock order reversals. Refactor the device creation code path so that nbd_dev_add is called without nbd_index_mutex lock held and only takes it for the IDR insertation. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-7-hch@lst.de [axboe: fix whitespace] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: refactor device search and allocation in nbd_genl_connectChristoph Hellwig2021-08-131-31/+14
| | | | | | | | | | | | | | | | | | | | | | Use idr_for_each_entry instead of the awkward callback to find an existing device for the index == -1 case, and de-duplicate the device allocation if no existing device was found. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-6-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: return the allocated nbd_device from nbd_dev_addChristoph Hellwig2021-08-131-12/+9
| | | | | | | | | | | | | | | | | | | | Return the device we just allocated instead of doing an extra search for it in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-5-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: remove nbd_del_diskChristoph Hellwig2021-08-131-12/+5
| | | | | | | | | | | | | | | | | | | | Fold nbd_del_disk and remove the pointless NULL check on ->disk given that it is always set for a successfully allocated nbd_device structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-4-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: refactor device removalChristoph Hellwig2021-08-131-24/+13
| | | | | | | | | | | | | | | | | | | | Share common code for the synchronous and workqueue based device removal, and remove the pointless use of refcount_dec_and_mutex_lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-3-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: do del_gendisk() asynchronously for NBD_DESTROY_ON_DISCONNECTHou Tao2021-08-131-9/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now open_mutex is used to synchronize partition operations (e.g, blk_drop_partitions() and blkdev_reread_part()), however it makes nbd driver broken, because nbd may call del_gendisk() in nbd_release() or nbd_genl_disconnect() if NBD_CFLAG_DESTROY_ON_DISCONNECT is enabled, and deadlock occurs, as shown below: // AB-BA dead-lock nbd_genl_disconnect blkdev_open nbd_disconnect_and_put lock bd_mutex // last ref nbd_put lock nbd_index_mutex del_gendisk nbd_open try lock nbd_index_mutex try lock bd_mutex or // AA dead-lock nbd_release lock bd_mutex nbd_put try lock bd_mutex Instead of fixing block layer (e.g, introduce another lock), fixing the nbd driver to call del_gendisk() in a kworker when NBD_DESTROY_ON_DISCONNECT is enabled. When NBD_DESTROY_ON_DISCONNECT is disabled, nbd device will always be destroy through module removal, and there is no risky of deadlock. To ensure the reuse of nbd index succeeds, moving the calling of idr_remove() after del_gendisk(), so if the reused index is not found in nbd_index_idr, the old disk must have been deleted. And reusing the existing destroy_complete mechanism to ensure nbd_genl_connect() will wait for the completion of del_gendisk(). Also adding a new workqueue for nbd removal, so nbd_cleanup() can ensure all removals complete before exits. Reported-by: syzbot+0fe7752e52337864d29b@syzkaller.appspotmail.com Fixes: c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210811124428.2368491-2-hch@lst.de Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nbd: add the check to prevent overflow in __nbd_ioctl()Baokun Li2021-08-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If user specify a large enough value of NBD blocks option, it may trigger signed integer overflow which may lead to nbd->config->bytesize becomes a large or small value, zero in particular. UBSAN: Undefined behaviour in drivers/block/nbd.c:325:31 signed integer overflow: 1024 * 4611686155866341414 cannot be represented in type 'long long int' [...] Call trace: [...] handle_overflow+0x188/0x1dc lib/ubsan.c:192 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213 nbd_size_set drivers/block/nbd.c:325 [inline] __nbd_ioctl drivers/block/nbd.c:1342 [inline] nbd_ioctl+0x998/0xa10 drivers/block/nbd.c:1395 __blkdev_driver_ioctl block/ioctl.c:311 [inline] [...] Although it is not a big deal, still silence the UBSAN by limit the input value. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210804021212.990223-1-libaokun1@huawei.com [axboe: dropped unlikely()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | nbd: Aovid double completion of a requestXie Yongji2021-08-131-3/+11
|/ | | | | | | | | | | | | | | | | There is a race between iterating over requests in nbd_clear_que() and completing requests in recv_work(), which can lead to double completion of a request. To fix it, flush the recv worker before iterating over the requests and don't abort the completed request while iterating. Fixes: 96d97e17828f ("nbd: clear_sock on netlink disconnect") Reported-by: Jiang Yadong <jiangyadong@bytedance.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: fix order of cleaning up the queue and freeing the tagsetWang Qing2021-07-151-1/+1
| | | | | | | | | | | | We must release the queue before freeing the tagset. Fixes: 4af5f2e03013 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk") Reported-and-tested-by: syzbot+9ca43ff47167c0ee3466@syzkaller.appspotmail.com Signed-off-by: Wang Qing <wangqing@vivo.com> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210706040016.1360412-1-guoqing.jiang@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: provide a way for userspace processes to identify device backendsPrasanna Kumar Kalever2021-06-301-1/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: On reconfigure of device, there is no way to defend if the backend storage is matching with the initial backend storage. Say, if an initial connect request for backend "pool1/image1" got mapped to /dev/nbd0 and the userspace process is terminated. A next reconfigure request within NBD_ATTR_DEAD_CONN_TIMEOUT is allowed to use /dev/nbd0 for a different backend "pool1/image2" For example, an operation like below could be dangerous: $ sudo rbd-nbd map --try-netlink rbd-pool/ext4-image /dev/nbd0 $ sudo blkid /dev/nbd0 /dev/nbd0: UUID="bfc444b4-64b1-418f-8b36-6e0d170cfc04" TYPE="ext4" $ sudo pkill -9 rbd-nbd $ sudo rbd-nbd attach --try-netlink --device /dev/nbd0 rbd-pool/xfs-image /dev/nbd0 $ sudo blkid /dev/nbd0 /dev/nbd0: UUID="d29bf343-6570-4069-a9ea-2fa156ced908" TYPE="xfs" Solution: Provide a way for userspace processes to keep some metadata to identify between the device and the backend, so that when a reconfigure request is made, we can compare and avoid such dangerous operations. With this solution, as part of the initial connect request, backend path can be stored in the sysfs per device config, so that on a reconfigure request it's easy to check if the backend path matches with the initial connect backend path. Please note, ioctl interface to nbd will not have these changes, as there won't be any reconfigure. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210429102828.31248-1-prasanna.kalever@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use blk_mq_alloc_disk and blk_cleanup_diskChristoph Hellwig2021-06-111-32/+21
| | | | | | | | | | Use blk_mq_alloc_disk and blk_cleanup_disk to simplify the gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210602065345.355274-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: share nbd_put and return by goto put_nbdSun Ke2021-05-121-4/+3
| | | | | | | | | | | | | Replace the following two statements by the statement “goto put_nbd;” nbd_put(nbd); return 0; Signed-off-by: Sun Ke <sunke32@huawei.com> Suggested-by: Markus Elfring <Markus.Elfring@web.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210512114331.1233964-3-sunke32@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: Fix NULL pointer in flush_workqueueSun Ke2021-05-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Open /dev/nbdX first, the config_refs will be 1 and the pointers in nbd_device are still null. Disconnect /dev/nbdX, then reference a null recv_workq. The protection by config_refs in nbd_genl_disconnect is useless. [ 656.366194] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 656.368943] #PF: supervisor write access in kernel mode [ 656.369844] #PF: error_code(0x0002) - not-present page [ 656.370717] PGD 10cc87067 P4D 10cc87067 PUD 1074b4067 PMD 0 [ 656.371693] Oops: 0002 [#1] SMP [ 656.372242] CPU: 5 PID: 7977 Comm: nbd-client Not tainted 5.11.0-rc5-00040-g76c057c84d28 #1 [ 656.373661] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 656.375904] RIP: 0010:mutex_lock+0x29/0x60 [ 656.376627] Code: 00 0f 1f 44 00 00 55 48 89 fd 48 83 05 6f d7 fe 08 01 e8 7a c3 ff ff 48 83 05 6a d7 fe 08 01 31 c0 65 48 8b 14 25 00 6d 01 00 <f0> 48 0f b1 55 d [ 656.378934] RSP: 0018:ffffc900005eb9b0 EFLAGS: 00010246 [ 656.379350] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 656.379915] RDX: ffff888104cf2600 RSI: ffffffffaae8f452 RDI: 0000000000000020 [ 656.380473] RBP: 0000000000000020 R08: 0000000000000000 R09: ffff88813bd6b318 [ 656.381039] R10: 00000000000000c7 R11: fefefefefefefeff R12: ffff888102710b40 [ 656.381599] R13: ffffc900005eb9e0 R14: ffffffffb2930680 R15: ffff88810770ef00 [ 656.382166] FS: 00007fdf117ebb40(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000 [ 656.382806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 656.383261] CR2: 0000000000000020 CR3: 0000000100c84000 CR4: 00000000000006e0 [ 656.383819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 656.384370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 656.384927] Call Trace: [ 656.385111] flush_workqueue+0x92/0x6c0 [ 656.385395] nbd_disconnect_and_put+0x81/0xd0 [ 656.385716] nbd_genl_disconnect+0x125/0x2a0 [ 656.386034] genl_family_rcv_msg_doit.isra.0+0x102/0x1b0 [ 656.386422] genl_rcv_msg+0xfc/0x2b0 [ 656.386685] ? nbd_ioctl+0x490/0x490 [ 656.386954] ? genl_family_rcv_msg_doit.isra.0+0x1b0/0x1b0 [ 656.387354] netlink_rcv_skb+0x62/0x180 [ 656.387638] genl_rcv+0x34/0x60 [ 656.387874] netlink_unicast+0x26d/0x590 [ 656.388162] netlink_sendmsg+0x398/0x6c0 [ 656.388451] ? netlink_rcv_skb+0x180/0x180 [ 656.388750] ____sys_sendmsg+0x1da/0x320 [ 656.389038] ? ____sys_recvmsg+0x130/0x220 [ 656.389334] ___sys_sendmsg+0x8e/0xf0 [ 656.389605] ? ___sys_recvmsg+0xa2/0xf0 [ 656.389889] ? handle_mm_fault+0x1671/0x21d0 [ 656.390201] __sys_sendmsg+0x6d/0xe0 [ 656.390464] __x64_sys_sendmsg+0x23/0x30 [ 656.390751] do_syscall_64+0x45/0x70 [ 656.391017] entry_SYSCALL_64_after_hwframe+0x44/0xa9 To fix it, just add if (nbd->recv_workq) to nbd_disconnect_and_put(). Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs") Signed-off-by: Sun Ke <sunke32@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210512114331.1233964-2-sunke32@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: handle device refs for DESTROY_ON_DISCONNECT properlyJosef Bacik2021-02-221-13/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There exists a race where we can be attempting to create a new nbd configuration while a previous configuration is going down, both configured with DESTROY_ON_DISCONNECT. Normally devices all have a reference of 1, as they won't be cleaned up until the module is torn down. However with DESTROY_ON_DISCONNECT we'll make sure that there is only 1 reference (generally) on the device for the config itself, and then once the config is dropped, the device is torn down. The race that exists looks like this TASK1 TASK2 nbd_genl_connect() idr_find() refcount_inc_not_zero(nbd) * count is 2 here ^^ nbd_config_put() nbd_put(nbd) (count is 1) setup new config check DESTROY_ON_DISCONNECT put_dev = true if (put_dev) nbd_put(nbd) * free'd here ^^ In nbd_genl_connect() we assume that the nbd ref count will be 2, however clearly that won't be true if the nbd device had been setup as DESTROY_ON_DISCONNECT with its prior configuration. Fix this by getting rid of the runtime flag to check if we need to mess with the nbd device refcount, and use the device NBD_DESTROY_ON_DISCONNECT flag to check if we need to adjust the ref counts. This was reported by syzkaller with the following kasan dump BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] BUG: KASAN: use-after-free in refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76 Read of size 4 at addr ffff888143bf71a0 by task systemd-udevd/8451 CPU: 0 PID: 8451 Comm: systemd-udevd Not tainted 5.11.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230 __kasan_report mm/kasan/report.c:396 [inline] kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413 check_memory_region_inline mm/kasan/generic.c:179 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:185 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76 refcount_dec_and_mutex_lock+0x19/0x140 lib/refcount.c:115 nbd_put drivers/block/nbd.c:248 [inline] nbd_release+0x116/0x190 drivers/block/nbd.c:1508 __blkdev_put+0x548/0x800 fs/block_dev.c:1579 blkdev_put+0x92/0x570 fs/block_dev.c:1632 blkdev_close+0x8c/0xb0 fs/block_dev.c:1640 __fput+0x283/0x920 fs/file_table.c:280 task_work_run+0xdd/0x190 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fc1e92b5270 Code: 73 01 c3 48 8b 0d 38 7d 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c1 20 00 00 75 10 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee fb ff ff 48 89 04 24 RSP: 002b:00007ffe8beb2d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc1e92b5270 RDX: 000000000aba9500 RSI: 0000000000000000 RDI: 0000000000000007 RBP: 00007fc1ea16f710 R08: 000000000000004a R09: 0000000000000008 R10: 0000562f8cb0b2a8 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562f8cb0afd0 R14: 0000000000000003 R15: 000000000000000e Allocated by task 1: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:401 [inline] ____kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:682 [inline] nbd_dev_add+0x44/0x8e0 drivers/block/nbd.c:1673 nbd_init+0x250/0x271 drivers/block/nbd.c:2394 do_one_initcall+0x103/0x650 init/main.c:1223 do_initcall_level init/main.c:1296 [inline] do_initcalls init/main.c:1312 [inline] do_basic_setup init/main.c:1332 [inline] kernel_init_freeable+0x605/0x689 init/main.c:1533 kernel_init+0xd/0x1b8 init/main.c:1421 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296 Freed by task 8451: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:356 ____kasan_slab_free+0xe1/0x110 mm/kasan/common.c:362 kasan_slab_free include/linux/kasan.h:192 [inline] slab_free_hook mm/slub.c:1547 [inline] slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1580 slab_free mm/slub.c:3143 [inline] kfree+0xdb/0x3b0 mm/slub.c:4139 nbd_dev_remove drivers/block/nbd.c:243 [inline] nbd_put.part.0+0x180/0x1d0 drivers/block/nbd.c:251 nbd_put drivers/block/nbd.c:295 [inline] nbd_config_put+0x6dd/0x8c0 drivers/block/nbd.c:1242 nbd_release+0x103/0x190 drivers/block/nbd.c:1507 __blkdev_put+0x548/0x800 fs/block_dev.c:1579 blkdev_put+0x92/0x570 fs/block_dev.c:1632 blkdev_close+0x8c/0xb0 fs/block_dev.c:1640 __fput+0x283/0x920 fs/file_table.c:280 task_work_run+0xdd/0x190 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff888143bf7000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 416 bytes inside of 1024-byte region [ffff888143bf7000, ffff888143bf7400) The buggy address belongs to the page: page:000000005238f4ce refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x143bf0 head:000000005238f4ce order:3 compound_mapcount:0 compound_pincount:0 flags: 0x57ff00000010200(slab|head) raw: 057ff00000010200 ffffea00004b1400 0000000300000003 ffff888010c41140 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888143bf7080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888143bf7100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888143bf7180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888143bf7200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-and-tested-by: syzbot+429d3f82d757c211bff3@syzkaller.appspotmail.com Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds2021-02-211-24/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: - Remove the skd driver. It's been EOL for a long time (Damien) - NVMe pull requests - fix multipath handling of ->queue_rq errors (Chao Leng) - nvmet cleanups (Chaitanya Kulkarni) - add a quirk for buggy Amazon controller (Filippo Sironi) - avoid devm allocations in nvme-hwmon that don't interact well with fabrics (Hannes Reinecke) - sysfs cleanups (Jiapeng Chong) - fix nr_zones for multipath (Keith Busch) - nvme-tcp crash fix for no-data commands (Sagi Grimberg) - nvmet-tcp fixes (Sagi Grimberg) - add a missing __rcu annotation (Christoph) - failed reconnect fixes (Chao Leng) - various tracing improvements (Michal Krakowiak, Johannes Thumshirn) - switch the nvmet-fc assoc_list to use RCU protection (Leonid Ravich) - resync the status codes with the latest spec (Max Gurtovoy) - minor nvme-tcp improvements (Sagi Grimberg) - various cleanups (Rikard Falkeborn, Minwoo Im, Chaitanya Kulkarni, Israel Rukshin) - Floppy O_NDELAY fix (Denis) - MD pull request - raid5 chunk_sectors fix (Guoqing) - Use lore links (Kees) - Use DEFINE_SHOW_ATTRIBUTE for nbd (Liao) - loop lock scaling (Pavel) - mtip32xx PCI fixes (Bjorn) - bcache fixes (Kai, Dongdong) - Misc fixes (Tian, Yang, Guoqing, Joe, Andy) * tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block: (64 commits) lightnvm: pblk: Replace guid_copy() with export_guid()/import_guid() lightnvm: fix unnecessary NULL check warnings nvme-tcp: fix crash triggered with a dataless request submission block: Replace lkml.org links with lore nbd: Convert to DEFINE_SHOW_ATTRIBUTE nvme: add 48-bit DMA address quirk for Amazon NVMe controllers nvme-hwmon: rework to avoid devm allocation nvmet: remove else at the end of the function nvmet: add nvmet_req_subsys() helper nvmet: use min of device_path and disk len nvmet: use invalid cmd opcode helper nvmet: use invalid cmd opcode helper nvmet: add helper to report invalid opcode nvmet: remove extra variable in id-ns handler nvmet: make nvmet_find_namespace() req based nvmet: return uniform error for invalid ns nvmet: set status to 0 in case for invalid nsid nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queues nvme-multipath: set nr_zones for zoned namespaces nvmet-tcp: fix potential race of tcp socket closing accept_work ...
| * nbd: Convert to DEFINE_SHOW_ATTRIBUTELiao Pingfang2021-02-101-24/+4
| | | | | | | | | | | | | | | | Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Liao Pingfang <winndows@163.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | nbd: freeze the queue while we're adding connectionsJosef Bacik2021-01-251-0/+8
|/ | | | | | | | | | | | | | | | | | | | When setting up a device, we can krealloc the config->socks array to add new sockets to the configuration. However if we happen to get a IO request in at this point even though we aren't setup we could hit a UAF, as we deref config->socks without any locking, assuming that the configuration was setup already and that ->socks is safe to access it as we have a reference on the configuration. But there's nothing really preventing IO from occurring at this point of the device setup, we don't want to incur the overhead of a lock to access ->socks when it will never change while the device is running. To fix this UAF scenario simply freeze the queue if we are adding sockets. This will protect us from this particular case without adding any additional overhead for the normal running case. Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: Respect max_part for all partition scansJosh Triplett2020-12-171-3/+6
| | | | | | | | | | | The creation path of the NBD device respects max_part and only scans for partitions if max_part is not 0. However, some other code paths ignore max_part, and unconditionally scan for partitions. Add a check for max_part on each partition scan. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: stop using bdget_disk for partition 0Christoph Hellwig2020-12-011-3/+1
| | | | | | | | | | We can just dereference the point in struct gendisk instead. Also remove the now unused export. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the nr_sects field in struct hd_structChristoph Hellwig2020-12-011-1/+1
| | | | | | | | | | | | | | | | | Now that the hd_struct always has a block device attached to it, there is no need for having two size field that just get out of sync. Additionally the field in hd_struct did not use proper serialization, possibly allowing for torn writes. By only using the block_device field this problem also gets fixed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use set_capacity_and_notifyChristoph Hellwig2020-11-161-12/+3
| | | | | | | | | | | Use set_capacity_and_notify to update the disk and block device sizes and send a RESIZE uevent to userspace. Note that blktests relies on uevents being sent also for updates that did not change the device size, so the explicit kobject_uevent remains for that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: validate the block size in nbd_set_sizeChristoph Hellwig2020-11-161-32/+15
| | | | | | | | Move the validation of the block from the callers into nbd_set_size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: refactor size updatesChristoph Hellwig2020-11-161-26/+18
| | | | | | | | | | | Merge nbd_size_set and nbd_size_update into a single function that also updates the nbd_config fields. This new function takes the device size in bytes as the first argument, and the blocksize as the second argument, simplifying the calculations required in most callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>