| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
struct bioctx includes the ref refcount_t to track the number of I/O
fragments used to process a target BIO as well as ensure that the zone
of the BIO is kept in the active state throughout the lifetime of the
BIO. However, since decrementing of this reference count is done in the
target .end_io method, the function bio_endio() must be called multiple
times for read and write target BIOs, which causes problems with the
value of the __bi_remaining struct bio field for chained BIOs (e.g. the
clone BIO passed by dm core is large and splits into fragments by the
block layer), resulting in incorrect values and inconsistencies with the
BIO_CHAIN flag setting. This is turn triggers the BUG_ON() call:
BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
in bio_remaining_done() called from bio_endio().
Fix this ensuring that bio_endio() is called only once for any target
BIO by always using internal clone BIOs for processing any read or
write target BIO. This allows reference counting using the target BIO
context counter to trigger the target BIO completion bio_endio() call
once all data, metadata and other zone work triggered by the BIO
complete.
Overall, this simplifies the code too as the target .end_io becomes
unnecessary and differences between read and write BIO issuing and
completion processing disappear.
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- The biggest change this cycle is to remove support for the legacy IO
path (.request_fn) from request-based DM.
Jens has already started preparing for complete removal of the legacy
IO path in 4.21 but this earlier removal of support from DM has been
coordinated with Jens (as evidenced by the commit being attributed to
him).
Making request-based DM exclussively blk-mq only cleans up that
portion of DM core quite nicely.
- Convert the thinp and zoned targets over to using refcount_t where
applicable.
- A couple fixes to the DM zoned target for refcounting and other races
buried in the implementation of metadata block creation and use.
- Small cleanups to remove redundant unlikely() around a couple
WARN_ON_ONCE().
- Simplify how dm-ioctl copies from userspace, eliminating some
potential for a malicious user trying to change the executed ioctl
after its processing has begun.
- Tweaked DM crypt target to use the DM device name when naming the
various workqueues created for a particular DM crypt device (makes
the N workqueues for a DM crypt device more easily understood and
enhances user's accounting capabilities at a glance via "ps")
- Small fixup to remove dead branch in DM writecache's memory_entry().
* tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm writecache: remove disabled code in memory_entry()
dm zoned: fix various dmz_get_mblock() issues
dm zoned: fix metadata block ref counting
dm raid: avoid bitmap with raid4/5/6 journal device
dm crypt: make workqueue names device-specific
dm: add dm_table_device_name()
dm ioctl: harden copy_params()'s copy_from_user() from malicious users
dm: remove unnecessary unlikely() around WARN_ON_ONCE()
dm zoned: target: use refcount_t for dm zoned reference counters
dm thin: use refcount_t for thin_c reference counting
dm table: require that request-based DM be layered on blk-mq devices
dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
dm: remove legacy request-based IO path
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The API surrounding refcount_t should be used in place of atomic_t
when variables are being used as reference counters. This API can
prevent issues such as counter overflows and use-after-free
conditions. Within the dm zoned target stack, the atomic_t API is
used for bioctx->ref and cw->refcount. Change these to use
refcount_t, avoiding the issues mentioned.
Signed-off-by: John Pittman <jpittman@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce the blkdev_nr_zones() helper function to get the total
number of zones of a zoned block device. This number is always 0 for a
regular block device (q->limits.zoned == BLK_ZONED_NONE case).
Replace hard-coded number of zones calculation in dmz_get_zoned_device()
with a call to this helper.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch avoids that lockdep reports the following:
======================================================
WARNING: possible circular locking dependency detected
4.18.0-rc1 #62 Not tainted
------------------------------------------------------
kswapd0/84 is trying to acquire lock:
00000000c313516d (&xfs_nondir_ilock_class){++++}, at: xfs_free_eofblocks+0xa2/0x1e0
but task is already holding lock:
00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (fs_reclaim){+.+.}:
kmem_cache_alloc+0x2c/0x2b0
radix_tree_node_alloc.constprop.19+0x3d/0xc0
__radix_tree_create+0x161/0x1c0
__radix_tree_insert+0x45/0x210
dmz_map+0x245/0x2d0 [dm_zoned]
__map_bio+0x40/0x260
__split_and_process_non_flush+0x116/0x220
__split_and_process_bio+0x81/0x180
__dm_make_request.isra.32+0x5a/0x100
generic_make_request+0x36e/0x690
submit_bio+0x6c/0x140
mpage_readpages+0x19e/0x1f0
read_pages+0x6d/0x1b0
__do_page_cache_readahead+0x21b/0x2d0
force_page_cache_readahead+0xc4/0x100
generic_file_read_iter+0x7c6/0xd20
__vfs_read+0x102/0x180
vfs_read+0x9b/0x140
ksys_read+0x55/0xc0
do_syscall_64+0x5a/0x1f0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #1 (&dmz->chunk_lock){+.+.}:
dmz_map+0x133/0x2d0 [dm_zoned]
__map_bio+0x40/0x260
__split_and_process_non_flush+0x116/0x220
__split_and_process_bio+0x81/0x180
__dm_make_request.isra.32+0x5a/0x100
generic_make_request+0x36e/0x690
submit_bio+0x6c/0x140
_xfs_buf_ioapply+0x31c/0x590
xfs_buf_submit_wait+0x73/0x520
xfs_buf_read_map+0x134/0x2f0
xfs_trans_read_buf_map+0xc3/0x580
xfs_read_agf+0xa5/0x1e0
xfs_alloc_read_agf+0x59/0x2b0
xfs_alloc_pagf_init+0x27/0x60
xfs_bmap_longest_free_extent+0x43/0xb0
xfs_bmap_btalloc_nullfb+0x7f/0xf0
xfs_bmap_btalloc+0x428/0x7c0
xfs_bmapi_write+0x598/0xcc0
xfs_iomap_write_allocate+0x15a/0x330
xfs_map_blocks+0x1cf/0x3f0
xfs_do_writepage+0x15f/0x7b0
write_cache_pages+0x1ca/0x540
xfs_vm_writepages+0x65/0xa0
do_writepages+0x48/0xf0
__writeback_single_inode+0x58/0x730
writeback_sb_inodes+0x249/0x5c0
wb_writeback+0x11e/0x550
wb_workfn+0xa3/0x670
process_one_work+0x228/0x670
worker_thread+0x3c/0x390
kthread+0x11c/0x140
ret_from_fork+0x3a/0x50
-> #0 (&xfs_nondir_ilock_class){++++}:
down_read_nested+0x43/0x70
xfs_free_eofblocks+0xa2/0x1e0
xfs_fs_destroy_inode+0xac/0x270
dispose_list+0x51/0x80
prune_icache_sb+0x52/0x70
super_cache_scan+0x127/0x1a0
shrink_slab.part.47+0x1bd/0x590
shrink_node+0x3b5/0x470
balance_pgdat+0x158/0x3b0
kswapd+0x1ba/0x600
kthread+0x11c/0x140
ret_from_fork+0x3a/0x50
other info that might help us debug this:
Chain exists of:
&xfs_nondir_ilock_class --> &dmz->chunk_lock --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&dmz->chunk_lock);
lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);
*** DEADLOCK ***
3 locks held by kswapd0/84:
#0: 00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
#1: 000000000f8208f5 (shrinker_rwsem){++++}, at: shrink_slab.part.47+0x3f/0x590
#2: 00000000cacefa54 (&type->s_umount_key#43){.+.+}, at: trylock_super+0x16/0x50
stack backtrace:
CPU: 7 PID: 84 Comm: kswapd0 Not tainted 4.18.0-rc1 #62
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
Call Trace:
dump_stack+0x85/0xcb
print_circular_bug.isra.36+0x1ce/0x1db
__lock_acquire+0x124e/0x1310
lock_acquire+0x9f/0x1f0
down_read_nested+0x43/0x70
xfs_free_eofblocks+0xa2/0x1e0
xfs_fs_destroy_inode+0xac/0x270
dispose_list+0x51/0x80
prune_icache_sb+0x52/0x70
super_cache_scan+0x127/0x1a0
shrink_slab.part.47+0x1bd/0x590
shrink_node+0x3b5/0x470
balance_pgdat+0x158/0x3b0
kswapd+0x1ba/0x600
kthread+0x11c/0x140
ret_from_fork+0x3a/0x50
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Fixes: 4218a9554653 ("dm zoned: use GFP_NOIO in I/O path")
Cc: <stable@vger.kernel.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Eliminate most holes in DM data structures that were modified by
commit 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()").
Also prevent structure members from unnecessarily spanning cache
lines.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
| |
Convert dm to embedded bio sets.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
Use the fmode_t that is passed to dm_blk_ioctl() rather than
inconsistently (varies across targets) drop it on the floor by
overriding it with the fmode_t stored in 'struct dm_dev'.
All the persistent reservation functions weren't using the fmode_t they
got back from .prepare_ioctl so remove them.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
| |
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The SCSI layer allows ZBC drives to have a smaller last runt zone. For
such a device, specifying the entire capacity for a dm-zoned target
table entry fails because the specified capacity is not aligned on a
device zone size indicated in the request queue structure of the
device.
Fix this problem by ignoring the last runt zone in the entry length
when seting up the dm-zoned target (ctr method) and when iterating table
entries of the target (iterate_devices method). This allows dm-zoned
users to still easily setup a target using the entire device capacity
(as mandated by dm-zoned) or the aligned capacity excluding the last
runt zone.
While at it, replace direct references to the device queue chunk_sectors
limit with calls to the accessor blk_queue_zone_sectors().
Reported-by: Peter Desnoyers <pjd@ccs.neu.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
Use GFP_NOIO for memory allocations in the I/O path. Other memory
allocations in the initialization path can use GFP_KERNEL.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
The value REQ_OP_FLUSH is only used by the block code for
request-based devices.
Remove the tests for REQ_OP_FLUSH from the bio-based dm-zoned-target.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The dm-zoned device mapper target provides transparent write access
to zoned block devices (ZBC and ZAC compliant block devices).
dm-zoned hides to the device user (a file system or an application
doing raw block device accesses) any constraint imposed on write
requests by the device, equivalent to a drive-managed zoned block
device model.
Write requests are processed using a combination of on-disk buffering
using the device conventional zones and direct in-place processing for
requests aligned to a zone sequential write pointer position.
A background reclaim process implemented using dm_kcopyd_copy ensures
that conventional zones are always available for executing unaligned
write requests. The reclaim process overhead is minimized by managing
buffer zones in a least-recently-written order and first targeting the
oldest buffer zones. Doing so, blocks under regular write access (such
as metadata blocks of a file system) remain stored in conventional
zones, resulting in no apparent overhead.
dm-zoned implementation focus on simplicity and on minimizing overhead
(CPU, memory and storage overhead). For a 14TB host-managed disk with
256 MB zones, dm-zoned memory usage per disk instance is at most about
3 MB and as little as 5 zones will be used internally for storing metadata
and performing buffer zone reclaim operations. This is achieved using
zone level indirection rather than a full block indirection system for
managing block movement between zones.
dm-zoned primary target is host-managed zoned block devices but it can
also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.
Zoned block devices can be formatted and checked for use with the dm-zoned
target using the dmzadm utility available at:
https://github.com/hgst/dm-zoned-tools
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to cleanup the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|