summaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* md: ensure child flush IO does not affect origin bio->bi_statusLi Nan2024-10-171-1/+23
| | | | | | | | | | | | | | | | | | | | | When a flush is issued to an RAID array, a child flush IO is created and issued for each member disk in the RAID array. Since commit b75197e86e6d ("md: Remove flush handling"), each child flush IO has been chained with the original bio. As a result, the failure of any child IO could modify the bi_status of the original bio, potentially impacting the upper-layer filesystem. Fix the issue by preventing child flush IO from altering the original bio->bi_status as before. However, this design introduces a known issue: in the event of a power failure, if a flush IO on a member disk fails, the upper layers may not be informed. This issue is not easy to fix and will not be addressed for the time being in this issue. Fixes: b75197e86e6d ("md: Remove flush handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240919063048.2887579-1-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
* md: Add new_level sysfs interfaceXiao Ni2024-09-061-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | Now reshape supports two ways: with backup file or without backup file. For the situation without backup file, it needs to change data offset. It doesn't need systemd service mdadm-grow-continue. So it can finish the reshape job in one process environment. It can know the new level from mdadm --grow command and can change to new level after reshape finishes. For the situation with backup file, it needs systemd service mdadm-grow-continue to monitor reshape progress. So there are two process envolved. One is mdadm --grow command whick kicks off reshape and wakes up mdadm-grow-continue service. The second process is the service, which doesn't know the new level from the first process. In kernel space mddev->new_level is used to record the new level when doing reshape. This patch adds a new interface to help mdadm update new_level and sync it to metadata. Then mdadm-grow-continue can read the right new_level. Commit log revised by Song Liu. Please refer to the link for more details. Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20240904235453.99120-1-xni@redhat.com Signed-off-by: Song Liu <song@kernel.org>
* md: Report failed arrays as broken in mdstatMateusz Kusiak2024-09-041-2/+7
| | | | | | | | | | | | | | Depending on if array has personality, it is either reported as active or inactive. This patch adds third status "broken" for arrays with personality that became inoperative. The reason is end users tend to assume that "active" indicates array is operational. Add "broken" state for inoperative arrays with personality and refactor the code. Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com> Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com Signed-off-by: Song Liu <song@kernel.org>
* Merge branch 'md-6.12-bitmap' into md-6.12Song Liu2024-08-281-55/+100
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From Yu Kuai (with minor changes by Song Liu): The background is that currently bitmap is using a global spin_lock, causing lock contention and huge IO performance degradation for all raid levels. However, it's impossible to implement a new lock free bitmap with current situation that md-bitmap exposes the internal implementation with lots of exported apis. Hence bitmap_operations is invented, to describe bitmap core implementation, and a new bitmap can be introduced with a new bitmap_operations, we only need to switch to the new one during initialization. And with this we can build bitmap as kernel module, but that's not our concern for now. This version was tested with mdadm tests and lvm2 tests. This set does not introduce new errors in these tests. * md-6.12-bitmap: (42 commits) md/md-bitmap: make in memory structure internal md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations md/md-bitmap: merge md_bitmap_free() into bitmap_operations md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation. md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations md/md-bitmap: merge md_bitmap_resize() into bitmap_operations md/md-bitmap: pass in mddev directly for md_bitmap_resize() md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations md/md-bitmap: merge bitmap_unplug() into bitmap_operations md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug() md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync() md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations ... Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operationsYu Kuai2024-08-271-1/+1
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-41-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operationsYu Kuai2024-08-271-1/+1
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-34-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge bitmap_unplug() into bitmap_operationsYu Kuai2024-08-271-1/+1
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-33-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()Yu Kuai2024-08-271-1/+1
| | | | | | | | | | | | | | | | | | | | Add a parameter 'bool sync' to distinguish them, and md_bitmap_unplug_async() won't be exported anymore, hence bitmap_operations only need one op to cover them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operationsYu Kuai2024-08-271-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. And while we're here, also fix coding style for bitmap_store(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge bitmap_write_all() into bitmap_operationsYu Kuai2024-08-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_status() into bitmap_operationsYu Kuai2024-08-271-5/+6
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operationsYu Kuai2024-08-271-2/+2
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_flush() into bitmap_operationsYu Kuai2024-08-271-1/+2
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-17-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_destroy() into bitmap_operationsYu Kuai2024-08-271-7/+8
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-16-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_load() into bitmap_operationsYu Kuai2024-08-271-3/+4
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-15-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: merge md_bitmap_create() into bitmap_operationsYu Kuai2024-08-271-3/+3
| | | | | | | | | | | | | | | | | | So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-14-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()Yu Kuai2024-08-271-21/+9
| | | | | | | | | | | | | | | | | | | | Other than internal api get_bitmap_from_slot(), all other places will set returned bitmap to mddev->bitmap. So move the setting of mddev->bitmap into md_bitmap_create() to simplify code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: introduce struct bitmap_operationsYu Kuai2024-08-271-0/+1
| | | | | | | | | | | | | | | | | | | | The structure is empty for now, and will be used in later patches to merge in bitmap operations, so that bitmap implementation won't be exposed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: add 'file_pages' into struct md_bitmap_statsYu Kuai2024-08-271-6/+11
| | | | | | | | | | | | | | | | | | There are no functional changes, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: add 'events_cleared' into struct md_bitmap_statsYu Kuai2024-08-271-2/+14
| | | | | | | | | | | | | | | | | | Also add a new helper to get events_cleared to avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md: use new helper md_bitmap_get_stats() in update_array_info()Yu Kuai2024-08-271-5/+7
| | | | | | | | | | | | | | | | | | There are no functional changes, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
| * md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()Yu Kuai2024-08-271-1/+28
| | | | | | | | | | | | | | | | | | | | There are no functional changes, and the new helper will be used in multiple places in following patches to avoid dereferencing bitmap directly. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
* | md: Remove flush handlingYu Kuai2024-08-281-123/+15
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For flush request, md has a special flush handling to merge concurrent flush request into single one, however, the whole mechanism is based on a disk level spin_lock 'mddev->lock'. And fsync can be called quite often in some user cases, for consequence, spin lock from IO fast path can cause performance degradation. Fortunately, the block layer already has flush handling to merge concurrent flush request, and it only acquires hctx level spin lock. (see details in blk-flush.c) This patch removes the flush handling in md, and converts to use general block layer flush handling in underlying disks. Flush test for 4 nvme raid10: start 128 threads to do fsync 100000 times, on arm64, see how long it takes. Test script: void* thread_func(void* arg) { int fd = *(int*)arg; for (int i = 0; i < FSYNC_COUNT; i++) { fsync(fd); } return NULL; } int main() { int fd = open("/dev/md0", O_RDWR); if (fd < 0) { perror("open"); exit(1); } pthread_t threads[THREADS]; struct timeval start, end; gettimeofday(&start, NULL); for (int i = 0; i < THREADS; i++) { pthread_create(&threads[i], NULL, thread_func, &fd); } for (int i = 0; i < THREADS; i++) { pthread_join(threads[i], NULL); } gettimeofday(&end, NULL); close(fd); long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec); printf("Elapsed time: %lld microseconds\n", elapsed); return 0; } Test result: about 10 times faster: Before this patch: 50943374 microseconds After this patch: 5096347 microseconds Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
* md: Don't flush sync_work in md_write_start()Yu Kuai2024-08-271-1/+0
| | | | | | | | | | | | | | Because flush sync_work may trigger mddev_suspend() if there are spares, and this should never be done in IO path because mddev_suspend() is used to wait for IO. This problem is found by code review. Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
* md-cluster: fix no recovery job when adding/re-adding a diskHeming Zhao2024-07-121-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit db5e653d7c9f ("md: delay choosing sync action to md_start_sync()") delays the start of the sync action. In a clustered environment, this will cause another node to first activate the spare disk and skip recovery. As a result, no nodes will perform recovery when a disk is added or re-added. Before db5e653d7c9f: ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb | sendmsg: METADATA_UPDATED + md_choose_sync_action process_metadata_update | remove_and_add_spares //node1 has not finished adding + call mddev->sync_work //the spare disk:do nothing md_start_sync starts md_do_sync md_do_sync + grabbed resync_lockres:DLM_LOCK_EX + do syncing job md_check_recovery sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... md_do_sync waiting to grab resync_lockres:EX ``` After db5e653d7c9f: (note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery starts md_start_sync, setting the INTR action will exacerbate the delay in node1 calling the md_do_sync function.) ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb | sendmsg: METADATA_UPDATED + calls mddev->sync_work process_metadata_update //node1 has not finished adding //the spare disk:do nothing md_start_sync + md_choose_sync_action | remove_and_add_spares + calls md_do_sync md_check_recovery md_update_sb sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... ... ... md_do_sync + grabbed resync_lockres:EX + raid1_sync_request skip sync under conf->fullsync:0 md_do_sync 1. waiting to grab resync_lockres:EX 2. when node1 could grab EX lock, node1 will skip resync under recovery_offset:MaxSector ``` How to trigger: ```(commands @node1) # to easily watch the recovery status echo 2000 > /proc/sys/dev/raid/speed_limit_max ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max" mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda mdadm --manage /dev/md0 --add /dev/sdc === "cat /proc/mdstat" on both node, there are no recovery action. === ``` How to fix: because md layer code logic is hard to restore for speeding up sync job on local node, we add new cluster msg to pending the another node to active disk. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
* md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctlYu Kuai2024-07-041-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") explained in the commit message that failed device must be reomoved from the personality first by md_check_recovery(), before it can be removed from the array. That's the reason the commit add the code to wait for MD_RECOVERY_NEEDED. However, this is not the case now, because remove_and_add_spares() is called directly from hot_remove_disk() from ioctl path, hence failed device(marked faulty) can be removed from the personality by ioctl. On the other hand, the commit introduced a performance problem that if MD_RECOVERY_NEEDED is set and the array is not running, ioctl will wait for 5s before it can return failure to user. Since the waiting is not needed now, fix the problem by removing the waiting. Fixes: 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com> Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
* md-cluster: Constify struct md_cluster_operationsChristophe JAILLET2024-07-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | 'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr
* md: Remove unneeded semicolonYang Li2024-07-041-1/+1
| | | | | | | | | | ./drivers/md/md.c:630:21-22: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9344 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240618010759.85416-1-yang.lee@linux.alibaba.com
* md: set md-specific flags for all queue limitsChristoph Hellwig2024-06-261-5/+9
| | | | | | | | | | | | | | | | The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them form the previous limits add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe2024-06-191-34/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge in last round of queue limits changes from Christoph. * for-6.11/block-limits: (26 commits) block: move the bounce flag into the features field block: move the skip_tagset_quiesce flag to queue_limits block: move the pci_p2pdma flag to queue_limits block: move the zone_resetall flag to queue_limits block: move the zoned flag into the features field block: move the poll flag to queue_limits block: move the dax flag to queue_limits block: move the nowait flag to queue_limits block: move the synchronous flag to queue_limits block: move the stable_writes flag to queue_limits block: move the io_stat flag setting to queue_limits block: move the add_random flag to queue_limits block: move the nonrot flag to queue_limits block: move cache control settings out of queue->flags block: remove blk_flush_policy block: freeze the queue in queue_attr_store nbd: move setting the cache control flags to __nbd_set_size virtio_blk: remove virtblk_update_cache_mode loop: fold loop_update_rotational into loop_reconfigure_limits loop: also use the default block size from an underlying block device ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: move the nowait flag to queue_limitsChristoph Hellwig2024-06-191-17/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the nowait flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: move the io_stat flag setting to queue_limitsChristoph Hellwig2024-06-191-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the io_stat flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Simplify md and dm to set the flag unconditionally instead of avoiding setting a simple flag for cases where it already is set by other means, which is a bit pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: move the nonrot flag to queue_limitsChristoph Hellwig2024-06-191-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: move cache control settings out of queue->flagsChristoph Hellwig2024-06-191-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe2024-06-141-55/+17
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...
| * block: move integrity information into queue_limitsChristoph Hellwig2024-06-141-55/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows to provide a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given that tiny size of it that seems like a worthwhile trade off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | md: pass in max_sectors for pers->sync_request()Yu Kuai2024-06-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | For different sync_action, sync_thread will use different max_sectors, see details in md_sync_max_sectors(), currently both md_do_sync() and pers->sync_request() in eatch iteration have to get the same max_sectors. Hence pass in max_sectors for pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com
* | md: factor out helpers for different sync_action in md_do_sync()Yu Kuai2024-06-121-50/+73
| | | | | | | | | | | | | | | | | | | | Make code cleaner by replacing if else if with switch, and it's more obvious now what is doing for each sync_action. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-11-yukuai1@huaweicloud.com
* | md: replace last_sync_action with new enum typeYu Kuai2024-06-121-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only difference is that "none" is removed and initial last_sync_action will be idle. On the one hand, this value is introduced by commit c4a395514516 ("MD: Remember the last sync operation that was performed"), and the usage described in commit message is not affected. On the other hand, last_sync_action is not used in mdadm or mdmon, and none of the tests that I can find. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-10-yukuai1@huaweicloud.com
* | md: use new helpers in md_do_sync()Yu Kuai2024-06-121-16/+5
| | | | | | | | | | | | | | | | | | | | Make code cleaner. and also use the action_name directly in kernel log: - "check" instead of "data-check" - "repair" instead of "requested-resync" Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-9-yukuai1@huaweicloud.com
* | md: don't fail action_store() if sync_thread is not registeredYu Kuai2024-06-121-52/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD_RECOVERY_RUNNING will always be set when trying to register a new sync_thread, however, if md_start_sync() turns out to do nothing, MD_RECOVERY_RUNNING will be cleared in this case. And during the race window, action_store() will return -EBUSY, which will cause some mdadm tests to fail. For example: The test 07reshape5intr will add a new disk to array, then start reshape: mdadm /dev/md0 --add /dev/xxx mdadm --grow /dev/md0 -n 3 And add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED, then during the race windown, mdadm --grow will fail. Fix the problem by waiting in action_store() during the race window, fail only if sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-8-yukuai1@huaweicloud.com
* | md: remove parameter check_seq for stop_sync_thread()Yu Kuai2024-06-121-15/+11
| | | | | | | | | | | | | | | | | | | | Caller will always set MD_RECOVERY_FROZEN if check_seq is true, and always clear MD_RECOVERY_FROZEN if check_seq is false, hence replace the parameter with test_bit() to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-7-yukuai1@huaweicloud.com
* | md: replace sysfs api sync_action with new helpersYu Kuai2024-06-121-42/+52
| | | | | | | | | | | | | | | | To get rid of extrem long if else if usage, and make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-6-yukuai1@huaweicloud.com
* | md: factor out helper to start reshape from action_store()Yu Kuai2024-06-121-24/+41
| | | | | | | | | | | | | | | | | | There are no functional changes, just to make code cleaner and prepare for following refactor. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-5-yukuai1@huaweicloud.com
* | md: add new helpers for sync_actionYu Kuai2024-06-121-0/+79
| | | | | | | | | | | | | | | | | | The new helpers will get current sync_action of the array, will be used in later patches to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-4-yukuai1@huaweicloud.com
* | md: make md_flush_request() more readableLi Nan2024-06-101-15/+13
| | | | | | | | | | | | | | | | | | | | | | Setting bio to NULL and checking 'if(!bio)' is redundant and looks strange, just consolidate them into one condition. There are no functional changes. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240528203149.2383260-1-linan666@huaweicloud.com
* | md: fix deadlock between mddev_suspend and flush bioLi Nan2024-06-101-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Deadlock occurs when mddev is being suspended while some flush bio is in progress. It is a complex issue. T1. the first flush is at the ending stage, it clears 'mddev->flush_bio' and tries to submit data, but is blocked because mddev is suspended by T4. T2. the second flush sets 'mddev->flush_bio', and attempts to queue md_submit_flush_data(), which is already running (T1) and won't execute again if on the same CPU as T1. T3. the third flush inc active_io and tries to flush, but is blocked because 'mddev->flush_bio' is not NULL (set by T2). T4. mddev_suspend() is called and waits for active_io dec to 0 which is inc by T3. T1 T2 T3 T4 (flush 1) (flush 2) (third 3) (suspend) md_submit_flush_data mddev->flush_bio = NULL; . . md_flush_request . mddev->flush_bio = bio . queue submit_flushes . . . . md_handle_request . . active_io + 1 . . md_flush_request . . wait !mddev->flush_bio . . . . mddev_suspend . . wait !active_io . . . submit_flushes . queue_work md_submit_flush_data . //md_submit_flush_data is already running (T1) . md_handle_request wait resume The root issue is non-atomic inc/dec of active_io during flush process. active_io is dec before md_submit_flush_data is queued, and inc soon after md_submit_flush_data() run. md_flush_request active_io + 1 submit_flushes active_io - 1 md_submit_flush_data md_handle_request active_io + 1 make_request active_io - 1 If active_io is dec after md_handle_request() instead of within submit_flushes(), make_request() can be called directly intead of md_handle_request() in md_submit_flush_data(), and active_io will only inc and dec once in the whole flush process. Deadlock will be fixed. Additionally, the only difference between fixing the issue and before is that there is no return error handling of make_request(). But after previous patch cleaned md_write_start(), make_requst() only return error in raid5_make_request() by dm-raid, see commit 41425f96d7aa ("dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape)". Since dm always splits data and flush operation into two separate io, io size of flush submitted by dm always is 0, make_request() will not be called in md_submit_flush_data(). To prevent future modifications from introducing issues, add WARN_ON to ensure make_request() no error is returned in this context. Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-3-linan666@huaweicloud.com
* | md: change the return value type of md_write_start to voidLi Nan2024-06-101-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()") aborted md_write_start() with false when mddev is suspended, which fixed a deadlock if calling mddev_suspend() with holding reconfig_mutex(). Since mddev_suspend() now includes lockdep_assert_not_held(), it no longer holds the reconfig_mutex. This makes previous abort unnecessary. Now, remove unnecessary abort and change function return value to void. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-2-linan666@huaweicloud.com
* | md: do not delete safemode_timer in mddev_suspendLi Nan2024-06-101-1/+0
|/ | | | | | | | | | | | | | | | | | | | | | The deletion of safemode_timer in mddev_suspend() is redundant and potentially harmful now. If timer is about to be woken up but gets deleted, 'in_sync' will remain 0 until the next write, causing array to stay in the 'active' state instead of transitioning to 'clean'. Commit 0d9f4f135eb6 ("MD: Add del_timer_sync to mddev_suspend (fix nasty panic))" introduced this deletion for dm, because if timer fired after dm is destroyed, the resource which the timer depends on might have been freed. However, commit 0dd84b319352 ("md: call __md_stop_writes in md_stop") added __md_stop_writes() to md_stop(), which is called before freeing resource. Timer is deleted in __md_stop_writes(), and the origin issue is resolved. Therefore, delete safemode_timer can be removed safely now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240508092053.1447930-1-linan666@huaweicloud.com
* md: Revert "md: Fix overflow in is_mddev_idle"Li Nan2024-05-071-4/+3
| | | | | | | | | | | | | | This reverts commit 3f9f231236ce7e48780d8a4f1f8cb9fae2df1e4e. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240507023103.781816-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>