summaryrefslogtreecommitdiffstats
path: root/drivers/md/raid10.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* md/raid10: improve discard request for far layoutXiao Ni2020-09-251-23/+63
| | | | | | | | | | | | For far layout, the discard region is not continuous on disks. So it needs far copies r10bio to cover all regions. It needs a way to know all r10bios have finish or not. Similar with raid10_sync_request, only the first r10bio master_bio records the discard bio. Other r10bios master_bio record the first r10bio. The first r10bio can finish after other r10bios finish and then return the discard bio. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid10: improve raid10 discard requestXiao Ni2020-09-251-1/+255
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now the discard request is split by chunk size. So it takes a long time to finish mkfs on disks which support discard function. This patch improve handling raid10 discard request. It uses the similar way with patch 29efc390b (md/md0: optimize raid0 discard handling). But it's a little complex than raid0. Because raid10 has different layout. If raid10 is offset layout and the discard request is smaller than stripe size. There are some holes when we submit discard bio to underlayer disks. For example: five disks (disk1 - disk5) D01 D02 D03 D04 D05 D05 D01 D02 D03 D04 D06 D07 D08 D09 D10 D10 D06 D07 D08 D09 The discard bio just wants to discard from D03 to D10. For disk3, there is a hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03 is a chunk, raid10_write_request can handle one chunk perfectly. So the part that is not aligned with stripe size is still handled by raid10_write_request. If reshape is running when discard bio comes and the discard bio spans the reshape position, raid10_write_request is responsible to handle this discard bio. I did a test with this patch set. Without patch: time mkfs.xfs /dev/md0 real4m39.775s user0m0.000s sys0m0.298s With patch: time mkfs.xfs /dev/md0 real0m0.105s user0m0.000s sys0m0.007s nvme3n1 259:1 0 477G 0 disk └─nvme3n1p1 259:10 0 50G 0 part nvme4n1 259:2 0 477G 0 disk └─nvme4n1p1 259:11 0 50G 0 part nvme5n1 259:6 0 477G 0 disk └─nvme5n1p1 259:12 0 50G 0 part nvme2n1 259:9 0 477G 0 disk └─nvme2n1p1 259:15 0 50G 0 part nvme0n1 259:13 0 477G 0 disk └─nvme0n1p1 259:14 0 50G 0 part Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid10: pull codes that wait for blocked dev into one functionXiao Ni2020-09-251-51/+67
| | | | | | | | The following patch will reuse these logics, so pull the same codes into one function. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid10: extend r10bio devs to raid disksXiao Ni2020-09-251-5/+4
| | | | | | | | | | Now it allocs r10bio->devs[conf->copies]. Discard bio needs to submit to all member disks and it needs to use r10bio. So extend to r10bio->devs[geo.raid_disks]. Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: Simplify code with existing definition RESYNC_SECTORS in raid10.cZhen Lei2020-09-251-4/+4
| | | | | | | | | | #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9) "RESYNC_BLOCK_SIZE/512" is equal to "RESYNC_BLOCK_SIZE >> 9", replace it with RESYNC_SECTORS. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* block: lift setting the readahead size into the block layerChristoph Hellwig2020-09-241-23/+1
| | | | | | | | | | | | | | | | | | Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: update the optimal I/O size on reshapeChristoph Hellwig2020-09-241-8/+14
| | | | | | | | | | | | | The raid5 and raid10 drivers currently update the read-ahead size, but not the optimal I/O size on reshape. To prepare for deriving the read-ahead size from the optimal I/O size make sure it is updated as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-blockLinus Torvalds2020-08-051-6/+14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: - NVMe: - ZNS support (Aravind, Keith, Matias, Niklas) - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David, Dongli, Max, Sagi) - null_blk zone capacity support (Aravind) - MD: - raid5/6 fixes (ChangSyun) - Warning fixes (Damien) - raid5 stripe fixes (Guoqing, Song, Yufen) - sysfs deadlock fix (Junxiao) - raid10 deadlock fix (Vitaly) - struct_size conversions (Gustavo) - Set of bcache updates/fixes (Coly) * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits) md/raid5: Allow degraded raid6 to do rmw md/raid5: Fix Force reconstruct-write io stuck in degraded raid5 raid5: don't duplicate code for different paths in handle_stripe raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show md: print errno in super_written md/raid5: remove the redundant setting of STRIPE_HANDLE md: register new md sysfs file 'uuid' read-only md: fix max sectors calculation for super 1.0 nvme-loop: remove extra variable in create ctrl nvme-loop: set ctrl state connecting after init nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths nvme-multipath: fix logic for non-optimized paths nvme-rdma: fix controller reset hang during traffic nvme-tcp: fix controller reset hang during traffic nvmet: introduce the passthru Kconfig option nvmet: introduce the passthru configfs interface nvmet: Add passthru enable/disable helpers nvmet: add passthru code to process commands nvme: export nvme_find_get_ns() and nvme_put_ns() nvme: introduce nvme_ctrl_get_by_path() ...
| * md/raid10: avoid deadlock on recovery.Vitaly Mayatskikh2020-07-221-3/+11
| | | | | | | | | | | | | | | | | | | | | | When disk failure happens and the array has a spare drive, resync thread kicks in and starts to refill the spare. However it may get blocked by a retry thread that resubmits failed IO to a mirror and itself can get blocked on a barrier raised by the resync thread. Acked-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| * md: raid10: Fix compilation warningDamien Le Moal2020-07-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Remove the if statement around the call to sysfs_link_rdev() in raid10_start_reshape() to avoid the compilation warning: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| * md: fix deadlock causing by sysfs_notifyJunxiao Bi2020-07-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following deadlock was captured. The first process is holding 'kernfs_mutex' and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device, this pending bio list would be flushed by second process 'md127_raid1', but it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace sysfs_notify() can fix it. There were other sysfs_notify() invoked from io path, removed all of them. PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file" #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06 #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6 #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs] #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs] #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs] #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs] #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs] #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs] #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7 #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93 #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968 #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2 #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445 #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a #16 [ffffb87c4df37958] new_slab at ffffffff9a251430 #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85 #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89 #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10 #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854 #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377 #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118 #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83 #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619 #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566 #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787 #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949 #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905 RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890 RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0 R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1" #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06 #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013 #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83 #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696 #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5 #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1] #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348 #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005 #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* | writeback: remove bdi->congested_fnChristoph Hellwig2020-07-091-26/+0
|/ | | | | | | | | | | | | Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver that isn't useful in stack setup either. So remove the dead congested_fn stacking infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [axboe: fixup unused variables in bcache/request.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: rename generic_make_request to submit_bio_noacctChristoph Hellwig2020-07-011-14/+14
| | | | | | | | | generic_make_request has always been very confusingly misnamed, so rename it to submit_bio_noacct to make it clear that it is submit_bio minus accounting and a few checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md/raid10: prevent access of uninitialized resync_pages offsetJohn Pittman2019-11-121-1/+1
| | | | | | | | | | | | | | | | | Due to unneeded multiplication in the out_free_pages portion of r10buf_pool_alloc(), when using a 3-copy raid10 layout, it is possible to access a resync_pages offset that has not been initialized. This access translates into a crash of the system within resync_free_pages() while passing a bad pointer to put_page(). Remove the multiplication, preventing access to the uninitialized area. Fixes: f0250618361db ("md: raid10: don't use bio's vec table to manage resync pages") Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: John Pittman <jpittman@redhat.com> Suggested-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: improve handling of bio with REQ_PREFLUSH in md_flush_request()David Jeffery2019-10-251-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. Fixes: 2bc13b83e629 ("md: batch flush requests.") Cc: stable@vger.kernel.org # # v4.19+ Cc: NeilBrown <neilb@suse.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: allow last device to be forcibly removed from RAID1/RAID10.Guoqing Jiang2019-08-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | When the 'last' device in a RAID1 or RAID10 reports an error, we do not mark it as failed. This would serve little purpose as there is no risk of losing data beyond that which is obviously lost (as there is with RAID5), and there could be other sectors on the device which are readable, and only readable from this device. This in general this maximises access to data. However the current implementation also stops an admin from removing the last device by direct action. This is rarely useful, but in many case is not harmful and can make automation easier by removing special cases. Also, if an attempt to write metadata fails the device must be marked as faulty, else an infinite loop will result, attempting to update the metadata on all non-faulty devices. So add 'fail_last_dev' member to 'struct mddev', then we can bypasses the 'last disk' checks for RAID1 and RAID10, and control the behavior per array by change sysfs node. Signed-off-by: NeilBrown <neilb@suse.de> [add sysfs node for fail_last_dev by Guoqing] Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid10: end bio when the device faultyYufen Yu2019-08-071-12/+14
| | | | | | | | Just like raid1, we do not queue write error bio to retry write and acknowlege badblocks, when the device is faulty. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid10: read balance chooses idlest disk for SSDGuoqing Jiang2019-06-151-12/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | Andy reported that raid10 array with SSD disks has poor read performance. Compared with raid1, RAID-1 can be 3x faster than RAID-10 sometimes [1]. The thing is that raid10 chooses the low distance disk for read request, however, the approach doesn't work well for SSD device since it doesn't have spindle like HDD, we should just read from the SSD which has less pending IO like commit 9dedf60313fa4 ("md/raid1: read balance chooses idlest disk for SSD"). So this commit selects the idlest SSD disk for read if array has none rotational disk, otherwise, read_balance uses the previous distance priority algorithm. With the change, the performance of raid10 gets increased largely per Andy's test [2]. [1]. https://marc.info/?l=linux-raid&m=155915890004761&w=2 [2]. https://marc.info/?l=linux-raid&m=155990654223786&w=2 Tested-by: Andy Smith <andy@strugglers.net> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: raid1-10: Unify r{1,10}bio_pool_freeMarcos Paulo de Souza2019-06-151-8/+3
| | | | | | | Avoiding duplicated code, since they just execute a kfree. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: raid10: Use struct_size() in kmalloc()Gustavo A. R. Silva2019-06-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kmalloc(size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* drivers: md: Unify common definitions of raid1 and raid10Marcos Paulo de Souza2019-06-151-26/+1
| | | | | | | | These definitions are being moved to raid1-10.c. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47Thomas Gleixner2019-05-241-10/+1
| | | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version you should have received a copy of the gnu general public license for example usr src linux copying if not write to the free software foundation inc 675 mass ave cambridge ma 02139 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 20 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: Fix failed allocation of md_register_threadAditya Pakki2019-03-121-0/+2
| | | | | | | | | | | | | | mddev->sync_thread can be set to NULL on kzalloc failure downstream. The patch checks for such a scenario and frees allocated resources. Committer node: Added similar fix to raid5.c, as suggested by Guoqing. Cc: stable@vger.kernel.org # v3.16+ Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: Song Liu <songliubraving@fb.com>
* It's wrong to add len to sector_nr in raid10 reshape twiceXiao Ni2019-03-121-1/+0
| | | | | | | | | | | In reshape_request it already adds len to sector_nr already. It's wrong to add len to sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data corruption. Cc: stable@vger.kernel.org # v3.16+ Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: fix raid10 hang issue caused by barrierGuoqing Jiang2018-12-201-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When both regular IO and resync IO happen at the same time, and if we also need to split regular. Then we can see tasks hang due to barrier. 1. resync thread [ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds. [ 1463.757207] Not tainted 4.19.5-1-default #1 [ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1463.757212] md1_resync D 0 5215 2 0x80000000 [ 1463.757216] Call Trace: [ 1463.757223] ? __schedule+0x29a/0x880 [ 1463.757231] ? raise_barrier+0x8d/0x140 [raid10] [ 1463.757236] schedule+0x78/0x110 [ 1463.757243] raise_barrier+0x8d/0x140 [raid10] [ 1463.757248] ? wait_woken+0x80/0x80 [ 1463.757257] raid10_sync_request+0x1f6/0x1e30 [raid10] [ 1463.757265] ? _raw_spin_unlock_irq+0x22/0x40 [ 1463.757284] ? is_mddev_idle+0x125/0x137 [md_mod] [ 1463.757302] md_do_sync.cold.78+0x404/0x969 [md_mod] [ 1463.757311] ? wait_woken+0x80/0x80 [ 1463.757336] ? md_rdev_init+0xb0/0xb0 [md_mod] [ 1463.757351] md_thread+0xe9/0x140 [md_mod] [ 1463.757358] ? _raw_spin_unlock_irqrestore+0x2e/0x60 [ 1463.757364] ? __kthread_parkme+0x4c/0x70 [ 1463.757369] kthread+0x112/0x130 [ 1463.757374] ? kthread_create_worker_on_cpu+0x40/0x40 [ 1463.757380] ret_from_fork+0x3a/0x50 2. regular IO [ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds. [ 1463.760683] Not tainted 4.19.5-1-default #1 [ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1463.760687] kworker/0:8 D 0 5367 2 0x80000000 [ 1463.760718] Workqueue: md submit_flushes [md_mod] [ 1463.760721] Call Trace: [ 1463.760731] ? __schedule+0x29a/0x880 [ 1463.760741] ? wait_barrier+0xdd/0x170 [raid10] [ 1463.760746] schedule+0x78/0x110 [ 1463.760753] wait_barrier+0xdd/0x170 [raid10] [ 1463.760761] ? wait_woken+0x80/0x80 [ 1463.760768] raid10_write_request+0xf2/0x900 [raid10] [ 1463.760774] ? wait_woken+0x80/0x80 [ 1463.760778] ? mempool_alloc+0x55/0x160 [ 1463.760795] ? md_write_start+0xa9/0x270 [md_mod] [ 1463.760801] ? try_to_wake_up+0x44/0x470 [ 1463.760810] raid10_make_request+0xc1/0x120 [raid10] [ 1463.760816] ? wait_woken+0x80/0x80 [ 1463.760831] md_handle_request+0x121/0x190 [md_mod] [ 1463.760851] md_make_request+0x78/0x190 [md_mod] [ 1463.760860] generic_make_request+0x1c6/0x470 [ 1463.760870] raid10_write_request+0x77a/0x900 [raid10] [ 1463.760875] ? wait_woken+0x80/0x80 [ 1463.760879] ? mempool_alloc+0x55/0x160 [ 1463.760895] ? md_write_start+0xa9/0x270 [md_mod] [ 1463.760904] raid10_make_request+0xc1/0x120 [raid10] [ 1463.760910] ? wait_woken+0x80/0x80 [ 1463.760926] md_handle_request+0x121/0x190 [md_mod] [ 1463.760931] ? _raw_spin_unlock_irq+0x22/0x40 [ 1463.760936] ? finish_task_switch+0x74/0x260 [ 1463.760954] submit_flushes+0x21/0x40 [md_mod] So resync io is waiting for regular write io to complete to decrease nr_pending (conf->barrier++ is called before waiting). The regular write io splits another bio after call wait_barrier which call nr_pending++, then the splitted bio would continue with raid10_write_request -> wait_barrier, so the splitted bio has to wait for barrier to be zero, then deadlock happens as follows. resync io regular io raise_barrier wait_barrier generic_make_request wait_barrier To resolve the issue, we need to call allow_barrier to decrease nr_pending before generic_make_request since regular IO is not issued to underlying devices, and wait_barrier is called again to ensure no internal IO happening. Fixes: fc9977dd069e ("md/raid10: simplify the splitting of requests.") Reported-and-tested-by: Siniša Bandin <sinisa@4net.rs> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* raid10: refactor common wait code from regular read/write requestGuoqing Jiang2018-12-201-47/+25
| | | | | | | | | | Both raid10_read_request and raid10_write_request share the same code at the beginning of them, so introduce regular_request_wait to clean up code, and call it in both request functions. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: introduce resync_info_get interface for sanity checkGuoqing Jiang2018-10-181-1/+7
| | | | | | | | | | Since the resync region from suspend_info means one node is reshaping this area, so the position of reshape_progress should be included in the area. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/raid10: support add disk under grow modeGuoqing Jiang2018-10-181-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For clustered raid10 scenario, we need to let all the nodes know about that a new disk is added to the array, and the reshape caused by add new member just need to be happened in one node, but other nodes should know about the change. Since reshape means read data from somewhere (which is already used by array) and write data to unused region. Obviously, it is awful if one node is reading data from address while another node is writing to the same address. Considering we have implemented suspend writes in the resyncing area, so we can just broadcast the reading address to other nodes to avoid the trouble. For master node, it would call reshape_request then update sb during the reshape period. To avoid above trouble, we call resync_info_update to send RESYNC message in reshape_request. Then from slave node's view, it receives two type messages: 1. RESYNCING message Slave node add the address (where master node reading data from) to suspend list. 2. METADATA_UPDATED message Once slave nodes know the reshaping is started in master node, it is time to update reshape position and call start_reshape to follow master node's step. After reshape is done, only reshape position is need to be updated, so the majority task of reshaping is happened on the master node. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/raid10: resize all the bitmaps before start reshapeGuoqing Jiang2018-10-181-3/+38
| | | | | | | | | | | | | | | | | | | | | To support add disk under grow mode, we need to resize all the bitmaps of each node before reshape, so that we can ensure all nodes have the same view of the bitmap of the clustered raid. So after the master node resized the bitmap, it broadcast a message to other slave nodes, and it checks the size of each bitmap are same or not by compare pages. We can only continue the reshaping after all nodes update the bitmap to the same size (by checking the pages), otherwise revert bitmap size to previous value. The resize_bitmaps interface and BITMAP_RESIZE message are introduced in md-cluster.c for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* MD: fix invalid stored role for a disk - try2Shaohua Li2018-10-151-0/+1
| | | | | | | | | | | Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear hotadd. Let's only fix the role for disks in raid1/10. Based on Guoqing's original patch. Reported-by: kernel test robot <rong.a.chen@intel.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid10: Fix raid10 replace hang when new added disk faultyAlex Wu2018-09-281-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [Symptom] Resync thread hang when new added disk faulty during replacing. [Root Cause] In raid10_sync_request(), we expect to issue a bio with callback end_sync_read(), and a bio with callback end_sync_write(). In normal situation, we will add resyncing sectors into mddev->recovery_active when raid10_sync_request() returned, and sub resynced sectors from mddev->recovery_active when end_sync_write() calls end_sync_request(). If new added disk, which are replacing the old disk, is set faulty, there is a race condition: 1. In the first rcu protected section, resync thread did not detect that mreplace is set faulty and pass the condition. 2. In the second rcu protected section, mreplace is set faulty. 3. But, resync thread will prepare the read object first, and then check the write condition. 4. It will find that mreplace is set faulty and do not have to prepare write object. This cause we add resync sectors but never sub it. [How to Reproduce] This issue can be easily reproduced by the following steps: mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd] mdadm /dev/md0 -a /dev/sde mdadm /dev/md0 --replace /dev/sdd sleep 1 mdadm /dev/md0 -f /dev/sde [How to Fix] This issue can be fixed by using local variables to record the result of test conditions. Once the conditions are satisfied, we can make sure that we need to issue a bio for read and a bio for write. Previous 'commit 24afd80d99f8 ("md/raid10: handle recovery of replacement devices.")' will also check whether bio is NULL, but leave the comment saying that it is a pointless test. So we remove this dummy check. Reported-by: Alex Chen <alexchen@synology.com> Reviewed-by: Allen Peng <allenpeng@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Alex Wu <alexwu@synology.com> Signed-off-by: Shaohua Li <shli@fb.com>
* RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0Xiao Ni2018-09-011-1/+4
| | | | | | | | | | | | | | | | | | | In raid10 reshape_request it gets max_sectors in read_balance. If the underlayer disks have bad blocks, the max_sectors is less than last. It will call goto read_more many times. It calls raise_barrier(conf, sectors_done != 0) every time. In this condition sectors_done is not 0. So the value passed to the argument force of raise_barrier is true. In raise_barrier it checks conf->barrier when force is true. If force is true and conf->barrier is 0, it panic. In this case reshape_request submits bio to under layer disks. And in the callback function of the bio it calls lower_barrier. If the bio finishes before calling raise_barrier again, it can trigger the BUG_ON. Add one pair of raise_barrier/lower_barrier to fix this bug. Signed-off-by: Xiao Ni <xni@redhat.com> Suggested-by: Neil Brown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2018-08-191-27/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - a new driver for Rohm BU21029 touch controller - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free - updates to Atmel, eeti. pxrc and iforce drivers - assorted driver cleanups and fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits) MAINTAINERS: Add PhoenixRC Flight Controller Adapter Input: do not use WARN() in input_alloc_absinfo() Input: mark expected switch fall-throughs Input: raydium_i2c_ts - use true and false for boolean values Input: evdev - switch to bitmap API Input: gpio-keys - switch to bitmap_zalloc() Input: elan_i2c_smbus - cast sizeof to int for comparison bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free() md: Avoid namespace collision with bitmap API dm: Avoid namespace collision with bitmap API Input: pm8941-pwrkey - add resin entry Input: pm8941-pwrkey - abstract register offsets and event code Input: iforce - reorganize joystick configuration lists Input: atmel_mxt_ts - move completion to after config crc is updated Input: atmel_mxt_ts - don't report zero pressure from T9 Input: atmel_mxt_ts - zero terminate config firmware file Input: atmel_mxt_ts - refactor config update code to add context struct Input: atmel_mxt_ts - config CRC may start at T71 Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE ...
| * md: Avoid namespace collision with bitmap APIAndy Shevchenko2018-08-021-27/+25
| | | | | | | | | | | | | | | | | | | | | | | | bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
* | md/raid10: fix that replacement cannot complete recovery after reassembleBingJing Chang2018-06-281-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During assemble, the spare marked for replacement is not checked. conf->fullsync cannot be updated to be 1. As a result, recovery will treat it as a clean array. All recovering sectors are skipped. Original device is replaced with the not-recovered spare. mdadm -C /dev/md0 -l10 -n4 -pn2 /dev/loop[0123] mdadm /dev/md0 -a /dev/loop4 mdadm /dev/md0 --replace /dev/loop0 mdadm -S /dev/md0 # stop array during recovery mdadm -A /dev/md0 /dev/loop[01234] After reassemble, you can see recovery go on, but it completes immediately. In fact, recovery is not actually processed. To solve this problem, we just add the missing logics for replacment spares. (In raid1.c or raid5.c, they have already been checked.) Reported-by: Alex Chen <alexchen@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | treewide: kzalloc() -> kcalloc()Kees Cook2018-06-131-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
* | treewide: kmalloc() -> kmalloc_array()Kees Cook2018-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
* | Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2018-06-091-4/+6
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MD updates from Shaohua Li: "A few fixes of MD for this merge window. Mostly bug fixes: - raid5 stripe batch fix from Amy - Read error handling for raid1 FailFast device from Gioh - raid10 recovery NULL pointer dereference fix from Guoqing - Support write hint for raid5 stripe cache from Mariusz - Fixes for device hot add/remove from Neil and Yufen - Improve flush bio scalability from Xiao" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: MD: fix lock contention for flush bios md/raid5: Assigning NULL to sh->batch_head before testing bit R5_Overlap of a stripe md/raid1: add error handling of read error from FailFast device md: fix NULL dereference of mddev->pers in remove_and_add_spares() raid5: copy write hint from origin bio to stripe md: fix two problems with setting the "re-add" device state. raid10: check bio in r10buf_pool_free to void NULL pointer dereference md: fix an error code format and remove unsed bio_sector
| * | raid10: check bio in r10buf_pool_free to void NULL pointer dereferenceGuoqing Jiang2018-05-011-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For recovery case, r10buf_pool_alloc only allocates 2 bios, so we can't access more than 2 bios in r10buf_pool_free. Otherwise, we can see NULL pointer dereference as follows: [ 98.347009] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [ 98.355783] IP: r10buf_pool_free+0x38/0xe0 [raid10] [...] [ 98.543734] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 98.550161] CR2: 0000000000000050 CR3: 000000089500a001 CR4: 00000000001606f0 [ 98.558145] Call Trace: [ 98.560881] <IRQ> [ 98.563136] put_buf+0x19/0x20 [raid10] [ 98.567426] end_sync_request+0x6b/0x70 [raid10] [ 98.572591] end_sync_write+0x9b/0x160 [raid10] [ 98.577662] blk_update_request+0x78/0x2c0 [ 98.582254] scsi_end_request+0x2c/0x1e0 [scsi_mod] [ 98.587719] scsi_io_completion+0x22f/0x610 [scsi_mod] [ 98.593472] blk_done_softirq+0x8e/0xc0 [ 98.597767] __do_softirq+0xde/0x2b3 [ 98.601770] irq_exit+0xae/0xb0 [ 98.605285] do_IRQ+0x81/0xd0 [ 98.608606] common_interrupt+0x7d/0x7d [ 98.612898] </IRQ> So we need to check the bio is valid or not before the bio is used in r10buf_pool_free. Another workable way is to free 2 bios for recovery case just like r10buf_pool_alloc. Fixes: f0250618361d ("md: raid10: don't use bio's vec table to manage resync pages") Reported-by: Alexis Castilla <pencerval@gmail.com> Tested-by: Alexis Castilla <pencerval@gmail.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | md: fix an error code format and remove unsed bio_sectorYufen Yu2018-05-011-1/+1
| |/ | | | | | | | | Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
* / md: convert to bioset_init()/mempool_init()Kent Overstreet2018-05-301-31/+29
|/ | | | | | | Convert md to embedded bio sets. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-blockLinus Torvalds2018-04-051-3/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: "It's a pretty quiet round this time, which is nice. This contains: - series from Bart, cleaning up the way we set/test/clear atomic queue flags. - series from Bart, fixing races between gendisk and queue registration and removal. - set of bcache fixes and improvements from various folks, by way of Michael Lyle. - set of lightnvm updates from Matias, most of it being the 1.2 to 2.0 transition. - removal of unused DIO flags from Nikolay. - blk-mq/sbitmap memory ordering fixes from Omar. - divide-by-zero fix for BFQ from Paolo. - minor documentation patches from Randy. - timeout fix from Tejun. - Alpha "can't write a char atomically" fix from Mikulas. - set of NVMe fixes by way of Keith. - bsg and bsg-lib improvements from Christoph. - a few sed-opal fixes from Jonas. - cdrom check-disk-change deadlock fix from Maurizio. - various little fixes, comment fixes, etc from various folks" * tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits) blk-mq: Directly schedule q->timeout_work when aborting a request blktrace: fix comment in blktrace_api.h lightnvm: remove function name in strings lightnvm: pblk: remove some unnecessary NULL checks lightnvm: pblk: don't recover unwritten lines lightnvm: pblk: implement 2.0 support lightnvm: pblk: implement get log report chunk lightnvm: pblk: rename ppaf* to addrf* lightnvm: pblk: check for supported version lightnvm: implement get log report chunk helpers lightnvm: make address conversions depend on generic device lightnvm: add support for 2.0 address format lightnvm: normalize geometry nomenclature lightnvm: complete geo structure with maxoc* lightnvm: add shorten OCSSD version in geo lightnvm: add minor version to generic geometry lightnvm: simplify geometry structure lightnvm: pblk: refactor init/exit sequences lightnvm: Avoid validation of default op value lightnvm: centralize permission check for lightnvm ioctl ...
| * block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()Bart Van Assche2018-03-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch has been generated as follows: for verb in set_unlocked clear_unlocked set clear; do replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \ $(git grep -lw queue_flag_${verb} drivers block/bsg*) done Except for protecting all queue flag changes with the queue lock this patch does not change any functionality. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | md: fix a potential deadlock of raid5/raid10 reshapeBingJing Chang2018-02-251-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a potential deadlock if mount/umount happens when raid5_finish_reshape() tries to grow the size of emulated disk. How the deadlock happens? 1) The raid5 resync thread finished reshape (expanding array). 2) The mount or umount thread holds VFS sb->s_umount lock and tries to write through critical data into raid5 emulated block device. So it waits for raid5 kernel thread handling stripes in order to finish it I/Os. 3) In the routine of raid5 kernel thread, md_check_recovery() will be called first in order to reap the raid5 resync thread. That is, raid5_finish_reshape() will be called. In this function, it will try to update conf and call VFS revalidate_disk() to grow the raid5 emulated block device. It will try to acquire VFS sb->s_umount lock. The raid5 kernel thread cannot continue, so no one can handle mount/ umount I/Os (stripes). Once the write-through I/Os cannot be finished, mount/umount will not release sb->s_umount lock. The deadlock happens. The raid5 kernel thread is an emulated block device. It is responible to handle I/Os (stripes) from upper layers. The emulated block device should not request any I/Os on itself. That is, it should not call VFS layer functions. (If it did, it will try to acquire VFS locks to guarantee the I/Os sequence.) So we have the resync thread to send resync I/O requests and to wait for the results. For solving this potential deadlock, we can put the size growth of the emulated block device as the final step of reshape thread. 2017/12/29: Thanks to Guoqing Jiang <gqjiang@suse.com>, we confirmed that there is the same deadlock issue in raid10. It's reproducible and can be fixed by this patch. For raid10.c, we can remove the similar code to prevent deadlock as well since they has been called before. Reported-by: Alex Wu <alexwu@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md-cluster: choose correct label when clustered layout is not supportedLidong Zhong2018-02-251-1/+1
| | | | | | | | | | | | | | | | r10conf is already successfully allocated before checking the layout Signed-off-by: Lidong Zhong <lzhong@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md raid10: fix NULL deference in handle_write_completed()Yufen Yu2018-02-191-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case of 'recover', an r10bio with R10BIO_WriteError & R10BIO_IsRecover will be progressed by handle_write_completed(). This function traverses all r10bio->devs[copies]. If devs[m].repl_bio != NULL, it thinks conf->mirrors[dev].replacement is also not NULL. However, this is not always true. When there is an rdev of raid10 has replacement, then each r10bio ->devs[m].repl_bio != NULL in conf->r10buf_pool. However, in 'recover', even if corresponded replacement is NULL, it doesn't clear r10bio ->devs[m].repl_bio, resulting in replacement NULL deference. This bug was introduced when replacement support for raid10 was added in Linux 3.3. As NeilBrown suggested: Elsewhere the determination of "is this device part of the resync/recovery" is made by resting bio->bi_end_io. If this is end_sync_write, then we tried to write here. If it is NULL, then we didn't try to write. Fixes: 9ad1aefc8ae8 ("md/raid10: Handle replacement devices during resync.") Cc: stable (V3.3+) Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | raid10: change the size of resync window for clustered raidGuoqing Jiang2018-02-171-1/+1
|/ | | | | | | | | | To align with raid1's resync window, we need to set the resync window of raid10 to 32M as well. Fixes: 8db87912c9a8 ("md-cluster: Use a small window for raid10 resync") Reported-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* md/raid1,raid10: silence warning about wait-within-waitNeilBrown2017-12-111-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | If you prepare_to_wait() after a previous prepare_to_wait(), but before calling schedule(), you get warning: do not call blocking ops when !TASK_RUNNING; state=2 This is appropriate as it is often a bug. The event that the first prepare_to_wait() expects might wake up the schedule following the second prepare_to_wait(), which could be confusing. However if both prepare_to_wait()s are part of simple wait_event() loops, and if the inner one is rarely called, then there is no problem. The inner loop is too simple to get confused by a stray wakeup, and the outer loop won't spin unduly because the inner doesnt affect it often. This pattern occurs in both raid1.c and raid10.c in the use of flush_pending_writes(). The warning can be silenced by setting current->state to TASK_RUNNING. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1/10: add missed blk plugShaohua Li2017-12-011-0/+4
| | | | | | | flush_pending_writes isn't always called with block plug, so add it, and plug works in nested way. Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: Use a small window for raid10 resyncGuoqing Jiang2017-11-021-1/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window is maintained in r10conf as cluster_sync_low and cluster_sync_high, and processed in raid10's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Set cluster_sync_high to cluster_sync_low + stripe size. 3. Send a message to all nodes so they may add it in their suspension list. Note: We only support "near" raid10 so far, resync a far or offset raid10 array could have trouble. So raid10_run checks the layout of clustered raid10, it will refuse to run if the layout is not correct. With the "near" layout we process one stripe at a time progressing monotonically through the address space. So we can have a sliding window of whole-stripes which moves through the array suspending IO on other nodes, and both resync which uses array addresses and recovery which uses device addresses can stay within this window. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>