summaryrefslogtreecommitdiffstats
path: root/drivers/md (follow)
Commit message (Collapse)AuthorAgeFilesLines
* dm stats: fix a leaked s->histogram_boundaries arrayMikulas Patocka2017-02-161-0/+1
| | | | | | | Fixes: dfcfac3e4cd9 ("dm stats: collect and report histogram of IO latencies") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm space map metadata: constify dm_space_map structuresBhumika Goyal2017-02-161-2/+2
| | | | | | | | | | | | | | | | | | Declare dm_space_map structures as const as they are only passed as an argument to the function memcpy. This argument is of type const void *, so dm_space_map structures having this property can be declared as const. File size before: text data bss dec hex filename 4889 240 0 5129 1409 dm-space-map-metadata.o File size after: text data bss dec hex filename 5139 0 0 5139 1413 dm-space-map-metadata.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()Mike Snitzer2017-02-161-7/+26
| | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm persistent data: add cursor skip functions to the cursor APIsJoe Thornber2017-02-166-0/+70
| | | | | Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm cache metadata: use dm_bitset_new() to create the dirty bitset in format 2Joe Thornber2017-02-161-11/+11
| | | | | | | Big speed up with large configs. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm bitset: add dm_bitset_new()Joe Thornber2017-02-162-0/+58
| | | | | | | A more efficient way of creating a populated bitset. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm cache metadata: name the cache block that couldn't be loadedMike Snitzer2017-02-161-4/+8
| | | | | | | | Improves __load_mapping_v1() and __load_mapping_v2() DMERR messages to explicitly name the cache block number whose mapping couldn't be loaded. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm cache metadata: add "metadata2" featureJoe Thornber2017-02-163-53/+274
| | | | | | | | | If "metadata2" is provided as a table argument when creating/loading a cache target a more compact metadata format, with separate dirty bits, is used. "metadata2" improves speed of shutting down a cache target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm cache metadata: use bitset cursor api to load discard bitsetJoe Thornber2017-02-161-20/+28
| | | | | Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm bitset: introduce cursor apiJoe Thornber2017-02-162-0/+91
| | | | | Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm btree: use GFP_NOFS in dm_btree_del()Joe Thornber2017-02-161-1/+6
| | | | | | | dm_btree_del() is called from an ioctl so don't recurse into FS. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm space map common: memcpy the disk root to ensure it's arch alignedJoe Thornber2017-02-161-5/+11
| | | | | | | | | | | | | The metadata_space_map_root passed to sm_ll_open_metadata() may or may not be arch aligned, use memcpy to ensure it is. This is not a fast path so the extra memcpy doesn't hurt us. Long-term it'd be better to use the kernel's alignment infrastructure to remove the memcpy()s that are littered across persistent-data (btree, array, space-maps, etc). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm block manager: add unlikely() annotations on dm_bufio error pathsJoe Thornber2017-02-161-4/+4
| | | | | Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm cache: fix corruption seen when using cache > 2TBJoe Thornber2017-02-161-3/+3
| | | | | | | | | | | A rounding bug due to compiler generated temporary being 32bit was found in remap_to_cache(). A localized cast in remap_to_cache() fixes the corruption but this preferred fix (changing from uint32_t to sector_t) eliminates potential for future rounding errors elsewhere. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: cleanup awkward branching in raid_message() option processingMike Snitzer2017-01-251-3/+4
| | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: use mddev rather than rdev->mddevHeinz Mauelshagen2017-01-251-1/+1
| | | | | Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: use read_disk_sb() throughoutHeinz Mauelshagen2017-01-251-8/+9
| | | | | | | | | | | For consistency, call read_disk_sb() from attempt_restore_of_faulty_devices() instead of calling sync_page_io() directly. Explicitly set device to faulty on superblock read error. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: add raid4/5/6 journaling supportHeinz Mauelshagen2017-01-251-21/+140
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add md raid4/5/6 journaling support (upstream commit bac624f3f86a started the implementation) which closes the write hole (i.e. non-atomic updates to stripes) using a dedicated journal device. Background: raid4/5/6 stripes hold N data payloads per stripe plus one parity raid4/5 or two raid6 P/Q syndrome payloads in an in-memory stripe cache. Parity or P/Q syndromes used to recover any data payloads in case of a disk failure are calculated from the N data payloads and need to be updated on the different component devices of the raid device. Those are non-atomic, persistent updates. Hence a crash can cause failure to update all stripe payloads persistently and thus cause data loss during stripe recovery. This problem gets addressed by writing whole stripe cache entries (together with journal metadata) to a persistent journal entry on a dedicated journal device. Only if that journal entry is written successfully, the stripe cache entry is updated on the component devices of the raid device (i.e. writethrough type). In case of a crash, the entry can be recovered from the journal and be written again thus ensuring consistent stripe payload suitable to data recovery. Future dependencies: once writeback caching being worked on to compensate for the throughput implictions involved with writethrough overhead is supported with journaling in upstream, an additional patch based on this one will support it in dm-raid. Journal resilience related remarks: because stripes are recovered from the journal in case of a crash, the journal device better be resilient. Resilience becomes mandatory with future writeback support, because loosing the working set in the log means data loss as oposed to writethrough, were the loss of the journal device 'only' reintroduces the write hole. Fix comment on data offsets in parse_dev_params() and initialize new_data_offset as well. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: be prepared to accept arbitrary '- -' tuplesHeinz Mauelshagen2017-01-251-5/+23
| | | | | | | | | | | | | | | During raid set resize checks and setting up the recovery offset in case a raid set grows, calculated rd->md.dev_sectors is compared to rs->dev[0].rdev.sectors. Device 0 may not be defined in case userspace passes in '- -' for it (lvm2 doesn't do that so far), thus it's device sectors can't be taken authoritatively in this comparison and another valid device must be used to retrieve the device size. Use mddev->dev_sectors in checking for ongoing recovery for the same reason. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm raid: fix transient device failure processingHeinz Mauelshagen2017-01-251-49/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fix addresses the following 3 failure scenarios: 1) If a (transiently) inaccessible metadata device is being passed into the constructor (e.g. a device tuple '254:4 254:5'), it is processed as if '- -' was given. This erroneously results in a status table line containing '- -', which mistakenly differs from what has been passed in. As a result, userspace libdevmapper puts the device tuple seperate from the RAID device thus not processing the dependencies properly. 2) False health status char 'A' instead of 'D' is emitted on the status status info line for the meta/data device tuple in this metadata device failure case. 3) If the metadata device is accessible when passed into the constructor but the data device (partially) isn't, that leg may be set faulty by the raid personality on access to the (partially) unavailable leg. Restore tried in a second raid device resume on such failed leg (status char 'D') fails after the (partial) leg returned. Fixes for aforementioned failure scenarios: - don't release passed in devices in the constructor thus allowing the status table line to e.g. contain '254:4 254:5' rather than '- -' - emit device status char 'D' rather than 'A' for the device tuple with the failed metadata device on the status info line - when attempting to restore faulty devices in a second resume, allow the device hot remove function to succeed by setting the device to not in-sync In case userspace intentionally passes '- -' into the constructor to avoid that device tuple (e.g. to split off a raid1 leg temporarily for later re-addition), the status table line will correctly show '- -' and the status info line will provide a '-' device health character for the non-defined device tuple. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* md/raid5: Use correct IS_ERR() variation on pointer checkJes Sorensen2017-01-091-1/+1
| | | | | | | | This fixes a build error on certain architectures, such as ppc64. Fixes: 6995f0b247e("md: takeover should clear unrelated bits") Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: cleanup mddev flag clear for takeoverShaohua Li2017-01-054-7/+26
| | | | | | | | | | Commit 6995f0b (md: takeover should clear unrelated bits) clear unrelated bits, but it's quite fragile. To avoid error in the future, define a macro for unsupported mddev flags for each raid type and use it to clear unsupported mddev flags. This should be less error-prone. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/r5cache: fix spelling mistake on "recoverying"Colin Ian King2017-01-051-1/+1
| | | | | | | | Trivial fix to spelling mistake "recoverying" to "recovering" in pr_dbg message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/r5cache: assign conf->log before r5l_load_log()Song Liu2017-01-051-1/+3
| | | | | | | | | | r5l_load_log() calls functions that requires a proper conf->log, for example, r5c_is_writeback(). Therefore, we should set conf->log before calling r5l_load_log(). If r5l_load_log() fails, conf->log is set back to NULL. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/r5cache: simplify handling of sh->log_start in recoverySong Liu2017-01-051-15/+12
| | | | | | | | | | | | | We only need to update sh->log_start at the end of recovery, which is r5c_recovery_rewrite_data_only_stripes(), so it is not necessary to set it before that. In this patch, log_start is removed from r5c_recovery_alloc_stripe(). After updating all sh->log_start, rewrite_data_only_stripes() also updates log->next_checkpoints to the last sh->log_start. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid5-cache: removes unnecessary write-through mode judgmentsJackieLiu2017-01-051-3/+0
| | | | | | | | | The write-through mode has been returned in front of the function, do not need to do it again. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid10: Refactor raid10_make_requestRobert LeBlanc2017-01-031-105/+140
| | | | | | | | | | Refactor raid10_make_request into seperate read and write functions to clean up the code. Shaohua: add the recovery check back to read path Signed-off-by: Robert LeBlanc <robert@leblancnet.us> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1: Refactor raid1_make_requestRobert LeBlanc2017-01-031-128/+139
| | | | | | | | Refactor raid1_make_request to make read and write code in their own functions to clean up the code. Signed-off-by: Robert LeBlanc <robert@leblancnet.us> Signed-off-by: Shaohua Li <shli@fb.com>
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2016-12-241-1/+1
| | | | | | | | | | | | | This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bcache: partition support: add 16 minors per bcacheN deviceEric Wheeler2016-12-171-1/+4
| | | | | Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Wido den Hollander <wido@widodh.nl>
* bcache: Make gc wakeup sane, remove set_task_state()Kent Overstreet2016-12-175-26/+26
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* linux: drop __bitwise__ everywhereMichael S. Tsirkin2016-12-151-3/+3
| | | | | | | | | | | | | __bitwise__ used to mean "yes, please enable sparse checks unconditionally", but now that we dropped __CHECK_ENDIAN__ __bitwise is exactly the same. There aren't many users, replace it by __bitwise everywhere. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Akced-by: Lee Duncan <lduncan@suse.com>
* Merge tag 'dm-4.10-changes' of ↵Linus Torvalds2016-12-1419-173/+406
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and improvements to request-based DM and DM multipath - some locking improvements in DM bufio - add Kconfig option to disable the DM block manager's extra locking which mainly serves as a developer tool - a few bug fixes to DM's persistent-data - a couple changes to prepare for multipage biovec support in the block layer - various improvements and cleanups in the DM core, DM cache, DM raid and DM crypt - add ability to have DM crypt use keys from the kernel key retention service - add a new "error_writes" feature to the DM flakey target, reads are left unchanged in this mode * tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (40 commits) dm flakey: introduce "error_writes" feature dm cache policy smq: use hash_32() instead of hash_32_generic() dm crypt: reject key strings containing whitespace chars dm space map: always set ev if sm_ll_mutate() succeeds dm space map metadata: skip useless memcpy in metadata_ll_init_index() dm space map metadata: fix 'struct sm_metadata' leak on failed create Documentation: dm raid: define data_offset status field dm raid: fix discard support regression dm raid: don't allow "write behind" with raid4/5/6 dm mpath: use hw_handler_params if attached hw_handler is same as requested dm crypt: add ability to use keys from the kernel key retention service dm array: remove a dead assignment in populate_ablock_with_values() dm ioctl: use offsetof() instead of open-coding it dm rq: simplify use_blk_mq initialization dm: use blk_set_queue_dying() in __dm_destroy() dm bufio: drop the lock when doing GFP_NOIO allocation dm bufio: don't take the lock in dm_bufio_shrink_count dm bufio: avoid sleeping while holding the dm_bufio lock dm table: simplify dm_table_determine_type() dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device ...
| * dm flakey: introduce "error_writes" featureMike Snitzer2016-12-131-9/+42
| | | | | | | | | | | | | | | | | | | | | | Recent dm-flakey fixes, to have reads error out during the "down" interval, made it so that the previous read behaviour is no longer available. It is useful to have reads complete like normal but have writes error out, so make it possible again with a new "error_writes" feature. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy smq: use hash_32() instead of hash_32_generic()Mike Snitzer2016-12-091-1/+1
| | | | | | | | | | | | | | Switch to using hash_32() because hash_32_generic() should only be used by the kernel's selftests. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm crypt: reject key strings containing whitespace charsOndrej Kozina2016-12-081-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately key_string may theoretically contain whitespace even after it's processed by dm_split_args(). The reason for this is DM core supports escaping of almost all chars including any whitespace. If userspace passes a key to the kernel in format ":32:logon:my_prefix:my\ key" dm-crypt will look up key "my_prefix:my key" in kernel keyring service. So far everything's fine. Unfortunately if userspace later calls DM_TABLE_STATUS ioctl, it will not receive back expected ":32:logon:my_prefix:my\ key" but the unescaped version instead. Also userpace (most notably cryptsetup) is not ready to parse single target argument containing (even escaped) whitespace chars and any whitespace is simply taken as delimiter of another argument. This effect is mitigated by the fact libdevmapper curently performs double escaping of '\' char. Any user input in format "x\ x" is transformed into "x\\ x" before being passed to the kernel. Nonetheless dm-crypt may be used without libdevmapper. Therefore the near-term solution to this is to reject any key string containing whitespace. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm space map: always set ev if sm_ll_mutate() succeedsBenjamin Marzinski2016-12-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If no block was allocated or freed, sm_ll_mutate() wasn't setting *ev, leaving the variable unitialized. sm_ll_insert(), sm_disk_inc_block(), and sm_disk_new_block() all check ev to see if there was an allocation event in sm_ll_mutate(), possibly reading unitialized data. If no allocation event occured, sm_ll_mutate() should set *ev to SM_NONE. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm space map metadata: skip useless memcpy in metadata_ll_init_index()Benjamin Marzinski2016-12-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | When metadata_ll_init_index() is called by sm_ll_new_metadata(), ll->mi_le hasn't been initialized yet. So, when metadata_ll_init_index() copies the contents of ll->mi_le into the newly allocated bitmap_root, it is just copying garbage. ll->mi_le will be allocated later in sm_ll_extend() and copied into the bitmap_root, in sm_ll_commit(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm space map metadata: fix 'struct sm_metadata' leak on failed createBenjamin Marzinski2016-12-081-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dm_sm_metadata_create() we temporarily change the dm_space_map operations from 'ops' (whose .destroy function deallocates the sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't). If dm_sm_metadata_create() fails in sm_ll_new_metadata() or sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls dm_sm_destroy() with the intention of freeing the sm_metadata, but it doesn't (because the dm_space_map operations is still set to 'bootstrap_ops'). Fix this by setting the dm_space_map operations back to 'ops' if dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm raid: fix discard support regressionHeinz Mauelshagen2016-12-081-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ecbfb9f118 ("dm raid: add raid level takeover support") moved the configure_discard_support() call from raid_ctr() to raid_preresume(). Enabling/disabling discard _must_ happen during table load (through the .ctr hook). Fix this regression by moving the configure_discard_support() call back to raid_ctr(). Fixes: ecbfb9f118 ("dm raid: add raid level takeover support") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm raid: don't allow "write behind" with raid4/5/6Heinz Mauelshagen2016-12-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove CTR_FLAG_MAX_WRITE_BEHIND from raid4/5/6's valid ctr flags. Only the md raid1 personality supports setting a maximum number of "write behind" write IOs on any legs set to "write mostly". "write mostly" enhances throughput with slow links/disks. Technically the "write behind" value is a write intent bitmap property only being respected by the raid1 personality. It allows a maximum number of "write behind" writes to any "write mostly" raid1 mirror legs to be delayed and avoids reads from such legs. No other MD personalities supported via dm-raid make use of "write behind", thus setting this property is superfluous; it wouldn't cause harm but it is correct to reject it. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm mpath: use hw_handler_params if attached hw_handler is same as requestedtang.junhui2016-12-081-5/+9
| | | | | | | | | | | | | | | | Let the requested m->hw_handler_params be used if the attached hardware handler is the same handler as requested with m->hw_handler_name. Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm crypt: add ability to use keys from the kernel key retention serviceOndrej Kozina2016-12-081-13/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel key service is a generic way to store keys for the use of other subsystems. Currently there is no way to use kernel keys in dm-crypt. This patch aims to fix that. Instead of key userspace may pass a key description with preceding ':'. So message that constructs encryption mapping now looks like this: <cipher> [<key>|:<key_string>] <iv_offset> <dev_path> <start> [<#opt_params> <opt_params>] where <key_string> is in format: <key_size>:<key_type>:<key_description> Currently we only support two elementary key types: 'user' and 'logon'. Keys may be loaded in dm-crypt either via <key_string> or using classical method and pass the key in hex representation directly. dm-crypt device initialised with a key passed in hex representation may be replaced with key passed in key_string format and vice versa. (Based on original work by Andrey Ryabinin) Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm array: remove a dead assignment in populate_ablock_with_values()Bart Van Assche2016-12-081-2/+0
| | | | | | | | | | | | | | A value is assigned to 'nr_entries' but is never used, remove it. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm ioctl: use offsetof() instead of open-coding itBart Van Assche2016-12-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | Subtracting sizes is a fragile approach because the result is only correct if the compiler has not added any padding at the end of the structure. Hence use offsetof() instead of size subtraction. An additional advantage of offsetof() is that it makes the intent more clear. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm rq: simplify use_blk_mq initializationBart Van Assche2016-12-081-5/+1
| | | | | | | | | | | | | | | | Use a single statement to declare and initialize 'use_blk_mq' instead of two statements. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: use blk_set_queue_dying() in __dm_destroy()Bart Van Assche2016-12-081-3/+1
| | | | | | | | | | | | | | | | | | | | After QUEUE_FLAG_DYING has been set any code that is waiting in get_request() should be woken up. But to get this behaviour blk_set_queue_dying() must be used instead of only setting QUEUE_FLAG_DYING. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm bufio: drop the lock when doing GFP_NOIO allocationMikulas Patocka2016-12-081-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | If the first allocation attempt using GFP_NOWAIT fails, drop the lock and retry using GFP_NOIO allocation (lock is dropped because the allocation can take some time). Note that we won't do GFP_NOIO allocation when we loop for the second time, because the lock shouldn't be dropped between __wait_for_free_buffer and __get_unclaimed_buffer. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm bufio: don't take the lock in dm_bufio_shrink_countMikulas Patocka2016-12-081-11/+2
| | | | | | | | | | | | | | | | | | | | dm_bufio_shrink_count() is called from do_shrink_slab to find out how many freeable objects are there. The reported value doesn't have to be precise, so we don't need to take the dm-bufio lock. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm bufio: avoid sleeping while holding the dm_bufio lockDouglas Anderson2016-12-081-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've seen in-field reports showing _lots_ (18 in one case, 41 in another) of tasks all sitting there blocked on: mutex_lock+0x4c/0x68 dm_bufio_shrink_count+0x38/0x78 shrink_slab.part.54.constprop.65+0x100/0x464 shrink_zone+0xa8/0x198 In the two cases analyzed, we see one task that looks like this: Workqueue: kverityd verity_prefetch_io __switch_to+0x9c/0xa8 __schedule+0x440/0x6d8 schedule+0x94/0xb4 schedule_timeout+0x204/0x27c schedule_timeout_uninterruptible+0x44/0x50 wait_iff_congested+0x9c/0x1f0 shrink_inactive_list+0x3a0/0x4cc shrink_lruvec+0x418/0x5cc shrink_zone+0x88/0x198 try_to_free_pages+0x51c/0x588 __alloc_pages_nodemask+0x648/0xa88 __get_free_pages+0x34/0x7c alloc_buffer+0xa4/0x144 __bufio_new+0x84/0x278 dm_bufio_prefetch+0x9c/0x154 verity_prefetch_io+0xe8/0x10c process_one_work+0x240/0x424 worker_thread+0x2fc/0x424 kthread+0x10c/0x114 ...and that looks to be the one holding the mutex. The problem has been reproduced on fairly easily: 0. Be running Chrome OS w/ verity enabled on the root filesystem 1. Pick test patch: http://crosreview.com/412360 2. Install launchBalloons.sh and balloon.arm from http://crbug.com/468342 ...that's just a memory stress test app. 3. On a 4GB rk3399 machine, run nice ./launchBalloons.sh 4 900 100000 ...that tries to eat 4 * 900 MB of memory and keep accessing. 4. Login to the Chrome web browser and restore many tabs With that, I've seen printouts like: DOUG: long bufio 90758 ms ...and stack trace always show's we're in dm_bufio_prefetch(). The problem is that we try to allocate memory with GFP_NOIO while we're holding the dm_bufio lock. Instead we should be using GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the lock and that causes the above problems. The current behavior explained by David Rientjes: It will still try reclaim initially because __GFP_WAIT (or __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of contention on dm_bufio_lock() that the thread holds. You want to pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a mutex that can be contended by a concurrent slab shrinker (if count_objects didn't use a trylock, this pattern would trivially deadlock). This change significantly increases responsiveness of the system while in this state. It makes a real difference because it unblocks kswapd. In the bug report analyzed, kswapd was hung: kswapd0 D ffffffc000204fd8 0 72 2 0x00000000 Call trace: [<ffffffc000204fd8>] __switch_to+0x9c/0xa8 [<ffffffc00090b794>] __schedule+0x440/0x6d8 [<ffffffc00090bac0>] schedule+0x94/0xb4 [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44 [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68 [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78 [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464 [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198 [<ffffffc00030e578>] balance_pgdat+0x328/0x508 [<ffffffc00030eb7c>] kswapd+0x424/0x51c [<ffffffc00023f06c>] kthread+0x10c/0x114 [<ffffffc000203dd0>] ret_from_fork+0x10/0x40 By unblocking kswapd memory pressure should be reduced. Suggested-by: David Rientjes <rientjes@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>