summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAPIra Weiny2021-03-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | The kernel test robot found that __kmap_local_sched_out() was not correctly skipping the guard pages when DEBUG_KMAP_LOCAL_FORCE_MAP was set.[1] This was due to DEBUG_HIGHMEM check being used. Change the configuration check to be correct. [1] https://lore.kernel.org/lkml/20210304083825.GB17830@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/20210318230657.1497881-1-ira.weiny@intel.com Fixes: 0e91a0c6984c ("mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> Cc: David Sterba <dsterba@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memblock: fix section mismatch warning againMike Rapoport2021-03-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 34dc2efb39a2 ("memblock: fix section mismatch warning") marked memblock_bottom_up() and memblock_set_bottom_up() as __init, but they could be referenced from non-init functions like memblock_find_in_range_node() on architectures that enable CONFIG_ARCH_KEEP_MEMBLOCK. For such builds kernel test robot reports: WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up() The function memblock_find_in_range_node() references the function __init memblock_bottom_up(). This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong. Replace __init annotations with __init_memblock annotations so that the appropriate section will be selected depending on CONFIG_ARCH_KEEP_MEMBLOCK. Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org Fixes: 34dc2efb39a2 ("memblock: fix section mismatch warning") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: make compatible with kmemleakMarco Elver2021-03-252-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because memblock allocations are registered with kmemleak, the KFENCE pool was seen by kmemleak as one large object. Later allocations through kfence_alloc() that were registered with kmemleak via slab_post_alloc_hook() would then overlap and trigger a warning. Therefore, once the pool is initialized, we can remove (free) it from kmemleak again, since it should be treated as allocator-internal and be seen as "free memory". The second problem is that kmemleak is passed the rounded size, and not the originally requested size, which is also the size of KFENCE objects. To avoid kmemleak scanning past the end of an object and trigger a KFENCE out-of-bounds error, fix the size if it is a KFENCE object. For simplicity, to avoid a call to kfence_ksize() in slab_post_alloc_hook() (and avoid new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) guard), just call kfence_ksize() in mm/kmemleak.c:create_object(). Link: https://lkml.kernel.org/r/20210317084740.3099921-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Luis Henriques <lhenriques@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* gcov: fix clang-11+ supportNick Desaulniers2021-03-251-0/+69
| | | | | | | | | | | | | | | | | | | | | | | LLVM changed the expected function signatures for llvm_gcda_start_file() and llvm_gcda_emit_function() in the clang-11 release. Users of clang-11 or newer may have noticed their kernels failing to boot due to a panic when enabling CONFIG_GCOV_KERNEL=y +CONFIG_GCOV_PROFILE_ALL=y. Fix up the function signatures so calling these functions doesn't panic the kernel. Link: https://reviews.llvm.org/rGcdd683b516d147925212724b09ec6fb792a40041 Link: https://reviews.llvm.org/rG13a633b438b6500ecad9e4f936ebadf3411d0f44 Link: https://lkml.kernel.org/r/20210312224132.3413602-2-ndesaulniers@google.com Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reported-by: Prasad Sodagudi <psodagud@quicinc.com> Suggested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ia64: fix format strings for err_injectSergei Trofimovich2021-03-251-11/+11
| | | | | | | | | | | | | | | | Fix warning with %lx / u64 mismatch: arch/ia64/kernel/err_inject.c: In function 'show_resources': arch/ia64/kernel/err_inject.c:62:22: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'u64' {aka 'long long unsigned int'} 62 | return sprintf(buf, "%lx", name[cpu]); \ | ^~~~~~~ Link: https://lkml.kernel.org/r/20210313104312.1548232-1-slyfox@gentoo.org Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ia64: mca: allocate early mca with GFP_ATOMICSergei Trofimovich2021-03-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sleep warning happens at early boot right at secondary CPU activation bootup: smp: Bringing up secondary CPUs ... BUG: sleeping function called from invalid context at mm/page_alloc.c:4942 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.12.0-rc2-00007-g79e228d0b611-dirty #99 .. Call Trace: show_stack+0x90/0xc0 dump_stack+0x150/0x1c0 ___might_sleep+0x1c0/0x2a0 __might_sleep+0xa0/0x160 __alloc_pages_nodemask+0x1a0/0x600 alloc_page_interleave+0x30/0x1c0 alloc_pages_current+0x2c0/0x340 __get_free_pages+0x30/0xa0 ia64_mca_cpu_init+0x2d0/0x3a0 cpu_init+0x8b0/0x1440 start_secondary+0x60/0x700 start_ap+0x750/0x780 Fixed BSP b0 value from CPU 1 As I understand interrupts are not enabled yet and system has a lot of memory. There is little chance to sleep and switch to GFP_ATOMIC should be a no-op. Link: https://lkml.kernel.org/r/20210315085045.204414-1-slyfox@gentoo.org Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* squashfs: fix xattr id and id lookup sanity checksPhillip Lougher2021-03-252-4/+8
| | | | | | | | | | | | | The checks for maximum metadata block size is missing SQUASHFS_BLOCK_OFFSET (the two byte length count). Link: https://lkml.kernel.org/r/2069685113.2081245.1614583677427@webmail.123-reg.co.uk Fixes: f37aa4c7366e23f ("squashfs: add more sanity checks in id lookup") Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Sean Nyekjaer <sean@geanix.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* squashfs: fix inode lookup sanity checksSean Nyekjaer2021-03-252-2/+7
| | | | | | | | | | | | | | | | When mouting a squashfs image created without inode compression it fails with: "unable to read inode lookup table" It turns out that the BLOCK_OFFSET is missing when checking the SQUASHFS_METADATA_SIZE agaist the actual size. Link: https://lkml.kernel.org/r/20210226092903.1473545-1-sean@geanix.com Fixes: eabac19e40c0 ("squashfs: add more sanity checks in inode lookup") Signed-off-by: Sean Nyekjaer <sean@geanix.com> Acked-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* z3fold: prevent reclaim/free race for headless pagesThomas Hebb2021-03-251-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ca0246bb97c2 ("z3fold: fix possible reclaim races") introduced the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page release." By atomically testing and setting the bit in each of z3fold_free() and z3fold_reclaim_page(), a double-free was avoided. However, commit dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") appears to have unintentionally broken this behavior by moving the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock gets taken, which only happens for non-headless pages. For headless pages, the check is now skipped entirely and races can occur again. I have observed such a race on my system: page:00000000ffbd76b7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x165316 flags: 0x2ffff0000000000() raw: 02ffff0000000000 ffffea0004535f48 ffff8881d553a170 0000000000000000 raw: 0000000000000000 0000000000000011 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:707! invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: G B 5.10.7-arch1-1-kasan #1 Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS F9b 03/03/2016 Workqueue: zswap-shrink shrink_worker RIP: 0010:__free_pages+0x10a/0x130 Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07 RSP: 0000:ffff88819a2ffb98 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffea000594c5a8 RCX: 0000000000000000 RDX: 1ffffd4000b298b7 RSI: 0000000000000000 RDI: ffffea000594c5b8 RBP: ffffea000594c580 R08: 000000000000003e R09: ffff8881d5520bbb R10: ffffed103aaa4177 R11: 0000000000000001 R12: ffffea000594c5b4 R13: 0000000000000000 R14: ffff888165316000 R15: ffffea000594c588 FS: 0000000000000000(0000) GS:ffff8881d5500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7c8c3654d8 CR3: 0000000103f42004 CR4: 00000000001706e0 Call Trace: z3fold_zpool_shrink+0x9b6/0x1240 shrink_worker+0x35/0x90 process_one_work+0x70c/0x1210 worker_thread+0x539/0x1200 kthread+0x330/0x400 ret_from_fork+0x22/0x30 Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart ---[ end trace 126d646fc3dc0ad8 ]--- To fix the issue, re-add the earlier test and set in the case where we have a headless page. Link: https://lkml.kernel.org/r/c8106dbe6d8390b290cd1d7f873a2942e805349e.1615452048.git.tommyhebb@gmail.com Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") Signed-off-by: Thomas Hebb <tommyhebb@gmail.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Jongseok Kim <ks77sj@gmail.com> Cc: Snild Dolkow <snild@sony.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* selftests/vm: fix out-of-tree buildRong Chen2021-03-251-2/+2
| | | | | | | | | | | | | When building out-of-tree, attempting to make target from $(OUTPUT) directory: make[1]: *** No rule to make target '$(OUTPUT)/protection_keys.c', needed by '$(OUTPUT)/protection_keys_32'. Link: https://lkml.kernel.org/r/20210315094700.522753-1-rong.a.chen@intel.com Signed-off-by: Rong Chen <rong.a.chen@intel.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mmu_notifiers: ensure range_end() is paired with range_start()Sean Christopherson2021-03-252-5/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If one or more notifiers fails .invalidate_range_start(), invoke .invalidate_range_end() for "all" notifiers. If there are multiple notifiers, those that did not fail are expecting _start() and _end() to be paired, e.g. KVM's mmu_notifier_count would become imbalanced. Disallow notifiers that can fail _start() from implementing _end() so that it's unnecessary to either track which notifiers rejected _start(), or had already succeeded prior to a failed _start(). Note, the existing behavior of calling _start() on all notifiers even after a previous notifier failed _start() was an unintented "feature". Make it canon now that the behavior is depended on for correctness. As of today, the bug is likely benign: 1. The only caller of the non-blocking notifier is OOM kill. 2. The only notifiers that can fail _start() are the i915 and Nouveau drivers. 3. The only notifiers that utilize _end() are the SGI UV GRU driver and KVM. 4. The GRU driver will never coincide with the i195/Nouveau drivers. 5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the _guest_, and the guest is already doomed due to being an OOM victim. Fix the bug now to play nice with future usage, e.g. KVM has a potential use case for blocking memslot updates in KVM while an invalidation is in-progress, and failure to unblock would result in said updates being blocked indefinitely and hanging. Found by inspection. Verified by adding a second notifier in KVM that periodically returns -EAGAIN on non-blockable ranges, triggering OOM, and observing that KVM exits with an elevated notifier count. Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") Signed-off-by: Sean Christopherson <seanjc@google.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kasan: fix per-page tags for non-page_alloc pagesAndrey Konovalov2021-03-251-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To allow performing tag checks on page_alloc addresses obtained via page_address(), tag-based KASAN modes store tags for page_alloc allocations in page->flags. Currently, the default tag value stored in page->flags is 0x00. Therefore, page_address() returns a 0x00ffff... address for pages that were not allocated via page_alloc. This might cause problems. A particular case we encountered is a conflict with KFENCE. If a KFENCE-allocated slab object is being freed via kfree(page_address(page) + offset), the address passed to kfree() will get tagged with 0x00 (as slab pages keep the default per-page tags). This leads to is_kfence_address() check failing, and a KFENCE object ending up in normal slab freelist, which causes memory corruptions. This patch changes the way KASAN stores tag in page-flags: they are now stored xor'ed with 0xff. This way, KASAN doesn't need to initialize per-page flags for every created page, which might be slow. With this change, page_address() returns natively-tagged (with 0xff) pointers for pages that didn't have tags set explicitly. This patch fixes the encountered conflict with KFENCE and prevents more similar issues that can occur in the future. Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappingsMiaohe Lin2021-03-253-8/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of hugetlb_cgroup for shared mappings could have different behavior. Consider the following two scenarios: 1.Assume initial css reference count of hugetlb_cgroup is 1: 1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference count is 2 associated with 1 file_region. 1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference count is 3 associated with 2 file_region. 1.3 coalesce_file_region will coalesce these two file_regions into one. So css reference count is 3 associated with 1 file_region now. 2.Assume initial css reference count of hugetlb_cgroup is 1 again: 2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference count is 2 associated with 1 file_region. Therefore, we might have one file_region while holding one or more css reference counts. This inconsistency could lead to imbalanced css_get() and css_put() pair. If we do css_put one by one (i.g. hole punch case), scenario 2 would put one more css reference. If we do css_put all together (i.g. truncate case), scenario 1 will leak one css reference. The imbalanced css_get() and css_put() pair would result in a non-zero reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed __but__ associated resource is not freed. This might result in OOM or can not create a new hugetlb cgroup in a busy workload ultimately. In order to fix this, we have to make sure that one file_region must hold exactly one css reference. So in coalesce_file_region case, we should release one css reference before coalescence. Also only put css reference when the entire file_region is removed. The last thing to note is that the caller of region_add() will only hold one reference to h_cg->css for the whole contiguous reservation region. But this area might be scattered when there are already some file_regions reside in it. As a result, many file_regions may share only one h_cg->css reference. In order to ensure that one file_region must hold exactly one css reference, we should do css_get() for each file_region and release the reference held by caller when they are done. [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings] Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR) Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'linux-kselftest-kunit-fixes-5.12-rc5.1' of ↵Linus Torvalds2021-03-232-1/+3
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit fixes from Shuah Khan: "Two fixes to the kunit tool from David Gow" * tag 'linux-kselftest-kunit-fixes-5.12-rc5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: tool: Disable PAGE_POISONING under --alltests kunit: tool: Fix a python tuple typing error
| * kunit: tool: Disable PAGE_POISONING under --alltestsDavid Gow2021-03-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kunit_tool maintains a list of config options which are broken under UML, which we exclude from an otherwise 'make ARCH=um allyesconfig' build used to run all tests with the --alltests option. Something in UML allyesconfig is causing segfaults when page poisining is enabled (and is poisoning with a non-zero value). Previously, this didn't occur, as allyesconfig enabled the CONFIG_PAGE_POISONING_ZERO option, which worked around the problem by zeroing memory. This option has since been removed, and memory is now poisoned with 0xAA, which triggers segfaults in many different codepaths, preventing UML from booting. Note that we have to disable both CONFIG_PAGE_POISONING and CONFIG_DEBUG_PAGEALLOC, as the latter will 'select' the former on architectures (such as UML) which don't implement __kernel_map_pages(). Ideally, we'd fix this properly by tracking down the real root cause, but since this is breaking KUnit's --alltests feature, it's worth disabling there in the meantime so the kernel can boot to the point where tests can actually run. Fixes: f289041ed4cf ("mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO") Signed-off-by: David Gow <davidgow@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * kunit: tool: Fix a python tuple typing errorDavid Gow2021-03-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The first argument to namedtuple() should match the name of the type, which wasn't the case for KconfigEntryBase. Fixing this is enough to make mypy show no python typing errors again. Fixes 97752c39bd ("kunit: kunit_tool: Allow .kunitconfig to disable config items") Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Daniel Latypov <dlatypov@google.com> Acked-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* | Merge tag 'selinux-pr-20210322' of ↵Linus Torvalds2021-03-223-41/+59
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux fixes from Paul Moore: "Three SELinux patches: - Fix a problem where a local variable is used outside its associated function. Thankfully this can only be triggered by reloading the SELinux policy, which is a restricted operation for other obvious reasons. - Fix some incorrect, and inconsistent, audit and printk messages when loading the SELinux policy. All three patches are relatively minor and have been through our testing with no failures" * tag 'selinux-pr-20210322' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinuxfs: unify policy load error reporting selinux: fix variable scope issue in live sidtab conversion selinux: don't log MAC_POLICY_LOAD record on failed policy load
| * | selinuxfs: unify policy load error reportingOndrej Mosnacek2021-03-191-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's drop the pr_err()s from sel_make_policy_nodes() and just add one pr_warn_ratelimited() call to the sel_make_policy_nodes() error path in sel_write_load(). Changing from error to warning makes sense, since after 02a52c5c8c3b ("selinux: move policy commit after updating selinuxfs"), this error path no longer leads to a broken selinuxfs tree (it's just kept in the original state and policy load is aborted). I also added _ratelimited to be consistent with the other prtin in the same function (it's probably not necessary, but can't really hurt... there are likely more important error messages to be printed when filesystem entry creation starts erroring out). Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
| * | selinux: fix variable scope issue in live sidtab conversionOndrej Mosnacek2021-03-193-33/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 02a52c5c8c3b ("selinux: move policy commit after updating selinuxfs") moved the selinux_policy_commit() call out of security_load_policy() into sel_write_load(), which caused a subtle yet rather serious bug. The problem is that security_load_policy() passes a reference to the convert_params local variable to sidtab_convert(), which stores it in the sidtab, where it may be accessed until the policy is swapped over and RCU synchronized. Before 02a52c5c8c3b, selinux_policy_commit() was called directly from security_load_policy(), so the convert_params pointer remained valid all the way until the old sidtab was destroyed, but now that's no longer the case and calls to sidtab_context_to_sid() on the old sidtab after security_load_policy() returns may cause invalid memory accesses. This can be easily triggered using the stress test from commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve performance"): ``` function rand_cat() { echo $(( $RANDOM % 1024 )) } function do_work() { while true; do echo -n "system_u:system_r:kernel_t:s0:c$(rand_cat),c$(rand_cat)" \ >/sys/fs/selinux/context 2>/dev/null || true done } do_work >/dev/null & do_work >/dev/null & do_work >/dev/null & while load_policy; do echo -n .; sleep 0.1; done kill %1 kill %2 kill %3 ``` Fix this by allocating the temporary sidtab convert structures dynamically and passing them among the selinux_policy_{load,cancel,commit} functions. Fixes: 02a52c5c8c3b ("selinux: move policy commit after updating selinuxfs") Cc: stable@vger.kernel.org Tested-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> [PM: merge fuzz in security.h and services.c] Signed-off-by: Paul Moore <paul@paul-moore.com>
| * | selinux: don't log MAC_POLICY_LOAD record on failed policy loadOndrej Mosnacek2021-03-191-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If sel_make_policy_nodes() fails, we should jump to 'out', not 'out1', as the latter would incorrectly log an MAC_POLICY_LOAD audit record, even though the policy hasn't actually been reloaded. The 'out1' jump label now becomes unused and can be removed. Fixes: 02a52c5c8c3b ("selinux: move policy commit after updating selinuxfs") Cc: stable@vger.kernel.org Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
* | | Linux 5.12-rc4v5.12-rc4Linus Torvalds2021-03-211-1/+1
| | |
* | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2021-03-2111-72/+168
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes for v5.12" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: initialize ret to suppress smatch warning ext4: stop inode update before return ext4: fix rename whiteout with fast commit ext4: fix timer use-after-free on failed mount ext4: fix potential error in ext4_do_update_inode ext4: do not try to set xattr into ea_inode if value is empty ext4: do not iput inode under running transaction in ext4_rename() ext4: find old entry again if failed to rename whiteout ext4: fix error handling in ext4_end_enable_verity() ext4: fix bh ref count on error paths fs/ext4: fix integer overflow in s_log_groups_per_flex ext4: add reclaim checks to xattr code ext4: shrink race window in ext4_should_retry_alloc()
| * | | ext4: initialize ret to suppress smatch warningTheodore Ts'o2021-03-211-1/+1
| | | | | | | | | | | | | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: stop inode update before returnPan Bian2021-03-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode update should be stopped before returing the error code. Signed-off-by: Pan Bian <bianpan2016@163.com> Link: https://lore.kernel.org/r/20210117085732.93788-1-bianpan2016@163.com Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix rename whiteout with fast commitHarshad Shirwadkar2021-03-213-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds rename whiteout support in fast commits. Note that the whiteout object that gets created is actually char device. Which imples, the function ext4_inode_journal_mode(struct inode *inode) would return "JOURNAL_DATA" for this inode. This has a consequence in fast commit code that it will make creation of the whiteout object a fast-commit ineligible behavior and thus will fall back to full commits. With this patch, this can be observed by running fast commits with rename whiteout and seeing the stats generated by ext4_fc_stats tracepoint as follows: ext4_fc_stats: dev 254:32 fc ineligible reasons: XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0, RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16; num_commits:6, ineligible: 6, numblks: 3 So in short, this patch guarantees that in case of rename whiteout, we fall back to full commits. Amir mentioned that instead of creating a new whiteout object for every rename, we can create a static whiteout object with irrelevant nlink. That will make fast commits to not fall back to full commit. But until this happens, this patch will ensure correctness by falling back to full commits. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix timer use-after-free on failed mountJan Kara2021-03-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When filesystem mount fails because of corrupted filesystem we first cancel the s_err_report timer reminding fs errors every day and only then we flush s_error_work. However s_error_work may report another fs error and re-arm timer thus resulting in timer use-after-free. Fix the problem by first flushing the work and only after that canceling the s_err_report timer. Reported-by: syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix potential error in ext4_do_update_inodeShijie Luo2021-03-211-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If set_large_file = 1 and errors occur in ext4_handle_dirty_metadata(), the error code will be overridden, go to out_brelse to avoid this situation. Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: do not try to set xattr into ea_inode if value is emptyzhangyi (F)2021-03-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot report a warning that ext4 may create an empty ea_inode if set an empty extent attribute to a file on the file system which is no free blocks left. WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640 ... Call trace: ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640 ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942 ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390 ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491 ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37 __vfs_setxattr+0x208/0x23c fs/xattr.c:177 ... Now, ext4 try to store extent attribute into an external inode if ext4_xattr_block_set() return -ENOSPC, but for the case of store an empty extent attribute, store the extent entry into the extent attribute block is enough. A simple reproduce below. fallocate test.img -l 1M mkfs.ext4 -F -b 2048 -O ea_inode test.img mount test.img /mnt dd if=/dev/zero of=/mnt/foo bs=2048 count=500 setfattr -n "user.test" /mnt/foo Reported-by: syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com Fixes: 9c6e7853c531 ("ext4: reserve space for xattr entries/names") Cc: stable@kernel.org Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: do not iput inode under running transaction in ext4_rename()zhangyi (F)2021-03-211-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ext4_rename(), when RENAME_WHITEOUT failed to add new entry into directory, it ends up dropping new created whiteout inode under the running transaction. After commit <9b88f9fb0d2> ("ext4: Do not iput inode under running transaction"), we follow the assumptions that evict() does not get called from a transaction context but in ext4_rename() it breaks this suggestion. Although it's not a real problem, better to obey it, so this patch add inode to orphan list and stop transaction before final iput(). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210303131703.330415-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: find old entry again if failed to rename whiteoutzhangyi (F)2021-03-211-2/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we failed to add new entry on rename whiteout, we cannot reset the old->de entry directly, because the old->de could have moved from under us during make indexed dir. So find the old entry again before reset is needed, otherwise it may corrupt the filesystem as below. /dev/sda: Entry '00000001' in ??? (12) has deleted/unused inode 15. CLEARED. /dev/sda: Unattached inode 75 /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. Fixes: 6b4b8e6b4ad ("ext4: fix bug for rename with RENAME_WHITEOUT") Cc: stable@vger.kernel.org Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210303131703.330415-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix error handling in ext4_end_enable_verity()Eric Biggers2021-03-111-34/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4 didn't properly clean up if verity failed to be enabled on a file: - It left verity metadata (pages past EOF) in the page cache, which would be exposed to userspace if the file was later extended. - It didn't truncate the verity metadata at all (either from cache or from disk) if an error occurred while setting the verity bit. Fix these bugs by adding a call to truncate_inode_pages() and ensuring that we truncate the verity metadata (both from cache and from disk) in all error paths. Also rework the code to cleanly separate the success path from the error paths, which makes it much easier to understand. Reported-by: Yunlei He <heyunlei@hihonor.com> Fixes: c93d8f885809 ("ext4: add basic fs-verity support") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210302200420.137977-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix bh ref count on error pathsZhaolong Zhang2021-03-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __ext4_journalled_writepage should drop bhs' ref count on error paths Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com> Link: https://lore.kernel.org/r/1614678151-70481-1-git-send-email-zhangzl2013@126.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | fs/ext4: fix integer overflow in s_log_groups_per_flexSabyrzhan Tasbolatov2021-03-061-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when 1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX, where sbi->s_mb_prefetch is unsigned integer type. 32 is the maximum allowed power of s_log_groups_per_flex. Following if check will also trigger UBSAN shift-out-of-bound: if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) { So I'm checking it against the raw number, perhaps there is another way to calculate UINT_MAX max power. Also use min_t as to make sure it's uint type. [1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24 shift exponent 60 is too large for 32-bit type 'int' Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x137/0x1be lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline] ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898 ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983 Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: add reclaim checks to xattr codeJan Kara2021-03-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot is reporting that ext4 can enter fs reclaim from kvmalloc() while the transaction is started like: fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 might_alloc include/linux/sched/mm.h:193 [inline] slab_pre_alloc_hook mm/slab.h:493 [inline] slab_alloc_node mm/slub.c:2817 [inline] __kmalloc_node+0x5f/0x430 mm/slub.c:4015 kmalloc_node include/linux/slab.h:575 [inline] kvmalloc_node+0x61/0xf0 mm/util.c:587 kvmalloc include/linux/mm.h:781 [inline] ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 This should be impossible since transaction start sets PF_MEMALLOC_NOFS. Add some assertions to the code to catch if something isn't working as expected early. Link: https://lore.kernel.org/linux-ext4/000000000000563a0205bafb7970@google.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210222171626.21884-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: shrink race window in ext4_should_retry_alloc()Eric Whitney2021-03-064-12/+39
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it fails at significant rates on the two test scenarios that disable delayed allocation (ext3conv and data_journal) and force actual block allocation for the fallocate and pwrite functions in the test. The failure rate on 5.10 for both ext3conv and data_journal on one test system typically runs about 85%. On 5.11, the failure rate on ext3conv sometimes drops to as low as 1% while the rate on data_journal increases to nearly 100%. The observed failures are largely due to ext4_should_retry_alloc() cutting off block allocation retries when s_mb_free_pending (used to indicate that a transaction in progress will free blocks) is 0. However, free space is usually available when this occurs during runs of generic/371. It appears that a thread attempting to allocate blocks is just missing transaction commits in other threads that increase the free cluster count and reset s_mb_free_pending while the allocating thread isn't running. Explicitly testing for free space availability avoids this race. The current code uses a post-increment operator in the conditional expression that determines whether the retry limit has been exceeded. This means that the conditional expression uses the value of the retry counter before it's increased, resulting in an extra retry cycle. The current code actually retries twice before hitting its retry limit rather than once. Increasing the retry limit to 3 from the current actual maximum retry count of 2 in combination with the change described above reduces the observed failure rate to less that 0.1% on both ext3conv and data_journal with what should be limited impact on users sensitive to the overhead caused by retries. A per filesystem percpu counter exported via sysfs is added to allow users or developers to track the number of times the retry limit is exceeded without resorting to debugging methods. This should provide some insight into worst case retry behavior. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-blockLinus Torvalds2021-03-213-7/+31
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring followup fixes from Jens Axboe: - The SIGSTOP change from Eric, so we properly ignore that for PF_IO_WORKER threads. - Disallow sending signals to PF_IO_WORKER threads in general, we're not interested in having them funnel back to the io_uring owning task. - Stable fix from Stefan, ensuring we properly break links for short send/sendmsg recv/recvmsg if MSG_WAITALL is set. - Catch and loop when needing to run task_work before a PF_IO_WORKER threads goes to sleep. * tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block: io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL io-wq: ensure task is running before processing task_work signal: don't allow STOP on PF_IO_WORKER threads signal: don't allow sending any signals to PF_IO_WORKER threads
| * | | io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with ↵Stefan Metzmacher2021-03-211-4/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MSG_WAITALL Without that it's not safe to use them in a linked combination with others. Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE should be possible. We already handle short reads and writes for the following opcodes: - IORING_OP_READV - IORING_OP_READ_FIXED - IORING_OP_READ - IORING_OP_WRITEV - IORING_OP_WRITE_FIXED - IORING_OP_WRITE - IORING_OP_SPLICE - IORING_OP_TEE Now we have it for these as well: - IORING_OP_SENDMSG - IORING_OP_SEND - IORING_OP_RECVMSG - IORING_OP_RECV For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC flags in order to call req_set_fail_links(). There might be applications arround depending on the behavior that even short send[msg]()/recv[msg]() retuns continue an IOSQE_IO_LINK chain. It's very unlikely that such applications pass in MSG_WAITALL, which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'. It's expected that the low level sock_sendmsg() call just ignores MSG_WAITALL, as MSG_ZEROCOPY is also ignored without explicitly set SO_ZEROCOPY. We also expect the caller to know about the implicit truncation to MAX_RW_COUNT, which we don't detect. cc: netdev@vger.kernel.org Link: https://lore.kernel.org/r/c4e1a4cc0d905314f4d5dc567e65a7b09621aab3.1615908477.git.metze@samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io-wq: ensure task is running before processing task_workJens Axboe2021-03-211-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark the current task as running if we need to run task_work from the io-wq threads as part of work handling. If that is the case, then return as such so that the caller can appropriately loop back and reset if it was part of a going-to-sleep flush. Fixes: 3bfe6106693b ("io-wq: fork worker threads from original task") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | signal: don't allow STOP on PF_IO_WORKER threadsEric W. Biederman2021-03-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like we don't allow normal signals to IO threads, don't deliver a STOP to a task that has PF_IO_WORKER set. The IO threads don't take signals in general, and have no means of flushing out a stop either. Longer term, we may want to look into allowing stop of these threads, as it relates to eg process freezing. For now, this prevents a spin issue if a SIGSTOP is delivered to the parent task. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | signal: don't allow sending any signals to PF_IO_WORKER threadsJens Axboe2021-03-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They don't take signals individually, and even if they share signals with the parent task, don't allow them to be delivered through the worker thread. Linux does allow this kind of behavior for regular threads, but it's really a compatability thing that we need not care about for the IO threads. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'staging-5.12-rc4' of ↵Linus Torvalds2021-03-2114-48/+75
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver fixes from Greg KH: "Some small staging and IIO driver fixes: - MAINTAINERS changes for the move of the staging mailing list - comedi driver fixes to get request_irq() to work correctly - counter driver fixes for reported issues with iio devices - tiny iio driver fixes for reported issues. All of these have been in linux-next with no reported problems" * tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: vt665x: fix alignment constraints staging: comedi: cb_pcidas64: fix request_irq() warn staging: comedi: cb_pcidas: fix request_irq() warn MAINTAINERS: move the staging subsystem to lists.linux.dev MAINTAINERS: move some real subsystems off of the staging mailing list iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler iio: hid-sensor-temperature: Fix issues of timestamp channel iio: hid-sensor-humidity: Fix alignment issue of timestamp channel counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register counter: stm32-timer-cnt: fix ceiling write max value counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED iio: adc: ab8500-gpadc: Fix off by 10 to 3 iio:adc:stm32-adc: Add HAS_IOMEM dependency iio: adis16400: Fix an error code in adis16400_initial_setup() iio: adc: adi-axi-adc: add proper Kconfig dependencies iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask iio: hid-sensor-prox: Fix scale not correct issue iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
| * | | | staging: vt665x: fix alignment constraintsEdmundo Carmona Antoranz2021-03-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Removing 2 instances of alignment warnings drivers/staging/vt6655/rxtx.h:153:1: warning: alignment 1 of ‘struct vnt_cts’ is less than 2 [-Wpacked-not-aligned] drivers/staging/vt6655/rxtx.h:163:1: warning: alignment 1 of ‘struct vnt_cts_fb’ is less than 2 [-Wpacked-not-aligned] The root cause seems to be that _because_ struct ieee80211_cts is marked as __aligned(2), this requires any encapsulating struct to also have an alignment of 2. Fixes: 2faf12c57efe ("staging: vt665x: fix alignment constraints") Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com> Link: https://lore.kernel.org/r/20210316181736.2553318-1-eantoranz@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | staging: comedi: cb_pcidas64: fix request_irq() warnTong Zhang2021-03-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | request_irq() wont accept a name which contains slash so we need to repalce it with something else -- otherwise it will trigger a warning and the entry in /proc/irq/ will not be created since the .name might be used by userspace and we don't want to break userspace, so we are changing the parameters passed to request_irq() [ 1.565966] name 'pci-das6402/16' [ 1.566149] WARNING: CPU: 0 PID: 184 at fs/proc/generic.c:180 __xlate_proc_name+0x93/0xb0 [ 1.568923] RIP: 0010:__xlate_proc_name+0x93/0xb0 [ 1.574200] Call Trace: [ 1.574722] proc_mkdir+0x18/0x20 [ 1.576629] request_threaded_irq+0xfe/0x160 [ 1.576859] auto_attach+0x60a/0xc40 [cb_pcidas64] Suggested-by: Ian Abbott <abbotti@mev.co.uk> Reviewed-by: Ian Abbott <abbotti@mev.co.uk> Signed-off-by: Tong Zhang <ztong0001@gmail.com> Link: https://lore.kernel.org/r/20210315195814.4692-1-ztong0001@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | staging: comedi: cb_pcidas: fix request_irq() warnTong Zhang2021-03-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | request_irq() wont accept a name which contains slash so we need to repalce it with something else -- otherwise it will trigger a warning and the entry in /proc/irq/ will not be created since the .name might be used by userspace and we don't want to break userspace, so we are changing the parameters passed to request_irq() [ 1.630764] name 'pci-das1602/16' [ 1.630950] WARNING: CPU: 0 PID: 181 at fs/proc/generic.c:180 __xlate_proc_name+0x93/0xb0 [ 1.634009] RIP: 0010:__xlate_proc_name+0x93/0xb0 [ 1.639441] Call Trace: [ 1.639976] proc_mkdir+0x18/0x20 [ 1.641946] request_threaded_irq+0xfe/0x160 [ 1.642186] cb_pcidas_auto_attach+0xf4/0x610 [cb_pcidas] Suggested-by: Ian Abbott <abbotti@mev.co.uk> Reviewed-by: Ian Abbott <abbotti@mev.co.uk> Signed-off-by: Tong Zhang <ztong0001@gmail.com> Link: https://lore.kernel.org/r/20210315195914.4801-1-ztong0001@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | MAINTAINERS: move the staging subsystem to lists.linux.devGreg Kroah-Hartman2021-03-161-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The drivers/staging/ tree has a new mailing list, linux-staging@lists.linux.dev, so move the MAINTAINER entry to point to it so that we get patches sent to the proper place. There was no need to specify a list for the hikey9xx driver, the tools pick up the "base" list for drivers/staging/* so remove that line to make the file simpler. Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/20210316102311.182375-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | MAINTAINERS: move some real subsystems off of the staging mailing listGreg Kroah-Hartman2021-03-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VME and Android drivers still have their MAINTAINERS entries pointing to the "driverdevel" mailing list, due to them having their codebase move out of the drivers/staging/ directory, but no one remembered to change the mailing list entries. Move them both to linux-kernel for lack of a more specific place at the moment. These are both low-volume areas of the kernel, so this shouldn't be an issue. Cc: Martyn Welch <martyn@welchs.me.uk> Cc: Manohar Vanga <manohar.vanga@gmail.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Todd Kjos <tkjos@android.com> Cc: Martijn Coenen <maco@android.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Christian Brauner <christian@brauner.io> Cc: Hridya Valsaraju <hridya@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://lore.kernel.org/r/YEzE6u6U1jkBatmr@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | Merge tag 'iio-fixes-for-5.12a' of ↵Greg Kroah-Hartman2021-03-1510-40/+68
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus Jonathan writes: First set of IIO and counter fixes for the 5.12 cycle adi,ad7949 * Fix a wrong bitmask that could lead to an undefined bit being included. adi,adi-axi-adc * Add missing Kconfig dependencies adi,adis16400 * Wrong error code handling in adis16400 that could lead to failed probe. hid-sensor-humidity, temperature * Fix alignment and space for timestamp channel. hid-sensor-prox * Fix an issue with handling of exponent on the channel scaling. invensense,mpu3050 * Fix a hole in error handling. qcom,spi-vadc * Correct scaling st,ab8500-adc * Fix wrong scaling (by factor of 1000) st,stm32-adc * Add missing HAS_IOMEM dependency st,stm32-timer-cnt * Report count when running off internal clock * Fix issue with not checking ceiling before trying to write to hardware * Ensure driver doesn't have stashed state which doesn't match hardware by rereading from hardware in a slow path. * tag 'iio-fixes-for-5.12a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler iio: hid-sensor-temperature: Fix issues of timestamp channel iio: hid-sensor-humidity: Fix alignment issue of timestamp channel counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register counter: stm32-timer-cnt: fix ceiling write max value counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED iio: adc: ab8500-gpadc: Fix off by 10 to 3 iio:adc:stm32-adc: Add HAS_IOMEM dependency iio: adis16400: Fix an error code in adis16400_initial_setup() iio: adc: adi-axi-adc: add proper Kconfig dependencies iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask iio: hid-sensor-prox: Fix scale not correct issue iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
| | * | | | iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handlerDinghao Liu2021-03-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is one regmap_bulk_read() call in mpu3050_trigger_handler that we have caught its return value bug lack further handling. Check and terminate the execution flow just like the other three regmap_bulk_read() calls in this function. Fixes: 3904b28efb2c7 ("iio: gyro: Add driver for the MPU-3050 gyroscope") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20210301080421.13436-1-dinghao.liu@zju.edu.cn Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
| | * | | | iio: hid-sensor-temperature: Fix issues of timestamp channelYe Xiang2021-03-061-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes 2 issues of timestamp channel: 1. This patch ensures that there is sufficient space and correct alignment for the timestamp. 2. Correct the timestamp channel scan index. Fixes: 59d0f2da3569 ("iio: hid: Add temperature sensor support") Signed-off-by: Ye Xiang <xiang.ye@intel.com> Cc: <Stable@vger.kernel.org> Link: https://lore.kernel.org/r/20210303063615.12130-4-xiang.ye@intel.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
| | * | | | iio: hid-sensor-humidity: Fix alignment issue of timestamp channelYe Xiang2021-03-061-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch ensures that, there is sufficient space and correct alignment for the timestamp. Fixes: d7ed89d5aadf ("iio: hid: Add humidity sensor support") Signed-off-by: Ye Xiang <xiang.ye@intel.com> Cc: <Stable@vger.kernel.org> Link: https://lore.kernel.org/r/20210303063615.12130-2-xiang.ye@intel.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>