summaryrefslogtreecommitdiffstats
path: root/mm/hmm.c (unfollow)
Commit message (Collapse)AuthorFilesLines
2018-07-23mm, madvise_inject_error: Let memory_failure() optionally take a page referenceDan Williams1-3/+13
The madvise_inject_error() routine uses get_user_pages() to lookup the pfn and other information for injected error, but it does not release that pin. The assumption is that failed pages should be taken out of circulation. However, for dax mappings it is not possible to take pages out of circulation since they are 1:1 physically mapped as filesystem blocks, or device-dax capacity. They also typically represent persistent memory which has an error clearing capability. In preparation for adding a special handler for dax mappings, shift the responsibility of taking the page reference to memory_failure(). I.e. drop the page reference and do not specify MF_COUNT_INCREASED to memory_failure(). Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-23mm, dev_pagemap: Do not clear ->mapping on final putDan Williams2-1/+2
MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping is only ever cleared by truncation, not final put. Without this fix dax pages may forget their mapping association at the end of every page pin event. Move this atypical behavior that HMM wants into the HMM ->page_free() callback. Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Fixes: d2c997c0f145 ("fs, dax: use page->mapping...") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pagesDan Williams1-0/+8
Given that dax / device-mapped pages are never subject to page allocations remove them from consideration by the soft-offline mechanism. Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20filesystem-dax: Set page->indexDan Williams1-3/+13
In support of enabling memory_failure() handling for filesystem-dax mappings, set ->index to the pgoff of the page. The rmap implementation requires ->index to bound the search through the vma interval tree. The index is set and cleared at dax_associate_entry() and dax_disassociate_entry() time respectively. Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20device-dax: Set page->indexDan Williams1-0/+4
In support of enabling memory_failure() handling for device-dax mappings, set ->index to the pgoff of the page. The rmap implementation requires ->index to bound the search through the vma interval tree. The ->index value is never cleared. There is no possibility for the page to become associated with another pgoff while the device is enabled. When the device is disabled the 'struct page' array for the device is destroyed and ->index is reinitialized to zero. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20device-dax: Enable page_mapping()Dan Williams1-17/+38
In support of enabling memory_failure() handling for device-dax mappings, set the ->mapping association of pages backing device-dax mappings. The rmap implementation requires page_mapping() to return the address_space hosting the vmas that map the page. The ->mapping pointer is never cleared. There is no possibility for the page to become associated with another address_space while the device is enabled. When the device is disabled the 'struct page' array for the device is destroyed / later reinitialized to zero. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20device-dax: Convert to vmf_insert_mixed and vm_fault_tDan Williams3-19/+16
Use new return type vm_fault_t for fault and huge_fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Commit 1c8f422059ae ("mm: change return type to vm_fault_t") Previously vm_insert_mixed() returned an error code which driver mapped into VM_FAULT_* type. The new function vmf_insert_mixed() will replace this inefficiency by returning VM_FAULT_* type. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-02Linux 4.18-rc3v4.18-rc3Linus Torvalds1-1/+1
2018-06-29parisc: Build kernel without -ffunction-sectionsHelge Deller1-4/+0
As suggested by Nick Piggin it seems we can drop the -ffunction-sections compile flag, now that the kernel uses thin archives. Testing with 32- and 64-bit kernel showed no difference in kernel size. Suggested-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-29sg: remove ->sg_magic memberJens Axboe4-45/+0
This was introduced more than a decade ago when sg chaining was added, but we never really caught anything with it. The scatterlist entry size can be critical, since drivers allocate it, so remove the magic member. Recently it's been triggering allocation stalls and failures in NVMe. Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29drbd: Fix drbd_request_prepare() discard handlingBart Van Assche1-2/+2
Fix the test that verifies whether bio_op(bio) represents a discard or write zeroes operation. Compile-tested only. Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Fixes: 7435e9018f91 ("drbd: zero-out partial unaligned discards on local backend") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29blk-mq: don't queue more if we get a busy returnJens Axboe1-0/+12
Some devices have different queue limits depending on the type of IO. A classic case is SATA NCQ, where some commands can queue, but others cannot. If we have NCQ commands inflight and encounter a non-queueable command, the driver returns busy. Currently we attempt to dispatch more from the scheduler, if we were able to queue some commands. But for the case where we ended up stopping due to BUSY, we should not attempt to retrieve more from the scheduler. If we do, we can get into a situation where we attempt to queue a non-queueable command, get BUSY, then successfully retrieve more commands from that scheduler and queue those. This can repeat forever, starving the non-queuable command indefinitely. Fix this by NOT attempting to pull more commands from the scheduler, if we get a BUSY return. This should also be more optimal in terms of letting requests stay in the scheduler for as long as possible, if we get a BUSY due to the regular out-of-tags condition. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29aio: mark __aio_sigset::sigmask constAvi Kivity1-1/+1
io_pgetevents() will not change the signal mask. Mark it const to make it clear and to reduce the need for casts in user code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [hch: reapply the patch that got incorrectly reverted] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-29net: handle NULL ->poll gracefullyChristoph Hellwig1-0/+2
The big aio poll revert broke various network protocols that don't implement ->poll as a patch in the aio poll serie removed sock_no_poll and made the common code handle this case. Reported-by: syzbot+57727883dbad76db2ef0@syzkaller.appspotmail.com Reported-by: syzbot+cdb0d3176b53d35ad454@syzkaller.appspotmail.com Reported-by: syzbot+2c7e8f74f8b2571c87e8@syzkaller.appspotmail.com Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Fixes: a11e1d432b51 ("Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-29i2c: gpio: initialize SCL to HIGH againWolfram Sang1-2/+2
It seems that during the conversion from gpio* to gpiod*, the initial state of SCL was wrongly switched to LOW. Fix it to be HIGH again. Fixes: 7bb75029ef34 ("i2c: gpio: Enforce open drain through gpiolib") Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2018-06-29i2c: smbus: kill memory leak on emulated and failed DMA SMBus xfersPeter Rosin1-5/+9
If DMA safe memory was allocated, but the subsequent I2C transfer fails the memory is leaked. Plug this leak. Fixes: 8a77821e74d6 ("i2c: smbus: use DMA safe buffers for emulated SMBus transactions") Signed-off-by: Peter Rosin <peda@axentia.se> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2018-06-29i2c: algos: bit: mention our experience about initial statesWolfram Sang1-0/+5
So, if somebody wants to re-implement this in the future, we pinpoint to a problem case. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-06-29Revert "i2c: algo-bit: init the bus to a known state"Wolfram Sang1-5/+0
This reverts commit 3e5f06bed72fe72166a6778f630241a893f67799. As per bugzilla #200045, this caused a regression. I don't really see a way to fix it without having the hardware. So, revert the patch and I will fix the issue I was seeing originally in the i2c-gpio driver itself. I couldn't find new users of this algorithm since, so there should be no one depending on the new behaviour. Reported-by: Sergey Larin <cerg2010cerg2010@mail.ru> Fixes: 3e5f06bed72f ("i2c: algo-bit: init the bus to a known state") Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Tested-by: Sergey Larin <cerg2010cerg2010@mail.ru> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2018-06-29selinux: move user accesses in selinuxfs out of locked regionsJann Horn1-45/+33
If a user is accessing a file in selinuxfs with a pointer to a userspace buffer that is backed by e.g. a userfaultfd, the userspace access can stall indefinitely, which can block fsi->mutex if it is held. For sel_read_policy(), remove the locking, since this method doesn't seem to access anything that requires locking. For sel_read_bool(), move the user access below the locked region. For sel_write_bool() and sel_commit_bools_write(), move the user access up above the locked region. Cc: stable@vger.kernel.org Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> [PM: removed an unused variable in sel_read_policy()] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-06-28parisc: Reduce debug output in unwind codeHelge Deller1-2/+2
Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-28dm: prevent DAX mounts if not supportedRoss Zwisler2-5/+5
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX flag is set on the device's request queue to decide whether or not the device supports filesystem DAX. Really we should be using bdev_dax_supported() like filesystems do at mount time. This performs other tests like checking to make sure the dax_direct_access() path works. We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if any of the underlying devices do not support DAX. This makes the handling of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags in dm_table_set_restrictions(). Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this will ensure that filesystems built upon DM devices will only be able to mount with DAX if all underlying devices also support DAX. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") Cc: stable@vger.kernel.org Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-28dax: check for QUEUE_FLAG_DAX in bdev_dax_supported()Ross Zwisler1-0/+8
Add an explicit check for QUEUE_FLAG_DAX to __bdev_dax_supported(). This is needed for DM configurations where the first element in the dm-linear or dm-stripe target supports DAX, but other elements do not. Without this check __bdev_dax_supported() will pass for such devices, letting a filesystem on that device mount with the DAX option. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Mike Snitzer <snitzer@redhat.com> Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") Cc: stable@vger.kernel.org Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-28pmem: only set QUEUE_FLAG_DAX for fsdax modeRoss Zwisler1-1/+2
QUEUE_FLAG_DAX is an indication that a given block device supports filesystem DAX and should not be set for PMEM namespaces which are in "raw" mode. These namespaces lack struct page and are prevented from participating in filesystem DAX as of commit 569d0365f571 ("dax: require 'struct page' by default for filesystem dax"). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Mike Snitzer <snitzer@redhat.com> Fixes: 569d0365f571 ("dax: require 'struct page' by default for filesystem dax") Cc: stable@vger.kernel.org Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-28proc: add Alexey to MAINTAINERSAlexey Dobriyan1-0/+9
I know I'll regret it. Link: http://lkml.kernel.org/r/20180627194840.GA18113@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28kasan: depend on CONFIG_SLUB_DEBUGJason A. Donenfeld1-0/+1
KASAN depends on having access to some of the accounting that SLUB_DEBUG does; without it, there are immediate crashes [1]. So, the natural thing to do is to make KASAN select SLUB_DEBUG. [1] http://lkml.kernel.org/r/CAHmME9rtoPwxUSnktxzKso14iuVCWT7BE_-_8PAC=pGw1iJnQg@mail.gmail.com Link: http://lkml.kernel.org/r/20180622154623.25388-1-Jason@zx2c4.com Fixes: f9e13c0a5a33 ("slab, slub: skip unnecessary kasan_cache_shutdown()") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28include/linux/dax.h: dax_iomap_fault() returns vm_fault_tSouptick Joarder1-1/+1
Commit 1c8f422059ae ("mm: change return type to vm_fault_t") missed a conversion. It's not a big problem at present because mainline is still using typedef int vm_fault_t; Fixes: 1c8f422059ae ("mm: change return type to vm_fault_t") Link: http://lkml.kernel.org/r/20180620172046.GA27894@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28x86/e820: put !E820_TYPE_RAM regions into memblock.reservedNaoya Horiguchi1-3/+12
There is a kernel panic that is triggered when reading /proc/kpageflags on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]': BUG: unable to handle kernel paging request at fffffffffffffffe PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 RIP: 0010:stable_page_flags+0x27/0x3c0 Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7 RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0 RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001 R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0 R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10 FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0 Call Trace: kpageflags_read+0xc7/0x120 proc_reg_read+0x3c/0x60 __vfs_read+0x36/0x170 vfs_read+0x89/0x130 ksys_pread64+0x71/0x90 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7efc42e75e23 Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24 According to kernel bisection, this problem became visible due to commit f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") which changes how struct pages are initialized. Memblock layout affects the pfn ranges covered by node/zone. Consider that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the default (no memmap= given) memblock layout is like below: MEMBLOCK configuration: memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x4 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0 memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... If you give memmap=1G!4G (so it just covers memory[0x2]), the range [0x100000000-0x13fffffff] is gone: MEMBLOCK configuration: memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x3 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... This causes shrinking node 0's pfn range because it is calculated by the address range of memblock.memory. So some of struct pages in the gap range are left uninitialized. We have a function zero_resv_unavail() which does zeroing the struct pages within the reserved unavailable range (i.e. memblock.memory && !memblock.reserved). This patch utilizes it to cover all unavailable ranges by putting them into memblock.reserved. Link: http://lkml.kernel.org/r/20180615072947.GB23273@hori1.linux.bs1.fc.nec.co.jp Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: "Herton R. Krzesinski" <herton@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28slub: fix failure when we delete and create a slab cacheMikulas Patocka3-1/+14
In kernel 4.17 I removed some code from dm-bufio that did slab cache merging (commit 21bb13276768: "dm bufio: remove code that merges slab caches") - both slab and slub support merging caches with identical attributes, so dm-bufio now just calls kmem_cache_create and relies on implicit merging. This uncovered a bug in the slub subsystem - if we delete a cache and immediatelly create another cache with the same attributes, it fails because of duplicate filename in /sys/kernel/slab/. The slub subsystem offloads freeing the cache to a workqueue - and if we create the new cache before the workqueue runs, it complains because of duplicate filename in sysfs. This patch fixes the bug by moving the call of kobject_del from sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called while we hold slab_mutex - so that the sysfs entry is deleted before a cache with the same attributes could be created. Running device-mapper-test-suite with: dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/ triggered: Buffer I/O error on dev dm-0, logical block 1572848, async page read device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5 device-mapper: thin: 253:1: aborting current metadata transaction sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144' CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25 Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017 Workqueue: dm-thin do_worker [dm_thin_pool] Call Trace: dump_stack+0x5a/0x73 sysfs_warn_dup+0x58/0x70 sysfs_create_dir_ns+0x77/0x80 kobject_add_internal+0xba/0x2e0 kobject_init_and_add+0x70/0xb0 sysfs_slab_add+0xb1/0x250 __kmem_cache_create+0x116/0x150 create_cache+0xd9/0x1f0 kmem_cache_create_usercopy+0x1c1/0x250 kmem_cache_create+0x18/0x20 dm_bufio_client_create+0x1ae/0x410 [dm_bufio] dm_block_manager_create+0x5e/0x90 [dm_persistent_data] __create_persistent_data_objects+0x38/0x940 [dm_thin_pool] dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool] metadata_operation_failed+0x59/0x100 [dm_thin_pool] alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool] process_cell+0x2a3/0x550 [dm_thin_pool] do_worker+0x28d/0x8f0 [dm_thin_pool] process_one_work+0x171/0x370 worker_thread+0x49/0x3f0 kthread+0xf8/0x130 ret_from_fork+0x35/0x40 kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory. kmem_cache_create(dm_bufio_buffer-16) failed with error -17 Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28Revert mm/vmstat.c: fix vmstat_update() preemption BUGSebastian Andrzej Siewior1-2/+0
Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption BUG"). Steven saw a "using smp_processor_id() in preemptible" message and added a preempt_disable() section around it to keep it quiet. This is not the right thing to do it does not fix the real problem. vmstat_update() is invoked by a kworker on a specific CPU. This worker it bound to this CPU. The name of the worker was "kworker/1:1" so it should have been a worker which was bound to CPU1. A worker which can run on any CPU would have a `u' before the first digit. smp_processor_id() can be used in a preempt-enabled region as long as the task is bound to a single CPU which is the case here. If it could run on an arbitrary CPU then this is the problem we have an should seek to resolve. Not only this smp_processor_id() must not be migrated to another CPU but also refresh_cpu_vm_stats() which might access wrong per-CPU variables. Not to mention that other code relies on the fact that such a worker runs on one specific CPU only. Therefore revert that commit and we should look instead what broke the affinity mask of the kworker. Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven J. Hill <steven.hill@cavium.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28lib/percpu_ida.c: don't do alloc from per-CPU list if there is noneSebastian Andrzej Siewior1-1/+1
In commit 804209d8a009 ("lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock") I inlined alloc_local_tag() and mixed up the >= check from percpu_ida_alloc() with the one in alloc_local_tag(). Don't alloc from per-CPU freelist if ->nr_free is zero. Link: http://lkml.kernel.org/r/20180613075830.c3zeva52fuj6fxxv@linutronix.de Fixes: 804209d8a009 ("lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: David Disseldorp <ddiss@suse.de> Tested-by: David Disseldorp <ddiss@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Cc: Shaohua Li <shli@fb.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLLLinus Torvalds80-450/+301
The poll() changes were not well thought out, and completely unexplained. They also caused a huge performance regression, because "->poll()" was no longer a trivial file operation that just called down to the underlying file operations, but instead did at least two indirect calls. Indirect calls are sadly slow now with the Spectre mitigation, but the performance problem could at least be largely mitigated by changing the "->get_poll_head()" operation to just have a per-file-descriptor pointer to the poll head instead. That gets rid of one of the new indirections. But that doesn't fix the new complexity that is completely unwarranted for the regular case. The (undocumented) reason for the poll() changes was some alleged AIO poll race fixing, but we don't make the common case slower and more complex for some uncommon special case, so this all really needs way more explanations and most likely a fundamental redesign. [ This revert is a revert of about 30 different commits, not reverted individually because that would just be unnecessarily messy - Linus ] Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28arm64: dts: hikey960: Define wl1837 power capabilitiesoscardagrach1-0/+2
These properties are required for compatibility with runtime PM. Without these properties, MMC host controller will not be aware of power capabilities. When the wlcore driver attempts to power on the device, it will erroneously fail with -EACCES. This fixes a regression found here: https://lkml.org/lkml/2018/6/12/930 Fixes: 60f36637bbbd ("wlcore: sdio: allow pm to handle sdio power") Signed-off-by: Ryan Grachek <ryan@edited.us> Tested-by: John Stultz <john.stultz@linaro.org> Acked-by: John Stultz <john.stultz@linaro.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Wei Xu <xuwei5@hisilicon.com>
2018-06-28arm64: dts: hikey: Define wl1835 power capabilitiesoscardagrach1-0/+2
These properties are required for compatibility with runtime PM. Without these properties, MMC host controller will not be aware of power capabilities. When the wlcore driver attempts to power on the device, it will erroneously fail with -EACCES. Fixes: 60f36637bbbd ("wlcore: sdio: allow pm to handle sdio power") Signed-off-by: Ryan Grachek <ryan@edited.us> Tested-by: John Stultz <john.stultz@linaro.org> Acked-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Wei Xu <xuwei5@hisilicon.com>
2018-06-28block: Fix cloning of requests with a special payloadBart Van Assche1-0/+4
This patch avoids that removing a path controlled by the dm-mpath driver while mkfs is running triggers the following kernel bug: kernel BUG at block/blk-core.c:3347! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2 RIP: 0010:blk_end_request_all+0x68/0x70 Call Trace: <IRQ> dm_softirq_done+0x326/0x3d0 [dm_mod] blk_done_softirq+0x19b/0x1e0 __do_softirq+0x128/0x60d irq_exit+0x100/0x110 smp_call_function_single_interrupt+0x90/0x330 call_function_single_interrupt+0xf/0x20 </IRQ> Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-28parisc: Wire up io_pgetevents syscallHelge Deller2-1/+3
Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-28parisc: Default to 4 SMP CPUsHelge Deller1-1/+1
I haven't seen any real SMP machine yet with > 4 CPUs (we don't suport SuperDomes yet), so reducing the default maximum number of CPUs may speed up various bitop functions which depend on number of CPUs in the system. bload-o-meter on a typical 64-bit kernel shows: Data: add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-3724 (-3724) Total: Before=1910404, After=1906680, chg -0.19% Code: add/remove: 0/2 grow/shrink: 42/38 up/down: 2320/-3500 (-1180) Total: Before=11053099, After=11051919, chg -0.01% Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-28parisc: Convert printk(KERN_LEVEL) to pr_lvl()Andy Shevchenko1-16/+9
Convert printk(KERN_LEVEL) type of calls to pr_lvl() macros. While here, - convert printk() to pr_info() - join back string literal to be on one line - use %*phN (note, it gives 1 byte more for sake of simplicity) Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-28parisc: Mark 16kB and 64kB page sizes BROKENHelge Deller1-2/+2
A full boot only succeeds with 4kB page sizes currently. For 16kB and 64kB page size support somone needs to fix the LBA PCI code at least, so mark those broken for now. Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-28parisc: Drop struct sigaction from not exported header fileHelge Deller1-8/+0
This header file isn't exported to userspace, so there is no benefit in defining struct sigaction for userspace here. Signed-off-by: Helge Deller <deller@gmx.de>
2018-06-28nvme-rdma: fix possible double free of controller async event bufferSagi Grimberg1-2/+5
If reconnect/reset failed where the controller async event buffer was freed, we might end up freeing it again as we call nvme_rdma_destroy_admin_queue again in the remove path. Given that the sequence is guaranteed to serialize by .ctrl_stop, we simply set ctrl->async_event_sqe.data to NULL and don't free it in future visits. Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-28kconfig: loop boundary condition fixJerry James1-1/+1
If buf[-1] just happens to hold the byte 0x0A, then nread can wrap around to (size_t)-1, leading to invalid memory accesses. This has caused segmentation faults when trying to build the latest kernel snapshots for i686 in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1592374 Signed-off-by: Jerry James <loganjerry@gmail.com> [alexpl@fedoraproject.org: reformatted patch for submission] Signed-off-by: Alexander Ploumistos <alexpl@fedoraproject.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-06-28kbuild: reword help of LD_DEAD_CODE_DATA_ELIMINATIONMasahiro Yamada1-4/+3
Since commit 5d20ee3192a5 ("kbuild: Allow LD_DEAD_CODE_DATA_ELIMINATION to be selectable if enabled"), HAVE_LD_DEAD_CODE_DATA_ELIMINATION is supposed to be selected by architectures that are capable of this functionality. LD_DEAD_CODE_DATA_ELIMINATION is now users' selection. Update the help message. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-06-28kconfig: handle P_SYMBOL in print_symbol()Dirk Gouders2-0/+7
Each symbol has a property of type P_SYMBOL since commit 59e89e3ddf85 (kconfig: save location of config symbols). Handle those properties in print_symbol(). Further, place a pointer to print_symbol() in the comment above the list of known property type. Signed-off-by: Dirk Gouders <dirk@gouders.net> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-06-28vt: prevent leaking uninitialized data to userspace via /dev/vcs*Alexander Potapenko1-2/+2
KMSAN reported an infoleak when reading from /dev/vcs*: BUG: KMSAN: kernel-infoleak in vcs_read+0x18ba/0x1cc0 Call Trace: ... kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1253 copy_to_user ./include/linux/uaccess.h:184 vcs_read+0x18ba/0x1cc0 drivers/tty/vt/vc_screen.c:352 __vfs_read+0x1b2/0x9d0 fs/read_write.c:416 vfs_read+0x36c/0x6b0 fs/read_write.c:452 ... Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315 __kmalloc+0x13a/0x350 mm/slub.c:3818 kmalloc ./include/linux/slab.h:517 vc_allocate+0x438/0x800 drivers/tty/vt/vt.c:787 con_install+0x8c/0x640 drivers/tty/vt/vt.c:2880 tty_driver_install_tty drivers/tty/tty_io.c:1224 tty_init_dev+0x1b5/0x1020 drivers/tty/tty_io.c:1324 tty_open_by_driver drivers/tty/tty_io.c:1959 tty_open+0x17b4/0x2ed0 drivers/tty/tty_io.c:2007 chrdev_open+0xc25/0xd90 fs/char_dev.c:417 do_dentry_open+0xccc/0x1440 fs/open.c:794 vfs_open+0x1b6/0x2f0 fs/open.c:908 ... Bytes 0-79 of 240 are uninitialized Consistently allocating |vc_screenbuf| with kzalloc() fixes the problem Reported-by: syzbot+17a8efdf800000@syzkaller.appspotmail.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-28serdev: fix memleak on module unloadJohan Hovold1-0/+1
Make sure to free all resources associated with the ida on module exit. Fixes: cd6484e1830b ("serdev: Introduce new bus for serial attached devices") Cc: stable <stable@vger.kernel.org> # 4.11 Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-28serial: 8250_pci: Remove stalled entries in blacklistAndy Shevchenko1-2/+0
After the commit 7d8905d06405 ("serial: 8250_pci: Enable device after we check black list") pure serial multi-port cards, such as CH355, got blacklisted and thus not being enumerated anymore. Previously, it seems, blacklisting them was on purpose to shut up pciserial_init_one() about record duplication. So, remove the entries from blacklist in order to get cards enumerated. Fixes: 7d8905d06405 ("serial: 8250_pci: Enable device after we check black list") Reported-by: Matt Turner <mattst88@gmail.com> Cc: Sergej Pupykin <ml@sergej.pp.ru> Cc: Alexandr Petrenko <petrenkoas83@gmail.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-and-Tested-by: Matt Turner <mattst88@gmail.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-28n_tty: Access echo_* variables carefully.Tetsuo Handa1-18/+24
syzbot is reporting stalls at __process_echoes() [1]. This is because since ldata->echo_commit < ldata->echo_tail becomes true for some reason, the discard loop is serving as almost infinite loop. This patch tries to avoid falling into ldata->echo_commit < ldata->echo_tail situation by making access to echo_* variables more carefully. Since reset_buffer_flags() is called without output_lock held, it should not touch echo_* variables. And omit a call to reset_buffer_flags() from n_tty_open() by using vzalloc(). Since add_echo_byte() is called without output_lock held, it needs memory barrier between storing into echo_buf[] and incrementing echo_head counter. echo_buf() needs corresponding memory barrier before reading echo_buf[]. Lack of handling the possibility of not-yet-stored multi-byte operation might be the reason of falling into ldata->echo_commit < ldata->echo_tail situation, for if I do WARN_ON(ldata->echo_commit == tail + 1) prior to echo_buf(ldata, tail + 1), the WARN_ON() fires. Also, explicitly masking with buffer for the former "while" loop, and use ldata->echo_commit > tail for the latter "while" loop. [1] https://syzkaller.appspot.com/bug?id=17f23b094cd80df750e5b0f8982c521ee6bcbf40 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+108696293d7a21ab688f@syzkaller.appspotmail.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-28n_tty: Fix stall at n_tty_receive_char_special().Tetsuo Handa1-5/+8
syzbot is reporting stalls at n_tty_receive_char_special() [1]. This is because comparison is not working as expected since ldata->read_head can change at any moment. Mitigate this by explicitly masking with buffer size when checking condition for "while" loops. [1] https://syzkaller.appspot.com/bug?id=3d7481a346958d9469bebbeb0537d5f056bdd6e8 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+18df353d7540aa6b5467@syzkaller.appspotmail.com> Fixes: bc5a5e3f45d04784 ("n_tty: Don't wrap input buffer indices at buffer size") Cc: stable <stable@vger.kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-28swiotlb: export swiotlb_dma_opsChristoph Hellwig1-0/+1
For architectures that do not use per-device dma ops we need to export the dma_map_ops structure returned from get_arch_dma_ops(). Fixes: 10314e09 ("riscv: add swiotlb support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Andreas Schwab <schwab@suse.de>
2018-06-28Btrfs: fix mount failure when qgroup rescan is in progressFilipe Manana1-3/+10
If a power failure happens while the qgroup rescan kthread is running, the next mount operation will always fail. This is because of a recent regression that makes qgroup_rescan_init() incorrectly return -EINVAL when we are mounting the filesystem (through btrfs_read_qgroup_config()). This causes the -EINVAL error to be returned regardless of any qgroup flags being set instead of returning the error only when neither of the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON are set. A test case for fstests follows up soon. Fixes: 9593bf49675e ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>