| Commit message (Collapse) | Author | Files | Lines |
|
The physical address may exceed 32 bits on 32-bit systems with more than
32 bits of physcial address. Use PFN_PHYS() in devmem_is_allowed(), or
the physical address may overflow and be truncated.
We found this bug when mapping a high addresses through devmem tool,
when CONFIG_STRICT_DEVMEM is enabled on the ARM with ARM_LPAE and devmem
is used to map a high address that is not in the iomem address range, an
unexpected error indicating no permission is returned.
This bug was initially introduced from v2.6.37, and the function was
moved to lib in v5.11.
Link: https://lkml.kernel.org/r/20210731025057.78825-1-wangliang101@huawei.com
Fixes: 087aaffcdf9c ("ARM: implement CONFIG_STRICT_DEVMEM by disabling access to RAM via /dev/mem")
Fixes: 527701eda5f1 ("lib: Add a generic version of devmem_is_allowed()")
Signed-off-by: Liang Wang <wangliang101@huawei.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Liang Wang <wangliang101@huawei.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org> [2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When mod_objcg_state() is called with a pgdat that is different from
that in the obj_stock, the old lruvec data cached in obj_stock are
flushed out. Unfortunately, they were flushed to the new pgdat and so
the data go to the wrong node. This will screw up the slab data
reported in /sys/devices/system/node/node*/meminfo.
Fix that by flushing the data to the cached pgdat instead.
Link: https://lkml.kernel.org/r/20210802143834.30578-1-longman@redhat.com
Fixes: 68ac5b3c8db2 ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Doing some extended tests and polishing the man page update for
MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
madvise() user error.
We want to report only problematic mappings and permission problems that
the user could have know as -EINVAL.
Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
but instead indicate -EFAULT to user space. While we could also convert
it to -ENOMEM, using -EFAULT looks more helpful when user space might
want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
not part of an final Linux release and we can still adjust the behavior.
Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Vijayanand Jitta reports:
Consider the scenario where CONFIG_SLUB_DEBUG_ON is set and we would
want to disable slub_debug for few slabs. Using boot parameter with
slub_debug=-,slab_name syntax doesn't work as expected i.e; only
disabling debugging for the specified list of slabs. Instead it
disables debugging for all slabs, which is wrong.
This patch fixes it by delaying the moment when the global slub_debug
flags variable is updated. In case a "slub_debug=-,slab_name" has been
passed, the global flags remain as initialized (depending on
CONFIG_SLUB_DEBUG_ON enabled or disabled) and are not simply reset to 0.
Link: https://lkml.kernel.org/r/8a3d992a-473a-467b-28a0-4ad2ff60ab82@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reviewed-by: Vijayanand Jitta <vjitta@codeaurora.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The unit test kmalloc_pagealloc_invalid_free makes sure that for the
higher order slub allocation which goes to page allocator, the free is
called with the correct address i.e. the virtual address of the head
page.
Commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free")
unified the free code paths for page allocator based slub allocations
but instead of using the address passed by the caller, it extracted the
address from the page. Thus making the unit test
kmalloc_pagealloc_invalid_free moot. So, fix this by using the address
passed by the caller.
Should we fix this? I think yes because dev expect kasan to catch these
type of programming bugs.
Link: https://lkml.kernel.org/r/20210802180819.1110165-1-shakeelb@google.com
Fixes: f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The address still includes the tags when it is printed. With hardware
tag-based kasan enabled, we will get a false positive KASAN issue when
we access metadata.
Reset the tag before we access the metadata.
Link: https://lkml.kernel.org/r/20210804090957.12393-3-Kuan-Ying.Lee@mediatek.com
Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "kasan, slub: reset tag when printing address", v3.
With hardware tag-based kasan enabled, we reset the tag when we access
metadata to avoid from false alarm.
This patch (of 2):
Kmemleak needs to scan kernel memory to check memory leak. With hardware
tag-based kasan enabled, when it scans on the invalid slab and
dereference, the issue will occur as below.
Hardware tag-based KASAN doesn't use compiler instrumentation, we can not
use kasan_disable_current() to ignore tag check.
Based on the below report, there are 11 0xf7 granules, which amounts to
176 bytes, and the object is allocated from the kmalloc-256 cache. So
when kmemleak accesses the last 256-176 bytes, it causes faults, as those
are marked with KASAN_KMALLOC_REDZONE == KASAN_TAG_INVALID == 0xfe.
Thus, we reset tags before accessing metadata to avoid from false positives.
BUG: KASAN: out-of-bounds in scan_block+0x58/0x170
Read at addr f7ff0000c0074eb0 by task kmemleak/138
Pointer tag: [f7], memory tag: [fe]
CPU: 7 PID: 138 Comm: kmemleak Not tainted 5.14.0-rc2-00001-g8cae8cd89f05-dirty #134
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x1b0
show_stack+0x1c/0x30
dump_stack_lvl+0x68/0x84
print_address_description+0x7c/0x2b4
kasan_report+0x138/0x38c
__do_kernel_fault+0x190/0x1c4
do_tag_check_fault+0x78/0x90
do_mem_abort+0x44/0xb4
el1_abort+0x40/0x60
el1h_64_sync_handler+0xb4/0xd0
el1h_64_sync+0x78/0x7c
scan_block+0x58/0x170
scan_gray_list+0xdc/0x1a0
kmemleak_scan+0x2ac/0x560
kmemleak_scan_thread+0xb0/0xe0
kthread+0x154/0x160
ret_from_fork+0x10/0x18
Allocated by task 0:
kasan_save_stack+0x2c/0x60
__kasan_kmalloc+0xec/0x104
__kmalloc+0x224/0x3c4
__register_sysctl_paths+0x200/0x290
register_sysctl_table+0x2c/0x40
sysctl_init+0x20/0x34
proc_sys_init+0x3c/0x48
proc_root_init+0x80/0x9c
start_kernel+0x648/0x6a4
__primary_switched+0xc0/0xc8
Freed by task 0:
kasan_save_stack+0x2c/0x60
kasan_set_track+0x2c/0x40
kasan_set_free_info+0x44/0x54
____kasan_slab_free.constprop.0+0x150/0x1b0
__kasan_slab_free+0x14/0x20
slab_free_freelist_hook+0xa4/0x1fc
kfree+0x1e8/0x30c
put_fs_context+0x124/0x220
vfs_kern_mount.part.0+0x60/0xd4
kern_mount+0x24/0x4c
bdev_cache_init+0x70/0x9c
vfs_caches_init+0xdc/0xf4
start_kernel+0x638/0x6a4
__primary_switched+0xc0/0xc8
The buggy address belongs to the object at ffff0000c0074e00
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 176 bytes inside of
256-byte region [ffff0000c0074e00, ffff0000c0074f00)
The buggy address belongs to the page:
page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100074
head:(____ptrval____) order:2 compound_mapcount:0 compound_pincount:0
flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff|kasantag=0x0)
raw: 0bfffc0000010200 0000000000000000 dead000000000122 f5ff0000c0002300
raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff0000c0074c00: f0 f0 f0 f0 f0 f0 f0 f0 f0 fe fe fe fe fe fe fe
ffff0000c0074d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
>ffff0000c0074e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 fe fe fe fe fe
^
ffff0000c0074f00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
ffff0000c0075000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Disabling lock debugging due to kernel taint
kmemleak: 181 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
Link: https://lkml.kernel.org/r/20210804090957.12393-1-Kuan-Ying.Lee@mediatek.com
Link: https://lkml.kernel.org/r/20210804090957.12393-2-Kuan-Ying.Lee@mediatek.com
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a race between iterating over requests in
nbd_clear_que() and completing requests in recv_work(),
which can lead to double completion of a request.
To fix it, flush the recv worker before iterating over
the requests and don't abort the completed request
while iterating.
Fixes: 96d97e17828f ("nbd: clear_sock on netlink disconnect")
Reported-by: Jiang Yadong <jiangyadong@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This example is missing a few fixes that are in the liburing version,
synchronize with the upstream version.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We run a test that delete and recover devcies frequently(two devices on
the same host), and we found that 'active_queues' is super big after a
period of time.
If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
is only one queue that are using the tag set. However, if b is still
active, the active_queues of b might never be cleared even if b is
deleted.
Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fixes: 77e89afc25f3 ("PCI/MSI: Protect msi_desc::masked for multi-MSI")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Set the min_level for the TDP iterator at the root level when zapping all
SPTEs to optimize the iterator's try_step_down(). Zapping a non-leaf
SPTE will recursively zap all its children, thus there is no need for the
iterator to attempt to step down. This avoids rereading the top-level
SPTEs after they are zapped by causing try_step_down() to short-circuit.
In most cases, optimizing try_step_down() will be in the noise as the cost
of zapping SPTEs completely dominates the overall time. The optimization
is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
is zapping in response to multiple memslot updates when userspace is adding
and removing read-only memslots for option ROMs. In that case, the task
doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
for read and thus can be a noisy neighbor of sorts.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
really zap all SPTEs in this case. As is, zap_gfn_range() skips non-leaf
SPTEs whose range exceeds the range to be zapped. If shadow_phys_bits is
not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
paging, the "zap all" flows will skip top-level SPTEs whose range extends
beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
Use the current upper bound (based on host.MAXPHYADDR) to detect that the
caller wants to zap all SPTEs, e.g. instead of using the max theoretical
gfn, 1 << (52 - 12). The more precise upper bound allows the TDP iterator
to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
help future debuggers should KVM decide to leak SPTEs again.
The bug is most easily reproduced by running (and unloading!) KVM in a
VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
=============================================================================
BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack_lvl+0x45/0x59
slab_err+0x95/0xc9
__kmem_cache_shutdown.cold+0x3c/0x158
kmem_cache_destroy+0x3d/0xf0
kvm_mmu_module_exit+0xa/0x30 [kvm]
kvm_arch_exit+0x5d/0x90 [kvm]
kvm_exit+0x78/0x90 [kvm]
vmx_exit+0x1a/0x50 [kvm_intel]
__x64_sys_delete_module+0x13f/0x220
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1. The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1. At best, L1 will complain and inject the #PF back into L2. At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.
Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit
a0c134347baf ("KVM: VMX: introduce vmx_need_pf_intercept").
Fixes: 1dbf5d68af6f ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
Cc: stable@vger.kernel.org
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When a nested EPT violation/misconfig is injected into the guest,
the shadow EPT PTEs associated with that address need to be synced.
This is done by kvm_inject_emulated_page_fault() before it calls
nested_ept_inject_page_fault(). However, that will only sync the
shadow EPT PTE associated with the current L1 EPTP. Since the ASID
is based on EP4TA rather than the full EPTP, so syncing the current
EPTP is not enough. The SPTEs associated with any other L1 EPTPs
in the prev_roots cache with the same EP4TA also need to be synced.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20210806222229.1645356-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
hv_vcpu is initialized again a dozen lines below, and at this
point vcpu->arch.hyperv is not valid. Remove the initializer.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove an ancient restriction that disallowed exposing EFER.NX to the
guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
The motivation of the check, added by commit 2cc51560aed0 ("KVM: VMX:
Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
the guest with the host's EFER.NX and thus avoid context switching EFER
if the only divergence was the NX bit.
Fast forward to today, and KVM has long since stopped running the guest
with the host's EFER.NX. Not only does KVM context switch EFER if
host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
guest.EFER.NX=1 when using shadow paging (to emulate SMEP). Furthermore,
the entire motivation for the restriction was made obsolete over a decade
ago when Intel added dedicated host and guest EFER fields in the VMCS
(Nehalem timeframe), which reduced the overhead of context switching EFER
from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
In practice, the removed restriction only affects non-PAE 32-bit kernels,
as EFER.NX is set during boot if NX is supported and the kernel will use
PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
will actually use NX itself (mark PTEs non-executable).
Alternatively and/or complementarily, startup_32_smp() in head_32.S could
be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
the scenario where NX is supported but not enabled. However, that runs
the risk of breaking non-KVM non-PAE kernels (though the risk is very,
very low as there are no known EFER.NX errata), and also eliminates an
easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
across nested virtualization transitions.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210805183804.1221554-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
on the entire filesystem.
Steps to reproduce:
1. mount -t resctrl resctrl /sys/fs/resctrl/
2. cd /sys/fs/resctrl/
3. cat mon_data/mon_L3_00/mbm_total_bytes
23189832
4. Create sub monitor group:
mkdir mon_groups/test1
5. cat mon_data/mon_L3_00/mbm_total_bytes
Unavailable
When a new monitoring group is created, a new RMID is assigned to the
new group. But the RMID is not active yet. When the events are read on
the new RMID, it is expected to report the status as "Unavailable".
When the user reads the events on the default monitoring group with
multiple subgroups, the events on all subgroups are consolidated
together. Currently, if any of the RMID reads report as "Unavailable",
then everything will be reported as "Unavailable".
Fix the issue by discarding the "Unavailable" reads and reporting all
the successful RMID reads. This is not a problem on Intel systems as
Intel reports 0 on Inactive RMIDs.
Fixes: d89b7379015f ("x86/intel_rdt/cqm: Add mon_data")
Reported-by: Paweł Szulik <pawel.szulik@intel.com>
Signed-off-by: Babu Moger <Babu.Moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311
Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
|
|
There's a potential deadlock case when remove the vsock device or
process the RESET event:
vsock_for_each_connected_socket:
spin_lock_bh(&vsock_table_lock) ----------- (1)
...
virtio_vsock_reset_sock:
lock_sock(sk) --------------------- (2)
...
spin_unlock_bh(&vsock_table_lock)
lock_sock() may do initiative schedule when the 'sk' is owned by
other thread at the same time, we would receivce a warning message
that "scheduling while atomic".
Even worse, if the next task (selected by the scheduler) try to
release a 'sk', it need to request vsock_table_lock and the deadlock
occur, cause the system into softlockup state.
Call trace:
queued_spin_lock_slowpath
vsock_remove_bound
vsock_remove_sock
virtio_transport_release
__vsock_release
vsock_release
__sock_release
sock_close
__fput
____fput
So we should not require sk_lock in this case, just like the behavior
in vhost_vsock or vmci.
Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210812053056.1699-1-longpeng2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make wwan_create_dev() to return either valid or error pointer,
In some cases it may return NULL. Prevent this by converting
it to the respective error pointer.
Fixes: 9a44c1cc6388 ("net: Add a WWAN subsystem")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Link: https://lore.kernel.org/r/20210811124845.10955-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
During unlink/rename/lease break, deferred work for close is
scheduled immediately but in an asynchronous manner which might
lead to race with actual(unlink/rename) commands.
This change will schedule close synchronously which will avoid
the race conditions with other commands.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # 5.13
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When rename is executed on directory which has files for which
close is deferred, then rename will fail with EACCES.
This patch will try to close all deferred files when EACCES is received
and retry rename on a directory.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Cc: stable@vger.kernel.org # 5.13
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There is a TOCTOU issue in set_evtchn_to_irq. Rows in the evtchn_to_irq
mapping are lazily allocated in this function. The check whether the row
is already present and the row initialization is not synchronized. Two
threads can at the same time allocate a new row for evtchn_to_irq and
add the irq mapping to the their newly allocated row. One thread will
overwrite what the other has set for evtchn_to_irq[row] and therefore
the irq mapping is lost. This will trigger a BUG_ON later in
bind_evtchn_to_cpu:
INFO: pci 0000:1a:15.4: [1d0f:8061] type 00 class 0x010802
INFO: nvme 0000:1a:12.1: enabling device (0000 -> 0002)
INFO: nvme nvme77: 1/0/0 default/read/poll queues
CRIT: kernel BUG at drivers/xen/events/events_base.c:427!
WARN: invalid opcode: 0000 [#1] SMP NOPTI
WARN: Workqueue: nvme-reset-wq nvme_reset_work [nvme]
WARN: RIP: e030:bind_evtchn_to_cpu+0xc2/0xd0
WARN: Call Trace:
WARN: set_affinity_irq+0x121/0x150
WARN: irq_do_set_affinity+0x37/0xe0
WARN: irq_setup_affinity+0xf6/0x170
WARN: irq_startup+0x64/0xe0
WARN: __setup_irq+0x69e/0x740
WARN: ? request_threaded_irq+0xad/0x160
WARN: request_threaded_irq+0xf5/0x160
WARN: ? nvme_timeout+0x2f0/0x2f0 [nvme]
WARN: pci_request_irq+0xa9/0xf0
WARN: ? pci_alloc_irq_vectors_affinity+0xbb/0x130
WARN: queue_request_irq+0x4c/0x70 [nvme]
WARN: nvme_reset_work+0x82d/0x1550 [nvme]
WARN: ? check_preempt_wakeup+0x14f/0x230
WARN: ? check_preempt_curr+0x29/0x80
WARN: ? nvme_irq_check+0x30/0x30 [nvme]
WARN: process_one_work+0x18e/0x3c0
WARN: worker_thread+0x30/0x3a0
WARN: ? process_one_work+0x3c0/0x3c0
WARN: kthread+0x113/0x130
WARN: ? kthread_park+0x90/0x90
WARN: ret_from_fork+0x3a/0x50
This patch sets evtchn_to_irq rows via a cmpxchg operation so that they
will be set only once. The row is now cleared before writing it to
evtchn_to_irq in order to not create a race once the row is visible for
other threads.
While at it, do not require the page to be zeroed, because it will be
overwritten with -1's in clear_evtchn_to_irq_row anyway.
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Fixes: d0b075ffeede ("xen/events: Refactor evtchn_to_irq array to be dynamically allocated")
Link: https://lore.kernel.org/r/20210812130930.127134-1-mheyne@amazon.de
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
Skip (omit) any version string info that is parenthesized.
Warning: objdump version 15) is older than 2.19
Warning: Skipping posttest.
where 'objdump -v' says:
GNU objdump (GNU Binutils; SUSE Linux Enterprise 15) 2.35.1.20201123-7.18
Fixes: 8bee738bb1979 ("x86: Fix objdump version check in chkobjdump.awk for different formats.")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210731000146.2720-1-rdunlap@infradead.org
|
|
The current comment states that we check if the 64-bit kernel mapping
overlaps with the last 4K of the address space that is reserved to
error values in create_kernel_page_table, which is not the case since it
is done in setup_vm. But anyway, remove the reference to any function
and simply note that in 64-bit kernel, the check should be done as soon
as the kernel mapping base address is known.
Fixes: db6b84a368b4 ("riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE")
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
The RISC-V special option '-mno-relax' which to disable linker relaxations
is supported by GCC8+. For GCC7 and lower versions do not support this
option.
Fixes: fba8a8674f68 ("RISC-V: Add kexec support")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
runtime. Today, the IPI is not created for such nodes, and hot-plugged
CPUs use a bogus IPI, which leads to soft lockups.
We can not directly allocate and request the IPI on demand because
bringup_up() is called under the IRQ sparse lock. The alternative is
to allocate the IPIs for all possible nodes at startup and to request
the mapping on demand when the first CPU of a node is brought up.
Fixes: 7dcc37b3eff9 ("powerpc/xive: Map one IPI interrupt per node")
Cc: stable@vger.kernel.org # v5.13
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210807072057.184698-1-clg@kaod.org
|
|
single_step_exception() is called by emulate_single_step() which
is called from (at least) alignment exception() handler and
program_check_exception() handler.
Redefine it as a regular __single_step_exception() which is called
by both single_step_exception() handler and emulate_single_step()
function.
Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/aed174f5cbc06f2cf95233c071d8aac948e46043.1628611921.git.christophe.leroy@csgroup.eu
|
|
An interrupt handler shall not be called from another interrupt
handler otherwise this leads to problems like the following:
Kernel attempted to write user page (afd4fa84) - exploit attempt? (uid: 1000)
------------[ cut here ]------------
Bug: Write fault blocked by KUAP!
WARNING: CPU: 0 PID: 1617 at arch/powerpc/mm/fault.c:230 do_page_fault+0x484/0x720
Modules linked in:
CPU: 0 PID: 1617 Comm: sshd Tainted: G W 5.13.0-pmac-00010-g8393422eb77 #7
NIP: c001b77c LR: c001b77c CTR: 00000000
REGS: cb9e5bc0 TRAP: 0700 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 00021032 <ME,IR,DR,RI> CR: 24942424 XER: 00000000
GPR00: c001b77c cb9e5c80 c1582c00 00000021 3ffffbff 085b0000 00000027 c8eb644c
GPR08: 00000023 00000000 00000000 00000000 24942424 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
GPR24: 00000040 afd4fa96 00000040 02000000 c1fda6c0 afd4fa84 00000300 cb9e5cc0
NIP [c001b77c] do_page_fault+0x484/0x720
LR [c001b77c] do_page_fault+0x484/0x720
Call Trace:
[cb9e5c80] [c001b77c] do_page_fault+0x484/0x720 (unreliable)
[cb9e5cb0] [c000424c] DataAccess_virt+0xd4/0xe4
--- interrupt: 300 at __copy_tofrom_user+0x110/0x20c
NIP: c001f9b4 LR: c03250a0 CTR: 00000004
REGS: cb9e5cc0 TRAP: 0300 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 48028468 XER: 20000000
DAR: afd4fa84 DSISR: 0a000000
GPR00: 20726f6f cb9e5d80 c1582c00 00000004 cb9e5e3a 00000016 afd4fa80 00000000
GPR08: 3835202d 72777872 2d78722d 00000004 28028464 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
GPR24: 00000040 afd4fa96 00000040 cb9e5e0c 00000daa a0000000 cb9e5e98 afd4fa56
NIP [c001f9b4] __copy_tofrom_user+0x110/0x20c
LR [c03250a0] _copy_to_iter+0x144/0x990
--- interrupt: 300
[cb9e5d80] [c03e89c0] n_tty_read+0xa4/0x598 (unreliable)
[cb9e5df0] [c03e2a0c] tty_read+0xdc/0x2b4
[cb9e5e80] [c0156bf8] vfs_read+0x274/0x340
[cb9e5f00] [c01571ac] ksys_read+0x70/0x118
[cb9e5f30] [c0016048] ret_from_syscall+0x0/0x28
--- interrupt: c00 at 0xa7855c88
NIP: a7855c88 LR: a7855c5c CTR: 00000000
REGS: cb9e5f40 TRAP: 0c00 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 2402446c XER: 00000000
GPR00: 00000003 afd4ec70 a72137d0 0000000b afd4ecac 00004000 0065a990 00000800
GPR08: 00000000 a7947930 00000000 00000004 c15831b0 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 0065a9e0 00000001 0065fac0
GPR24: 00000000 00000089 00664050 00000000 00668e30 a720c8dc a7943ff4 0065f9b0
NIP [a7855c88] 0xa7855c88
LR [a7855c5c] 0xa7855c5c
--- interrupt: c00
Instruction dump:
3884aa88 38630178 48076861 807f0080 48042e45 2f830000 419e0148 3c80c079
3c60c076 38841be4 386301c0 4801f705 <0fe00000> 3860000b 4bfffe30 3c80c06b
---[ end trace fd69b91a8046c2e5 ]---
Here the problem is that by re-enterring an exception handler,
kuap_save_and_lock() is called a second time with this time KUAP
access locked, leading to regs->kuap being overwritten hence
KUAP not being unlocked at exception exit as expected.
Do not call do_IRQ() from timer_interrupt() directly. Instead,
redefine do_IRQ() as a standard function named __do_IRQ(), and
call it from both do_IRQ() and time_interrupt() handlers.
Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c17d234f4927d39a1d7100864a8e1145323d33a0.1628611927.git.christophe.leroy@csgroup.eu
|
|
We still have quite a bit more work to do with overall reworking of
the ttm-based dg1 code, but the uapi stuff is now finalized with the
latest pull. So remove that.
This also fixes kerneldoc build warnings because we've included the
same headers in two places, resulting in sphinx complaining about
duplicated symbols. This regression has been created when we moved the
uapi definitions to the real include/uapi/ folder in 727ecd99a4c9
("drm/doc/rfc: drop the i915_gem_lmem.h header")
v2: Fix a few references that I missed, the htmldocs build took
forever.
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Tested-by Stephen Rothwell <sfr@canb.auug.org.au> (v1)
References: https://lore.kernel.org/dri-devel/20210603193242.1ce99344@canb.auug.org.au/
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 727ecd99a4c9 ("drm/doc/rfc: drop the i915_gem_lmem.h header")
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210810142748.1983271-1-daniel.vetter@ffwll.ch
(cherry picked from commit dae2d28832968751f7731336b560a4a84a197b76)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
The SFC_DONE register lives within the corresponding VD0/VD2/VD4/VD6
forcewake domain and is not accessible if the vdbox in that domain is
fused off and the forcewake is not initialized.
This mistake went unnoticed because until recently we were using the
wrong register offset for the SFC_DONE register; once the register
offset was corrected, we started hitting errors like
<4> [544.989065] i915 0000:cc:00.0: Uninitialized forcewake domain(s) 0x80 accessed at 0x1ce000
on parts with fused-off vdbox engines.
Fixes: e50dbdbfd9fb ("drm/i915/tgl: Add SFC instdone to error state")
Fixes: 9c9c6d0ab08a ("drm/i915: Correct SFC_DONE register offset")
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210806174130.1058960-1-matthew.d.roper@intel.com
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
(cherry picked from commit c5589bb5dccb0c5cb74910da93663f489589f3ce)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Changed Fixes tag to match the cherry-picked 82929a2140eb]
|
|
Till DISPLAY12 the PIPE_MISC bits 5-7 are used to set the
Dithering BPC, with valid values of 6, 8, 10 BPC.
For ADLP+ these bits are used to set the PORT OUTPUT BPC, with valid
values of: 6, 8, 10, 12 BPC, and need to be programmed whether
dithering is enabled or not.
This patch:
-corrects the bits 5-7 for PIPE MISC register for 12 BPC.
-renames the bits and mask to have generic names for these bits for
dithering bpc and port output bpc.
v3: Added a note for MIPI DSI which uses the PIPE_MISC for readout
for pipe_bpp. (Uma Shankar)
v2: Added 'display' to the subject and fixes tag. (Uma Shankar)
Fixes: 756f85cffef2 ("drm/i915/bdw: Broadwell has PIPEMISC")
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> (v1)
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210811051857.109723-1-ankit.k.nautiyal@intel.com
(cherry picked from commit 70418a68713c13da3f36c388087d0220b456a430)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
The call to sja1105_mdiobus_unregister is present in the error path but
absent from the main driver unbind path.
Fixes: 5a8f09748ee7 ("net: dsa: sja1105: register the MDIO buses for 100base-T1 and 100base-TX")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IRQs are getting armed on shutdown causing the system to immediately
wake back up.
Link: https://lkml.org/lkml/2021/8/2/1114
Reported-by: nix.or.die@googlemail.com
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
CC: Raul E Rangel <rrangel@chromium.org>
Fixes: d62bd5ce12d7 ("pinctrl: amd: Implement irq_set_wake")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20210809201513.12367-1-mario.limonciello@amd.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
This reverts commit 0efea3c649f0 because of:
- The returning -ENOBUF error is fine on socket buffer allocation.
- There is side effect in the calling path
tipc_node_xmit()->tipc_link_xmit() when checking error code returning.
Fixes: 0efea3c649f0 ("tipc: Return the correct errno code")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ocelot driver makes use of regmap, wrapping it with driver specific
operations that are thin wrappers around the core regmap APIs. These are
exported with EXPORT_SYMBOL, dropping the _GPL from the core regmap
exports which is frowned upon. Add _GPL suffixes to at least the APIs that
are doing register I/O.
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some arches support cmpxchg() on 4-byte and 8-byte only.
Increase mr_ifc_count width to 32bit to fix this problem.
Fixes: 4a2b285e7e10 ("net: igmp: fix data-race in igmp_ifc_timer_expire()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210811195715.3684218-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I'm still going to continue looking after the Renesas Ethernet drivers and
device tree bindings. Now my new employer, Open Mobile Platform (OMP), will
pay for all my upstream work. Let's switch to my OMP email for the reviews.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/9c212711-a0d7-39cd-7840-ff7abf938da1@omp.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.
The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:
!before(rs->prior_delivered, bbr->next_rtt_delivered)
and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:
bbr->round_start = 1;
This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.
This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.
This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.
Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Drivers such as sja1105 and stmmac that call xpcs_create() expects an
error returned by the pcs-xpcs module, but this was not the case on
failed to allocate memory.
Fixed this by returning an -ENOMEM instead of a NULL pointer.
Fixes: 3ad1d171548e ("net: dsa: sja1105: migrate to xpcs for SGMII")
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210810085812.1808466-1-vee.khee.wong@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).
It happens that the following happens:
- the network driver (e1000e here) prepares to suspend, calls e1000e_down()
which calls netif_carrier_off() to signal that the link is going down.
- netif_carrier_off() adds a link_watch event to the list of events for
this device
- the device is completely stopped.
- the machine suspends
- the cable is unplugged and the machine brought to another location
- the machine is resumed
- the queued linkwatch events are processed for the device
- the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
are silently dropped
- the device is resumed with its link down
- the upper VLAN and bond interfaces are never notified that the link had
been turned down and remain up
- the only way to provoke a change is to physically connect the machine
to a port and possibly unplug it.
The state after resume looks like this:
$ ip -br li | egrep 'bond|eth'
bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.
The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.
Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.
Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
Link: https://lists.openwall.net/netdev/2018/03/15/62
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
WARN_ON_ONCE() call"). The added cgroup support has the following issues:
* It breaks cgroup interface file format rule by adding custom elements to a
nested key-value file.
* It registers mq-deadline as a cgroup-aware policy even though all it's
doing is collecting per-cgroup stats. Even if we need these stats, this
isn't the right way to add them.
* It hasn't been reviewed from cgroup side.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A recent change in LLVM causes module_{c,d}tor sections to appear when
CONFIG_K{A,C}SAN are enabled, which results in orphan section warnings
because these are not handled anywhere:
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_ctor) is being placed in '.text.asan.module_ctor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_dtor) is being placed in '.text.asan.module_dtor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.tsan.module_ctor) is being placed in '.text.tsan.module_ctor'
Fangrui explains: "the function asan.module_ctor has the SHF_GNU_RETAIN
flag, so it is in a separate section even with -fno-function-sections
(default)".
Place them in the TEXT_TEXT section so that these technologies continue
to work with the newer compiler versions. All of the KASAN and KCSAN
KUnit tests continue to pass after this change.
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/1432
Link: https://github.com/llvm/llvm-project/commit/7b789562244ee941b7bf2cefeb3fc08a59a01865
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210731023107.1932981-1-nathan@kernel.org
|
|
Use "fallthrough;" to address:
tools/testing/nvdimm/test/nfit.c: In function ‘nd_intel_test_finish_query’:
tools/testing/nvdimm/test/nfit.c:436:37: warning: this statement may
fall through [-Wimplicit-fallthrough=]
436 | fw->missed_activate = false;
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~
tools/testing/nvdimm/test/nfit.c:438:9: note: here
438 | case FW_STATE_UPDATED:
| ^~~~
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/162767522046.3313209.14767278726893995797.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
There are a few scenarios where init_active_labels() can return without
registering deactivate_labels() to run when the region is disabled. In
particular label error injection creates scenarios where a DIMM is
disabled, but labels on other DIMMs in the region become activated.
Arrange for init_active_labels() to always register deactivate_labels().
Reported-by: Krzysztof Kensicki <krzysztof.kensicki@intel.com>
Cc: <stable@vger.kernel.org>
Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation.")
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/162766356450.3223041.1183118139023841447.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Fix the NFIT parsing code to treat a 0 index in a SPA Range Structure as
a special case and not match Region Mapping Structures that use 0 to
indicate that they are not mapped. Without this fix some platform BIOS
descriptions of "virtual disk" ranges do not result in the pmem driver
attaching to the range.
Details:
In addition to typical persistent memory ranges, the ACPI NFIT may also
convey "virtual" ranges. These ranges are indicated by a UUID in the SPA
Range Structure of UUID_VOLATILE_VIRTUAL_DISK, UUID_VOLATILE_VIRTUAL_CD,
UUID_PERSISTENT_VIRTUAL_DISK, or UUID_PERSISTENT_VIRTUAL_CD. The
critical difference between virtual ranges and UUID_PERSISTENT_MEMORY,
is that virtual do not support associations with Region Mapping
Structures. For this reason the "index" value of virtual SPA Range
Structures is allowed to be 0. If a platform BIOS decides to represent
NVDIMMs with disconnected "Region Mapping Structures" (range-index ==
0), the kernel may falsely associate them with standalone ranges where
the "SPA Range Structure Index" is also zero. When this happens the
driver may falsely require labels where "virtual disks" are expected to
be label-less. I.e. "label-less" is where the namespace-range ==
region-range and the pmem driver attaches with no user action to create
a namespace.
Cc: Jacek Zloch <jacek.zloch@intel.com>
Cc: Lukasz Sobieraj <lukasz.sobieraj@intel.com>
Cc: "Lee, Chun-Yi" <jlee@suse.com>
Cc: <stable@vger.kernel.org>
Fixes: c2f32acdf848 ("acpi, nfit: treat virtual ramdisk SPA as pmem region")
Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com>
Reported-by: Damian Bassa <damian.bassa@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/162870796589.2521182.1240403310175570220.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
The desired behavior is to set the caller's filter count to thread's.
This value is reported via /proc, so this fixes the inaccurate count
exposed to userspace; it is not used for reference counting, etc.
Signed-off-by: Hsuan-Chi Kuo <hsuanchikuo@gmail.com>
Link: https://lore.kernel.org/r/20210304233708.420597-1-hsuanchikuo@gmail.com
Co-developed-by: Wiktor Garbacz <wiktorg@google.com>
Signed-off-by: Wiktor Garbacz <wiktorg@google.com>
Link: https://lore.kernel.org/lkml/20210810125158.329849-1-wiktorg@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Fixes: c818c03b661c ("seccomp: Report number of loaded filters in /proc/$pid/status")
|
|
In k210_fpioa_probe(), add missing calls to clk_disable_unprepare() in
case of error after cenabling the clk and pclk clocks. Also add missing
error handling when enabling pclk.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: d4c34d09ab03 ("pinctrl: Add RISC-V Canaan Kendryte K210 FPIOA driver")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210806004311.52859-1-damien.lemoal@wdc.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
Append i2c-sysfs to toctree in order to get rid of building warnings.
Fixes: 31df7195b100 ("Documentation: i2c: Add doc for I2C sysfs")
Signed-off-by: Hu Haowen <src.res@email.cn>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
|