summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/bootflag.c (unfollow)
Commit message (Collapse)AuthorFilesLines
2021-03-25x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declarationWei Yongjun1-1/+1
Address this GCC warning: arch/x86/kernel/kprobes/core.c:940:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] 940 | static int nokprobe_inline kprobe_is_ss(struct kprobe_ctlblk *kcb) | ^~~~~~ [ mingo: Tidied up the changelog. ] Fixes: 6256e668b7af: ("x86/kprobes: Use int3 instead of debug trap for single-step") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20210324144502.1154883-1-weiyongjun1@huawei.com
2021-03-25x86/kprobes: Fix to identify indirect jmp and others using range caseMasami Hiramatsu1-24/+20
Fix can_boost() to identify indirect jmp and others using range case correctly. Since the condition in switch statement is opcode & 0xf0, it can not evaluate to 0xff case. This should be under the 0xf0 case. However, there is no reason to use the conbinations of the bit-masked condition and lower bit checking. Use range case to clean up the switch statement too. Fixes: 6256e668b7 ("x86/kprobes: Use int3 instead of debug trap for single-step") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/161666692308.1120877.4675552834049546493.stgit@devnote2
2021-03-25x86/kprobes: Fix to check non boostable prefixes correctlyMasami Hiramatsu1-5/+12
There are 2 bugs in the can_boost() function because of using x86 insn decoder. Since the insn->opcode never has a prefix byte, it can not find CS override prefix in it. And the insn->attr is the attribute of the opcode, thus inat_is_address_size_prefix( insn->attr) always returns false. Fix those by checking each prefix bytes with for_each_insn_prefix loop and getting the correct attribute for each prefix byte. Also, this removes unlikely, because this is a slow path. Fixes: a8d11cd0714f ("kprobes/x86: Consolidate insn decoder users for copying code") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/161666691162.1120877.2808435205294352583.stgit@devnote2
2021-03-23x86/kprobes: Use int3 instead of debug trap for single-stepMasami Hiramatsu3-188/+353
Use int3 instead of debug trap exception for single-stepping the probed instructions. Some instructions which change the ip registers or modify IF flags are emulated because those are not able to be single-stepped by int3 or may allow the interrupt while single-stepping. This actually changes the kprobes behavior. - kprobes can not probe following instructions; int3, iret, far jmp/call which get absolute address as immediate, indirect far jmp/call, indirect near jmp/call with addressing by memory (register-based indirect jmp/call are OK), and vmcall/vmlaunch/vmresume/vmxoff. - If the kprobe post_handler doesn't set before registering, it may not be called in some case even if you set it afterwards. (IOW, kprobe booster is enabled at registration, user can not change it) But both are rare issue, unsupported instructions will not be used in the kernel (or rarely used), and post_handlers are rarely used (I don't see it except for the test code). Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/161469874601.49483.11985325887166921076.stgit@devnote2
2021-03-23x86/kprobes: Identify far indirect JMP correctlyMasami Hiramatsu1-2/+1
Since Grp5 far indirect JMP is FF "mod 101 r/m", it should be (modrm & 0x38) == 0x28, and near indirect JMP is also 0x38 == 0x20. So we can mask modrm with 0x30 and check 0x20. This is actually what the original code does, it also doesn't care the last bit. So the result code is same. Thus, I think this is just a cosmetic cleanup. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/161469873475.49483.13257083019966335137.stgit@devnote2
2021-03-23x86/kprobes: Retrieve correct opcode for group instructionMasami Hiramatsu1-1/+5
Since the opcodes start from 0xff are group5 instruction group which is not 2 bytes opcode but the extended opcode determined by the MOD/RM byte. The commit abd82e533d88 ("x86/kprobes: Do not decode opcode in resume_execution()") used insn->opcode.bytes[1], but that is not correct. We have to refer the insn->modrm.bytes[1] instead. Fixes: abd82e533d88 ("x86/kprobes: Do not decode opcode in resume_execution()") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/161469872400.49483.18214724458034233166.stgit@devnote2
2021-03-17tools/insn: Restore the relative include paths for cross buildingBorislav Petkov1-3/+3
Building perf on ppc causes: In file included from util/intel-pt-decoder/intel-pt-insn-decoder.c:15: util/intel-pt-decoder/../../../arch/x86/lib/insn.c:14:10: fatal error: asm/inat.h: No such file or directory 14 | #include <asm/inat.h> /*__ignore_sync_check__ */ | ^~~~~~~~~~~~ Restore the relative include paths so that the compiler can find the headers. Fixes: 93281c4a9657 ("x86/insn: Add an insn_decode() API") Reported-by: Ian Rogers <irogers@google.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Ian Rogers <irogers@google.com> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lkml.kernel.org/r/20210317150858.02b1bbc8@canb.auug.org.au
2021-03-15x86/insn: Make insn_complete() staticBorislav Petkov4-14/+14
... and move it above the only place it is used. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-22-bp@alien8.de
2021-03-15x86/insn: Remove kernel_insn_init()Borislav Petkov2-22/+0
Now that it is not needed anymore, drop it. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-21-bp@alien8.de
2021-03-15tools/perf: Convert to insn_decode()Borislav Petkov3-16/+19
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lkml.kernel.org/r/20210304174237.31945-20-bp@alien8.de
2021-03-15x86/tools/insn_sanity: Convert to insn_decode()Borislav Petkov1-4/+4
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-19-bp@alien8.de
2021-03-15tools/objtool: Convert to insn_decode()Borislav Petkov1-5/+4
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-18-bp@alien8.de
2021-03-15x86/tools/insn_decoder_test: Convert to insn_decode()Borislav Petkov1-4/+6
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-17-bp@alien8.de
2021-03-15x86/uprobes: Convert to insn_decode()Borislav Petkov1-4/+4
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-16-bp@alien8.de
2021-03-15x86/traps: Convert to insn_decode()Borislav Petkov1-3/+4
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-15-bp@alien8.de
2021-03-15x86/sev-es: Convert to insn_decode()Borislav Petkov1-12/+10
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-14-bp@alien8.de
2021-03-15x86/sev-es: Split vc_decode_insn()Borislav Petkov1-21/+38
Split it into two helpers - a user- and a kernel-mode one for readability. Yes, the original function body is not that convoluted but splitting it makes following through that code trivial than having to pay attention to each little difference when in user or in kernel mode. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-13-bp@alien8.de
2021-03-15x86/kprobes: Convert to insn_decode()Borislav Petkov2-8/+18
Simplify code, improve decoding error checking. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20210304174237.31945-12-bp@alien8.de
2021-03-15x86/mce: Convert to insn_decode()Borislav Petkov1-8/+4
Simplify code, no functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-11-bp@alien8.de
2021-03-15x86/alternative: Use insn_decode()Borislav Petkov1-3/+3
No functional changes, just simplification. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-10-bp@alien8.de
2021-03-15perf/x86/intel/ds: Check return values of insn decoder functionsBorislav Petkov1-5/+5
branch_type() doesn't need to call the full insn_decode() because it doesn't need it in all cases thus leave the calls separate. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-9-bp@alien8.de
2021-03-15perf/x86/intel/ds: Check insn_get_length() retvalBorislav Petkov1-6/+5
intel_pmu_pebs_fixup_ip() needs only the insn length so use the appropriate helper instead of a full decode. A full decode differs only in running insn_complete() on the decoded insn but that is not needed here. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-8-bp@alien8.de
2021-03-15x86/boot/compressed/sev-es: Convert to insn_decode()Borislav Petkov1-6/+5
Other than simplifying the code there should be no functional changes resulting from this. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-7-bp@alien8.de
2021-03-15x86/insn-eval: Handle return values from the decoderBorislav Petkov1-13/+21
Now that the different instruction-inspecting functions return a value, test that and return early from callers if error has been encountered. While at it, do not call insn_get_modrm() when calling insn_get_displacement() because latter will make sure to call insn_get_modrm() if ModRM hasn't been parsed yet. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-6-bp@alien8.de
2021-03-15x86/insn: Add an insn_decode() APIBorislav Petkov5-107/+452
Users of the instruction decoder should use this to decode instruction bytes. For that, have insn*() helpers return an int value to denote success/failure. When there's an error fetching the next insn byte and the insn falls short, return -ENODATA to denote that. While at it, make insn_get_opcode() more stricter as to whether what has seen so far is a valid insn and if not. Copy linux/kconfig.h for the tools-version of the decoder so that it can use IS_ENABLED(). Also, cast the INSN_MODE_KERN dummy define value to (enum insn_mode) for tools use of the decoder because perf tool builds with -Werror and errors out with -Werror=sign-compare otherwise. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20210304174237.31945-5-bp@alien8.de
2021-03-15x86/insn: Add a __ignore_sync_check__ markerBorislav Petkov10-20/+36
Add an explicit __ignore_sync_check__ marker which will be used to mark lines which are supposed to be ignored by file synchronization check scripts, its advantage being that it explicitly denotes such lines in the code. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20210304174237.31945-4-bp@alien8.de
2021-03-15x86/insn: Add @buf_len param to insn_init() kernel-doc commentBorislav Petkov2-0/+2
It wasn't documented so add it. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20210304174237.31945-3-bp@alien8.de
2021-03-15x86/insn: Rename insn_decode() to insn_decode_from_regs()Borislav Petkov4-7/+7
Rename insn_decode() to insn_decode_from_regs() to denote that it receives regs as param and uses registers from there during decoding. Free the former name for a more generic version of the function. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-2-bp@alien8.de
2021-03-14Linux 5.12-rc3v5.12-rc3Linus Torvalds1-1/+1
2021-03-14prctl: fix PR_SET_MM_AUXV kernel stack leakAlexey Dobriyan1-1/+1
Doing a prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1); will copy 1 byte from userspace to (quite big) on-stack array and then stash everything to mm->saved_auxv. AT_NULL terminator will be inserted at the very end. /proc/*/auxv handler will find that AT_NULL terminator and copy original stack contents to userspace. This devious scheme requires CAP_SYS_RESOURCE. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13zram: fix broken page writebackMinchan Kim1-3/+3
commit 0d8359620d9b ("zram: support page writeback") introduced two problems. It overwrites writeback_store's return value as kstrtol's return value, which makes return value zero so user could see zero as return value of write syscall even though it wrote data successfully. It also breaks index value in the loop in that it doesn't increase the index any longer. It means it can write only first starting block index so user couldn't write all idle pages in the zram so lose memory saving chance. This patch fixes those issues. Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org Fixes: 0d8359620d9b("zram: support page writeback") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Amos Bianchi <amosbianchi@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: John Dias <joaodias@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13zram: fix return value on writeback_storeMinchan Kim1-3/+8
writeback_store's return value is overwritten by submit_bio_wait's return value. Thus, writeback_store will return zero since there was no IO error. In the end, write syscall from userspace will see the zero as return value, which could make the process stall to keep trying the write until it will succeed. Link: https://lkml.kernel.org/r/20210312173949.2197662-1-minchan@kernel.org Fixes: 3b82a051c101("drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store") Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: John Dias <joaodias@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm/memcg: set memcg when splitting pageZhou Guanghui1-0/+1
As described in the split_page() comment, for the non-compound high order page, the sub-pages must be freed individually. If the memcg of the first page is valid, the tail pages cannot be uncharged when be freed. For example, when alloc_pages_exact is used to allocate 1MB continuous physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is set). When make_alloc_exact free the unused 1MB and free_pages_exact free the applied 1MB, actually, only 4KB(one page) is uncharged. Therefore, the memcg of the tail page needs to be set when splitting a page. Michel: There are at least two explicit users of __GFP_ACCOUNT with alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64: Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713 ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not just a theoretical issue. Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rui Xiang <rui.xiang@huawei.com> Cc: Tianhong Ding <dingtianhong@huawei.com> Cc: Weilong Chen <chenweilong@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add ↵Zhou Guanghui3-14/+9
nr_pages argument Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass in page number argument. In this way, the interface name is more common and can be used by potential users. In addition, the complete info(memcg and flag) of the memcg needs to be set to the tail pages. Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Tianhong Ding <dingtianhong@huawei.com> Cc: Weilong Chen <chenweilong@huawei.com> Cc: Rui Xiang <rui.xiang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) signSergei Trofimovich1-1/+1
In https://bugs.gentoo.org/769614 Dmitry noticed that `ptrace(PTRACE_GET_SYSCALL_INFO)` does not return error sign properly. The bug is in mismatch between get/set errors: static inline long syscall_get_error(struct task_struct *task, struct pt_regs *regs) { return regs->r10 == -1 ? regs->r8:0; } static inline long syscall_get_return_value(struct task_struct *task, struct pt_regs *regs) { return regs->r8; } static inline void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs, int error, long val) { if (error) { /* error < 0, but ia64 uses > 0 return value */ regs->r8 = -error; regs->r10 = -1; } else { regs->r8 = val; regs->r10 = 0; } } Tested on v5.10 on rx3600 machine (ia64 9040 CPU). Link: https://lkml.kernel.org/r/20210221002554.333076-2-slyfox@gentoo.org Link: https://bugs.gentoo.org/769614 Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reported-by: Dmitry V. Levin <ldv@altlinux.org> Reviewed-by: Dmitry V. Levin <ldv@altlinux.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13ia64: fix ia64_syscall_get_set_arguments() for break-based syscallsSergei Trofimovich1-6/+18
In https://bugs.gentoo.org/769614 Dmitry noticed that `ptrace(PTRACE_GET_SYSCALL_INFO)` does not work for syscalls called via glibc's syscall() wrapper. ia64 has two ways to call syscalls from userspace: via `break` and via `eps` instructions. The difference is in stack layout: 1. `eps` creates simple stack frame: no locals, in{0..7} == out{0..8} 2. `break` uses userspace stack frame: may be locals (glibc provides one), in{0..7} == out{0..8}. Both work fine in syscall handling cde itself. But `ptrace(PTRACE_GET_SYSCALL_INFO)` uses unwind mechanism to re-extract syscall arguments but it does not account for locals. The change always skips locals registers. It should not change `eps` path as kernel's handler already enforces locals=0 and fixes `break`. Tested on v5.10 on rx3600 machine (ia64 9040 CPU). Link: https://lkml.kernel.org/r/20210221002554.333076-1-slyfox@gentoo.org Link: https://bugs.gentoo.org/769614 Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reported-by: Dmitry V. Levin <ldv@altlinux.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm/userfaultfd: fix memory corruption due to writeprotectNadav Amit1-0/+8
Userfaultfd self-test fails occasionally, indicating a memory corruption. Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range() and defers flushes, and since there is insufficient consideration of concurrent deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from the TLBs in wp_page_copy(), this flush takes place after the copy has already been performed, and therefore changes of the page are possible between the time of the copy and the time in which the PTE is flushed. To make matters worse, memory-unprotection using userfaultfd also poses a problem. Although memory unprotection is logically a promotion of PTE permissions, and therefore should not require a TLB flush, the current userrfaultfd code might actually cause a demotion of the architectural PTE permission: when userfaultfd_writeprotect() unprotects memory region, it unintentionally *clears* the RW-bit if it was already set. Note that this unprotecting a PTE that is not write-protected is a valid use-case: the userfaultfd monitor might ask to unprotect a region that holds both write-protected and write-unprotected PTEs. The scenario that happens in selftests/vm/userfaultfd is as follows: cpu0 cpu1 cpu2 ---- ---- ---- [ Writable PTE cached in TLB ] userfaultfd_writeprotect() [ write-*unprotect* ] mwriteprotect_range() mmap_read_lock() change_protection() change_protection_range() ... change_pte_range() [ *clear* “write”-bit ] [ defer TLB flushes ] [ page-fault ] ... wp_page_copy() cow_user_page() [ copy page ] [ write to old page ] ... set_pte_at_notify() A similar scenario can happen: cpu0 cpu1 cpu2 cpu3 ---- ---- ---- ---- [ Writable PTE cached in TLB ] userfaultfd_writeprotect() [ write-protect ] [ deferred TLB flush ] userfaultfd_writeprotect() [ write-unprotect ] [ deferred TLB flush] [ page-fault ] wp_page_copy() cow_user_page() [ copy page ] ... [ write to page ] set_pte_at_notify() This race exists since commit 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made wp_page_copy() more likely to take place, specifically if page_count(page) > 1. To resolve the aforementioned races, check whether there are pending flushes on uffd-write-protected VMAs, and if there are, perform a flush before doing the COW. Further optimizations will follow to avoid during uffd-write-unprotect unnecassary PTE write-protection and TLB flushes. Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Signed-off-by: Nadav Amit <namit@vmware.com> Suggested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> [5.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13kasan: fix KASAN_STACK dependency for HW_TAGSAndrey Konovalov1-0/+1
There's a runtime failure when running HW_TAGS-enabled kernel built with GCC on hardware that doesn't support MTE. GCC-built kernels always have CONFIG_KASAN_STACK enabled, even though stack instrumentation isn't supported by HW_TAGS. Having that config enabled causes KASAN to issue MTE-only instructions to unpoison kernel stacks, which causes the failure. Fix the issue by disallowing CONFIG_KASAN_STACK when HW_TAGS is used. (The commit that introduced CONFIG_KASAN_HW_TAGS specified proper dependency for CONFIG_KASAN_STACK_ENABLE but not for CONFIG_KASAN_STACK.) Link: https://lkml.kernel.org/r/59e75426241dbb5611277758c8d4d6f5f9298dac.1615215441.git.andreyknvl@google.com Fixes: 6a63a63ff1ac ("kasan: introduce CONFIG_KASAN_HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOCAndrey Konovalov1-2/+6
Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called after debug_pagealloc_unmap_pages(). This causes a crash when debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an unmapped page. This patch puts kasan_free_nondeferred_pages() before debug_pagealloc_unmap_pages() and arch_free_page(), which can also make the page unavailable. Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.1614987311.git.andreyknvl@google.com Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm/madvise: replace ptrace attach requirement for process_madviseSuren Baghdasaryan1-1/+12
process_madvise currently requires ptrace attach capability. PTRACE_MODE_ATTACH gives one process complete control over another process. It effectively removes the security boundary between the two processes (in one direction). Granting ptrace attach capability even to a system process is considered dangerous since it creates an attack surface. This severely limits the usage of this API. The operations process_madvise can perform do not affect the correctness of the operation of the target process; they only affect where the data is physically located (and therefore, how fast it can be accessed). What we want is the ability for one process to influence another process in order to optimize performance across the entire system while leaving the security boundary intact. Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata and CAP_SYS_NICE for influencing process performance. Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13include/linux/sched/mm.h: use rcu_dereference in in_vfork()Matthew Wilcox (Oracle)1-1/+2
Fix a sparse warning by using rcu_dereference(). Technically this is a bug and a sufficiently aggressive compiler could reload the `real_parent' pointer outside the protection of the rcu lock (and access freed memory), but I think it's pretty unlikely to happen. Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org Fixes: b18dc5f291c0 ("mm, oom: skip vforked tasks from being selected") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13kfence: fix reports if constant function prefixes existMarco Elver1-6/+12
Some architectures prefix all functions with a constant string ('.' on ppc64). Add ARCH_FUNC_PREFIX, which may optionally be defined in <asm/kfence.h>, so that get_stack_skipnr() can work properly. Link: https://lkml.kernel.org/r/f036c53d-7e81-763c-47f4-6024c6c5f058@csgroup.eu Link: https://lkml.kernel.org/r/20210304144000.1148590-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocationsMarco Elver1-1/+1
cache_alloc_debugcheck_after() performs checks on an object, including adjusting the returned pointer. None of this should apply to KFENCE objects. While for non-bulk allocations, the checks are skipped when we allocate via KFENCE, for bulk allocations cache_alloc_debugcheck_after() is called via cache_alloc_debugcheck_after_bulk(). Fix it by skipping cache_alloc_debugcheck_after() for KFENCE objects. Link: https://lkml.kernel.org/r/20210304205256.2162309-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13kfence: fix printk format for ptrdiff_tMarco Elver1-6/+6
Use %td for ptrdiff_t. Link: https://lkml.kernel.org/r/3abbe4c9-16ad-c168-a90f-087978ccd8f7@csgroup.eu Link: https://lkml.kernel.org/r/20210303121157.3430807-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Jann Horn <jannh@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*Arnd Bergmann1-0/+6
Separating compiler-clang.h from compiler-gcc.h inadventently dropped the definitions of the three HAVE_BUILTIN_BSWAP macros, which requires falling back to the open-coded version and hoping that the compiler detects it. Since all versions of clang support the __builtin_bswap interfaces, add back the flags and have the headers pick these up automatically. This results in a 4% improvement of compilation speed for arm defconfig. Note: it might also be worth revisiting which architectures set CONFIG_ARCH_USE_BUILTIN_BSWAP for one compiler or the other, today this is set on six architectures (arm32, csky, mips, powerpc, s390, x86), while another ten architectures define custom helpers (alpha, arc, ia64, m68k, mips, nios2, parisc, sh, sparc, xtensa), and the rest (arm64, h8300, hexagon, microblaze, nds32, openrisc, riscv) just get the unoptimized version and rely on the compiler to detect it. A long time ago, the compiler builtins were architecture specific, but nowadays, all compilers that are able to build the kernel have correct implementations of them, though some may not be as optimized as the inline asm versions. The patch that dropped the optimization landed in v4.19, so as discussed it would be fairly safe to backport this revert to stable kernels to the 4.19/5.4/5.10 stable kernels, but there is a remaining risk for regressions, and it has no known side-effects besides compile speed. Link: https://lkml.kernel.org/r/20210226161151.2629097-1-arnd@kernel.org Link: https://lore.kernel.org/lkml/20210225164513.3667778-1-arnd@kernel.org/ Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Marco Elver <elver@google.com> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13MAINTAINERS: exclude uapi directories in API/ABI sectionVlastimil Babka1-2/+2
Commit 7b4693e644cb ("MAINTAINERS: add uapi directories to API/ABI section") added include/uapi/ and arch/*/include/uapi/ so that patches modifying them CC linux-api. However that was already done in the past and resulted in too much noise and thus later removed, as explained in b14fd334ff3d ("MAINTAINERS: trim the file triggers for ABI/API") To prevent another round of addition and removal in the future, change the entries to X: (explicit exclusion) for documentation purposes, although they are not subdirectories of broader included directories, as there is apparently no defined way to add plain comments in subsystem sections. Link: https://lkml.kernel.org/r/20210301100255.25229-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13binfmt_misc: fix possible deadlock in bm_register_writeLior Ribak1-15/+14
There is a deadlock in bm_register_write: First, in the begining of the function, a lock is taken on the binfmt_misc root inode with inode_lock(d_inode(root)). Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call open_exec on the user-provided interpreter. open_exec will call a path lookup, and if the path lookup process includes the root of binfmt_misc, it will try to take a shared lock on its inode again, but it is already locked, and the code will get stuck in a deadlock To reproduce the bug: $ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register backtrace of where the lock occurs (#5): 0 schedule () at ./arch/x86/include/asm/current.h:15 1 0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992 2 0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213 3 __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222 4 down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355 5 0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783 6 open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177 7 path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366 8 0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396 9 0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913 10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948 11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682 12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF ", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603 13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF ", count=19) at fs/read_write.c:658 14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46 15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120 To solve the issue, the open_exec call is moved to before the write lock is taken by bm_register_write Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com Fixes: 948b701a607f1 ("binfmt_misc: add persistent opened binary handler for containers") Signed-off-by: Lior Ribak <liorribak@gmail.com> Acked-by: Helge Deller <deller@gmx.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm/highmem.c: fix zero_user_segments() with start > endOGAWA Hirofumi1-5/+12
zero_user_segments() is used from __block_write_begin_int(), for example like the following zero_user_segments(page, 4096, 1024, 512, 918) But new the zero_user_segments() implementation for for HIGHMEM + TRANSPARENT_HUGEPAGE doesn't handle "start > end" case correctly, and hits BUG_ON(). (we can fix __block_write_begin_int() instead though, it is the old and multiple usage) Also it calls kmap_atomic() unnecessarily while start == end == 0. Link: https://lkml.kernel.org/r/87v9ab60r4.fsf@mail.parknet.co.jp Fixes: 0060ef3b4e6d ("mm: support THPs in zero_user_segments") Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13hugetlb: do early cow when page pinned on src mmPeter Xu1-4/+62
This is the last missing piece of the COW-during-fork effort when there're pinned pages found. One can reference 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes", 2020-09-27) for more information, since we do similar things here rather than pte this time, but just for hugetlb. Note that after Jason's recent work on 57efa1fe5957 ("mm/gup: prevent gup_fast from racing with COW during fork", 2020-12-15) which is safer and easier to understand, we're safe now within the whole copy_page_range() against gup-fast, we don't need the wr-protect trick that proposed in 70e806e4e645 anymore. Link: https://lkml.kernel.org/r/20210217233547.93892-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Gal Pressman <galpress@amazon.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Roland Scheidegger <sroland@vmware.com> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Cc: Wei Zhang <wzam@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm: use is_cow_mapping() across tree where properPeter Xu4-9/+3
After is_cow_mapping() is exported in mm.h, replace some manual checks elsewhere throughout the tree but start to use the new helper. Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Cc: Roland Scheidegger <sroland@vmware.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Gal Pressman <galpress@amazon.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Wei Zhang <wzam@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>