2022-04-22  objtool: Fix code relocs vs weak symbols  (Peter Zijlstra, 1 file, -22/+165)
Occasionally objtool driven code patching (think .static_call_sites, .retpoline_sites etc.) goes sideways and it tries to patch an instruction that doesn't match. Much head-scratching and cursing later, the problem is as outlined below and affects every section that objtool generates for us, very much including the ORC data. The below uses .static_call_sites because it's convenient for demonstration purposes, but as mentioned the ORC sections, .retpoline_sites and __mcount_loc are all similarly affected.

Consider:

  foo-weak.c:

    extern void __SCT__foo(void);

    __attribute__((weak)) void foo(void)
    {
            return __SCT__foo();
    }

  foo.c:

    extern void __SCT__foo(void);
    extern void my_foo(void);

    void foo(void)
    {
            my_foo();
            return __SCT__foo();
    }

These generate the obvious code
(gcc -O2 -fcf-protection=none -fno-asynchronous-unwind-tables -c foo*.c):

  foo-weak.o:
  0000000000000000 <foo>:
     0:   e9 00 00 00 00          jmpq   5 <foo+0x5>      1: R_X86_64_PLT32 __SCT__foo-0x4

  foo.o:
  0000000000000000 <foo>:
     0:   48 83 ec 08             sub    $0x8,%rsp
     4:   e8 00 00 00 00          callq  9 <foo+0x9>      5: R_X86_64_PLT32 my_foo-0x4
     9:   48 83 c4 08             add    $0x8,%rsp
     d:   e9 00 00 00 00          jmpq   12 <foo+0x12>    e: R_X86_64_PLT32 __SCT__foo-0x4

Now, when we link these two files together, you get something like
(ld -r -o foos.o foo-weak.o foo.o):

  foos.o:
  0000000000000000 <foo-0x10>:
     0:   e9 00 00 00 00          jmpq   5 <foo-0xb>      1: R_X86_64_PLT32 __SCT__foo-0x4
     5:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
     f:   90                      nop

  0000000000000010 <foo>:
    10:   48 83 ec 08             sub    $0x8,%rsp
    14:   e8 00 00 00 00          callq  19 <foo+0x9>     15: R_X86_64_PLT32 my_foo-0x4
    19:   48 83 c4 08             add    $0x8,%rsp
    1d:   e9 00 00 00 00          jmpq   22 <foo+0x12>    1e: R_X86_64_PLT32 __SCT__foo-0x4

Noting that ld preserves the weak function text, but strips the symbol off of it (hence objdump doing that funny negative offset thing). This does lead to 'interesting' unused code issues with objtool when run on linked objects, but that seems to be working (fingers crossed).

So far so good.. Now lets consider the objtool static_call output section (readelf output, old binutils):

  foo-weak.o:
  Relocation section '.rela.static_call_sites' at offset 0x2c8 contains 1 entry:
      Offset            Info            Type           Symbol's Value    Symbol's Name + Addend
  0000000000000000  0000000200000002  R_X86_64_PC32  0000000000000000  .text + 0
  0000000000000004  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1

  foo.o:
  Relocation section '.rela.static_call_sites' at offset 0x310 contains 2 entries:
      Offset            Info            Type           Symbol's Value    Symbol's Name + Addend
  0000000000000000  0000000200000002  R_X86_64_PC32  0000000000000000  .text + d
  0000000000000004  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1

  foos.o:
  Relocation section '.rela.static_call_sites' at offset 0x430 contains 4 entries:
      Offset            Info            Type           Symbol's Value    Symbol's Name + Addend
  0000000000000000  0000000100000002  R_X86_64_PC32  0000000000000000  .text + 0
  0000000000000004  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1
  0000000000000008  0000000100000002  R_X86_64_PC32  0000000000000000  .text + 1d
  000000000000000c  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1

So we have two patch sites, one in the dead code of the weak foo and one in the real foo. All is well.

*HOWEVER*, when the toolchain strips unused section symbols it generates things like this (using new enough binutils):

  foo-weak.o:
  Relocation section '.rela.static_call_sites' at offset 0x2c8 contains 1 entry:
      Offset            Info            Type           Symbol's Value    Symbol's Name + Addend
  0000000000000000  0000000200000002  R_X86_64_PC32  0000000000000000  foo + 0
  0000000000000004  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1

  foo.o:
  Relocation section '.rela.static_call_sites' at offset 0x310 contains 2 entries:
      Offset            Info            Type           Symbol's Value    Symbol's Name + Addend
  0000000000000000  0000000200000002  R_X86_64_PC32  0000000000000000  foo + d
  0000000000000004  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1

  foos.o:
  Relocation section '.rela.static_call_sites' at offset 0x430 contains 4 entries:
      Offset            Info            Type           Symbol's Value    Symbol's Name + Addend
  0000000000000000  0000000100000002  R_X86_64_PC32  0000000000000000  foo + 0
  0000000000000004  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1
  0000000000000008  0000000100000002  R_X86_64_PC32  0000000000000000  foo + d
  000000000000000c  0000000d00000002  R_X86_64_PC32  0000000000000000  __SCT__foo + 1

And now we can see how that foos.o .static_call_sites goes sideways: we now have _two_ patch sites in foo. One for the weak symbol at foo+0 (which is no longer a static_call site!) and one at foo+d which is in fact the right location.

This seems to happen when objtool cannot find a section symbol, in which case it falls back to any other symbol to key off of; however, in this case that goes terribly wrong!

As such, teach objtool to create a section symbol when there isn't one.

Fixes: 44f6a7c0755d ("objtool: Fix seg fault with Clang non-section symbols")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20220419203807.655552918@infradead.org
2022-04-22  objtool: Fix type of reloc::addend  (Peter Zijlstra, 3 files, -7/+7)
Elf{32,64}_Rela::r_addend is of type Elf{32,64}_Sword, which means that our reloc::addend needs to be long, or face truncation issues when we do elf_rebuild_reloc_section():

  -  107: 48 b8 00 00 00 00 00 00 00 00  movabs $0x0,%rax   109: R_X86_64_64 level4_kernel_pgt+0x80000067
  +  107: 48 b8 00 00 00 00 00 00 00 00  movabs $0x0,%rax   109: R_X86_64_64 level4_kernel_pgt-0x7fffff99

Fixes: 627fce14809b ("objtool: Add ORC unwind table generation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20220419203807.596871927@infradead.org
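To see the truncation concretely, here is a small standalone C sketch (the values mirror the level4_kernel_pgt example above; reloc::addend is objtool's field, the rest is illustrative):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          /* Elf64_Rela::r_addend is a 64-bit signed value. */
          int64_t r_addend = 0x80000067;

          int  as_int  = (int)r_addend;   /* old reloc::addend type */
          long as_long = (long)r_addend;  /* the fixed type, on LP64 */

          /* prints -2147483545, i.e. -0x7fffff99: the bogus addend */
          printf("int:  %d\n", as_int);
          /* prints 2147483751, i.e. 0x80000067: the intended addend */
          printf("long: %ld\n", as_long);
          return 0;
  }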
2022-04-19  objtool: Fix function fallthrough detection for vmlinux  (Josh Poimboeuf, 3 files, -3/+2)
Objtool's function fallthrough detection only works on C objects. The distinction between C and assembly objects no longer makes sense with objtool running on vmlinux.o. Now that copy_user_64.S has been fixed up, and an objtool sibling call detection bug has been fixed, the asm code is in "compliance" and this hack is no longer needed. Remove it. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/b434cff98eca3a60dcc64c620d7d5d405a0f441c.1649718562.git.jpoimboe@redhat.com
2022-04-19  objtool: Fix sibling call detection in alternatives  (Josh Poimboeuf, 1 file, -19/+17)
In add_jump_destinations(), sibling call detection requires 'insn->func' to be valid. But alternative instructions get their 'func' set in handle_group_alt(), which runs *after* add_jump_destinations(). So sibling calls in alternatives code don't get properly detected. Fix that by changing the initialization order: call add_special_section_alts() *before* add_jump_destinations(). This also means the special case for a missing 'jump_dest' in add_jump_destinations() can be removed, as it has already been dealt with. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/c02e0a0a2a4286b5f848d17c77fdcb7e0caf709c.1649718562.git.jpoimboe@redhat.com
2022-04-19  objtool: Don't set 'jump_dest' for sibling calls  (Josh Poimboeuf, 1 file, -13/+22)
For most sibling calls, 'jump_dest' is NULL because objtool treats the jump like a call and sets 'call_dest'. But there are a few edge cases where that's not true. Make it consistent to avoid unexpected behavior. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/8737d6b9d1691831aed73375f444f0f42da3e2c9.1649718562.git.jpoimboe@redhat.com
2022-04-19  x86/uaccess: Don't jump between functions  (Josh Poimboeuf, 1 file, -35/+52)
For unwinding sanity, a function shouldn't jump to the middle of another function. Move the short string user copy code out to a separate non-function code snippet. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/9519e4853148b765e047967708f2b61e56c93186.1649718562.git.jpoimboe@redhat.com
2022-04-19  x86/Kconfig: fix the spelling of 'becoming' in X86_KERNEL_IBT config  (Nur Hussein, 1 file, -1/+1)
There is only one m in becoming. Signed-off-by: Nur Hussein <hussein@unixcat.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220417192454.10247-1-hussein@unixcat.org
2022-04-19  objtool: Use offstr() to print address of missing ENDBR  (Josh Poimboeuf, 1 file, -3/+2)
Fixes: 89bc853eae4a ("objtool: Find unused ENDBR instructions") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/95d12e800c736a3f7d08d61dabb760b2d5251a8e.1650300597.git.jpoimboe@redhat.com
2022-04-19  objtool: Print data address for "!ENDBR" data warnings  (Josh Poimboeuf, 1 file, -5/+2)
When a "!ENDBR" warning is reported for a data section, objtool just prints the text address of the relocation target twice, without giving any clues about the location of the original data reference:

  vmlinux.o: warning: objtool: dcbnl_netdevice_event()+0x0: .text+0xb64680: data relocation to !ENDBR: dcbnl_netdevice_event+0x0

Instead, print the address of the data reference, in addition to the address of the relocation target:

  vmlinux.o: warning: objtool: dcbnl_nb+0x0: .data..read_mostly+0xe260: data relocation to !ENDBR: dcbnl_netdevice_event+0x0

Fixes: 89bc853eae4a ("objtool: Find unused ENDBR instructions")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/762e88d51300e8eaf0f933a5b0feae20ac033bea.1650300597.git.jpoimboe@redhat.com
2022-04-19  x86/xen: Add ANNOTATE_NOENDBR to startup_xen()  (Josh Poimboeuf, 1 file, -0/+1)
The startup_xen() kernel entry point is referenced by the ".note.Xen" section, and is the real entry point of the VM. Control transfer is through IRET, which *could* set NEED_ENDBR, however Xen currently does no such thing. Add ANNOTATE_NOENDBR to silence future objtool warnings. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lkml.kernel.org/r/a87bd48b06d11ec4b98122a429e71e489b4e48c3.1650300597.git.jpoimboe@redhat.com
2022-04-19  x86/uaccess: Add ENDBR to __put_user_nocheck*()  (Josh Poimboeuf, 1 file, -0/+4)
The __put_user_nocheck*() inner labels are exported, so in keeping with the "allow exported functions to be indirectly called" policy, add ENDBR. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/207f02177a23031091d1a608de6049a9e5e8ff80.1650300597.git.jpoimboe@redhat.com
2022-04-19  x86/retpoline: Add ANNOTATE_NOENDBR for retpolines  (Josh Poimboeuf, 1 file, -1/+1)
The retpolines are exported, so they're referenced by ksymtab sections. But they're never indirect-branched to, so add ANNOTATE_NOENDBR. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/b6ec963dfd9301b6b1d74ef7758fcb0b540d6c6c.1650300597.git.jpoimboe@redhat.com
2022-04-19  x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline  (Josh Poimboeuf, 1 file, -0/+1)
The static call trampoline is never indirect-branched to, but is referenced by the static call key. Add ANNOTATE_NOENDBR. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1b5b54aad7d81241dabe5e0c9b40dea64b540b00.1650300597.git.jpoimboe@redhat.com
2022-04-19  objtool: Enable unreachable warnings for CLANG LTO  (Josh Poimboeuf, 2 files, -2/+2)
With IBT support in, objtool is now fully capable of following vmlinux code flow in LTO mode. Start reporting unreachable warnings for Clang LTO as well. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/7b12df54bceeb0761fe9fc8269ea0c00501214a9.1650300597.git.jpoimboe@redhat.com
2022-04-19  x86,objtool: Explicitly mark idtentry_body()s tail REACHABLE  (Peter Zijlstra, 1 file, -0/+3)
Objtool can figure out that some \cfunc()s are noreturn and then complains about certain instances having unreachable tails:

  vmlinux.o: warning: objtool: asm_exc_xen_unknown_trap()+0x16: unreachable instruction

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220408094718.441854969@infradead.org
2022-04-19  x86,objtool: Mark cpu_startup_entry() __noreturn  (Peter Zijlstra, 2 files, -1/+2)
GCC-8 isn't clever enough to figure out that cpu_startup_entry() is a noreturn, while objtool is. This results in code after the call in start_secondary(). Give GCC a hand so that they all agree on things.

  vmlinux.o: warning: objtool: start_secondary()+0x10e: unreachable

Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220408094718.383658532@infradead.org
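As a minimal sketch, the annotation's shape is just the following (the actual header location is elided; __noreturn is the kernel's existing attribute macro):

  /* Tell GCC what objtool already deduced: this function never returns. */
  void __noreturn cpu_startup_entry(enum cpuhp_state state);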
2022-04-19  x86,xen,objtool: Add UNWIND hint  (Peter Zijlstra, 1 file, -0/+1)
SYM_CODE_START*() doesn't get auto-validated and needs an UNWIND hint to get checked; add one.

  vmlinux.o: warning: objtool: pvh_start_xen()+0x0: unreachable

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220408094718.321246297@infradead.org
2022-04-19  lib/strn*,objtool: Enforce user_access_begin() rules  (Peter Zijlstra, 2 files, -2/+2)
Apparently GCC can fail to inline a 'static inline' single caller function:

  lib/strnlen_user.o: warning: objtool: strnlen_user()+0x33: call to do_strnlen_user() with UACCESS enabled
  lib/strncpy_from_user.o: warning: objtool: strncpy_from_user()+0x33: call to do_strncpy_from_user() with UACCESS enabled

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220408094718.262932488@infradead.org
2022-04-19  MAINTAINERS: Add x86 unwinding entry  (Josh Poimboeuf, 1 file, -0/+9)
Create a new section for x86 unwinder maintenance. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/db2b764b735a9481df9f7717a3a1f75ba496fcc1.1650387176.git.jpoimboe@redhat.com Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-04-19  x86/unwind/orc: Recheck address range after stack info was updated  (Dmitry Monakhov, 1 file, -4/+4)
A crash was observed in the ORC unwinder:

  BUG: stack guard page was hit at 000000000dd984a2 (stack is 00000000d1caafca..00000000613712f0)
  kernel stack overflow (page fault): 0000 [#1] SMP NOPTI
  CPU: 93 PID: 23787 Comm: context_switch1 Not tainted 5.4.145 #1
  RIP: 0010:unwind_next_frame
  Call Trace:
   <NMI>
   perf_callchain_kernel
   get_perf_callchain
   perf_callchain
   perf_prepare_sample
   perf_event_output_forward
   __perf_event_overflow
   perf_ibs_handle_irq
   perf_ibs_nmi_handler
   nmi_handle
   default_do_nmi
   do_nmi
   end_repeat_nmi

This was really two bugs:

 1) The perf IBS code passed inconsistent regs to the unwinder.
 2) The unwinder didn't handle the bad input gracefully.

Fix the latter bug. The ORC unwinder needs to be immune against bad inputs. The problem is that stack_access_ok() doesn't recheck the validity of the full range of registers after switching to the next valid stack with get_stack_info(). Fix that.

[ jpoimboe: rewrote commit log ]

Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/1650353656-956624-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-04-19  x86/cpu: Load microcode during restore_processor_state()  (Borislav Petkov, 3 files, -4/+14)
When resuming from a system sleep state, restore_processor_state() restores the boot CPU MSRs. These MSRs could be emulated by microcode. If microcode is not loaded yet, writing to emulated MSRs leads to an unchecked MSR access error:

  ...
  PM: Calling lapic_suspend+0x0/0x210
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr)
  Call Trace:
   <TASK>
   ? restore_processor_state
   x86_acpi_suspend_lowlevel
   acpi_suspend_enter
   suspend_devices_and_enter
   pm_suspend.cold
   state_store
   kobj_attr_store
   sysfs_kf_write
   kernfs_fop_write_iter
   new_sync_write
   vfs_write
   ksys_write
   __x64_sys_write
   do_syscall_64
   entry_SYSCALL_64_after_hwframe
  RIP: 0033:0x7fda13c260a7

To ensure microcode emulated MSRs are available for restoration, load the microcode on the boot CPU before restoring these MSRs.

[ Pawan: write commit message and productize it. ]

Fixes: e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume")
Reported-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841
Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.com
2022-04-19  x86/cpu: Add new Alderlake and Raptorlake CPU model numbers  (Tony Luck, 1 file, -0/+3)
Intel is subdividing the mobile segment with additional models with the same codename. Using the Intel "N" and "P" suffixes for these will be less confusing than trying to map to some different naming convention.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/YlS7n7Xtso9BXZA2@agluck-desk3.sc.intel.com
2022-04-17  Linux 5.18-rc3  [tag: v5.18-rc3]  (Linus Torvalds, 1 file, -1/+1)
2022-04-16  random: use memmove instead of memcpy for remaining 32 bytes  (Jason A. Donenfeld, 1 file, -1/+1)
In order to immediately overwrite the old key on the stack, before servicing a userspace request for bytes, we use the remaining 32 bytes of block 0 as the key. This means moving indices 8,9,a,b,c,d,e,f -> 4,5,6,7,8,9,a,b. Since 4 < 8, for the kernel implementations of memcpy(), this doesn't actually appear to be a problem in practice. But relying on that characteristic seems a bit brittle. So let's change that to a proper memmove(), which is the by-the-books way of handling overlapping memory copies. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
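A self-contained sketch of the overlapping move (treating block 0 as sixteen 32-bit words; names are illustrative, not the kernel's):

  #include <stdint.h>
  #include <string.h>

  /* Move words 8..f down to 4..b. Source and destination overlap by
   * four words, so memmove(), which is defined for overlapping
   * regions, is the by-the-books primitive; memcpy() merely happens
   * to work with current kernel implementations. */
  static void reuse_block_tail_as_key(uint32_t block[16])
  {
          memmove(&block[4], &block[8], 8 * sizeof(uint32_t));
  }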
2022-04-15  mm: kmemleak: take a full lowmem check in kmemleak_*_phys()  (Patrick Wang, 1 file, -4/+4)
The kmemleak_*_phys() APIs do not check the address against lowmem's min boundary, while the caller may pass an address below lowmem, which will trigger an oops:

  # echo scan > /sys/kernel/debug/kmemleak
  Unable to handle kernel paging request at virtual address ff5fffffffe00000
  Oops [#1]
  Modules linked in:
  CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33
  Hardware name: riscv-virtio,qemu (DT)
  epc : scan_block+0x74/0x15c
  ra : scan_block+0x72/0x15c
  epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30
  gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200
  t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90
  s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000
  a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001
  a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005
  s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90
  s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0
  s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000
  s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f
  t5 : 0000000000000001 t6 : 0000000000000000
  status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d
  scan_gray_list+0x12e/0x1a6
  kmemleak_scan+0x2aa/0x57e
  kmemleak_write+0x32a/0x40c
  full_proxy_write+0x56/0x82
  vfs_write+0xa6/0x2a6
  ksys_write+0x6c/0xe2
  sys_write+0x22/0x2a
  ret_from_syscall+0x0/0x2

The callers may not quite know the actual address they pass (e.g. from the devicetree). So the kmemleak_*_phys() APIs should guarantee the address they finally use is in the lowmem range; check the address against lowmem's min boundary.

Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com
Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
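A hedged sketch of the check's shape (the wrapper name is illustrative; min_low_pfn/max_low_pfn are the kernel's lowmem bounds):

  /* Only hand the range to the object database when it lies in
   * lowmem, so that __va(phys) yields a valid pointer. */
  static void kmemleak_alloc_phys_checked(phys_addr_t phys, size_t size,
                                          int min_count, gfp_t gfp)
  {
          if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
                  kmemleak_alloc(__va(phys), size, min_count, gfp);
  }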
2022-04-15  mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore  (Omar Sandoval, 3 files, -14/+0)
Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge the vmap areas instead of doing it lazily.

Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context") moved the purging from the vunmap() caller to a worker thread. Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin (possibly forever). For example, consider the following scenario:

 1. Thread reads from /proc/vmcore. This eventually calls __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1.
 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2 pages (one page plus the guard page) to the purge list and vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the drain_vmap_work is scheduled.
 3. Thread returns from the kernel and is scheduled out.
 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It frees the 2 pages on the purge list. vmap_lazy_nr is now lazy_max_pages() + 1.
 5. This is still over the threshold, so it tries to purge areas again, but doesn't find anything.
 6. Repeat 5.

If the system is running with only one CPU (which is typical for kdump) and preemption is disabled, then this will never make forward progress: there aren't any more pages to purge, so it hangs. If there is more than one CPU or preemption is enabled, then the worker thread will spin forever in the background. (Note that if there were already pages to be purged at the time that set_iounmap_nonlazy() was called, this bug is avoided.)

This can be reproduced with anything that reads from /proc/vmcore multiple times, e.g. vmcore-dmesg /proc/vmcore.

It turns out that improvements to vmap() over the years have obsoleted the need for this "optimization". I benchmarked `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed entirely:

      |5.17  |5.18+fix|5.18+removal
  4k  |40.86s|  40.09s|      26.73s
  1M  |24.47s|  23.98s|      21.84s

The removal was the fastest (by a wide margin with 4k reads). This patch removes set_iounmap_nonlazy().

Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"Andrew Morton1-2/+2
Despite Mike's attempted fix (925346c129da117122), regressions reports continue: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info So revert this patch. Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE") Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fangrui Song <maskray@google.com> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"Andrew Morton1-1/+1
Commit 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") was an attempt to fix regressions due to 9630f0d60fec5f ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE"). But regressionss continue to be reported: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info This patch reverts the fix, so the original can also be reverted. Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Song Liu <songliubraving@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Fangrui Song <maskray@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15  hugetlb: do not demote poisoned hugetlb pages  (Mike Kravetz, 1 file, -7/+10)
It is possible for poisoned hugetlb pages to reside on the free lists. The huge page allocation routines which dequeue entries from the free lists make a point of avoiding poisoned pages. There is no such check and avoidance in the demote code path.

For a hugetlb page on a free list, poison will only be set in the head page rather than the page with the actual error. If such a page is demoted, then the poison flag may follow the wrong page: a page without the error could have poison set, and the page with the actual error could be left without the flag.

Check for poison before attempting to demote a hugetlb page. Also, return -EBUSY to the caller if only poisoned pages are on the free list.

Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
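A hedged sketch of the check's shape in the demote path (function and caller names are illustrative; locking elided):

  /* Never demote a page carrying the poison flag; if the free list
   * holds only such pages, the caller ultimately sees -EBUSY. */
  static int demote_one_free_huge_page(struct hstate *h, struct page *head)
  {
          if (PageHWPoison(head))
                  return -EBUSY;

          return demote_free_huge_page(h, head);
  }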
2022-04-15  mm: compaction: fix compiler warning when CONFIG_COMPACTION=n  (Charan Teja Kalla, 1 file, -5/+5)
The below warning is reported when CONFIG_COMPACTION=n:

  mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' defined but not used [-Wunused-const-variable=]
     56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
        |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under the CONFIG_COMPACTION #ifdef. Also, since this is just a 'static const int' value, use #define for it.

Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
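Sketched from the description above (this mirrors the description, not the exact diff):

  #ifdef CONFIG_COMPACTION
  /* was: static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
   * which -Wunused-const-variable flags when CONFIG_COMPACTION=n */
  #define HPAGE_FRAG_CHECK_INTERVAL_MSEC 500
  #endif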
2022-04-15  mm: fix unexpected zeroed page mapping with zram swap  (Minchan Kim, 1 file, -54/+0)
With two processes cloned under CLONE_VM, a user process can be corrupted by unexpectedly seeing a zeroed page:

  CPU A                               CPU B
  do_swap_page                        do_swap_page
  SWP_SYNCHRONOUS_IO path             SWP_SYNCHRONOUS_IO path
  swap_readpage valid data
    swap_slot_free_notify
      delete zram entry
                                      swap_readpage zeroed(invalid) data
                                      pte_lock
                                      map the *zero data* to userspace
                                      pte_unlock
  pte_lock
  if (!pte_same) goto out_nomap;
  pte_unlock
  return and next refault will read zeroed data

The swap_slot_free_notify is bogus for the CLONE_VM case, since it doesn't increase the refcount of the swap slot at copy_mm, so it couldn't tell whether it's safe or not to discard data from the backing device. In that case, the only lock it can rely on to synchronize swap slot freeing is the page table lock. Thus, this patch gets rid of the swap_slot_free_notify function. With this patch, CPU A will see correct data:

  CPU A                               CPU B
  do_swap_page                        do_swap_page
  SWP_SYNCHRONOUS_IO path             SWP_SYNCHRONOUS_IO path
  swap_readpage original data
  pte_lock
  map the original data
  swap_free
    swap_range_free
      bd_disk->fops->swap_slot_free_notify
                                      swap_readpage read zeroed data
  pte_unlock
                                      pte_lock
                                      if (!pte_same) goto out_nomap;
                                      pte_unlock
                                      return
                                      on next refault will see mapped data by CPU B

The concern with this patch is that it could increase memory consumption, since it could keep wasted memory in compressed form in zram as well as in uncompressed form in the address space. However, most zram use cases do no readahead and do_swap_page is followed by swap_free, so the compressed form in zram will be freed quickly.

Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15  mm, page_alloc: fix build_zonerefs_node()  (Juergen Gross, 1 file, -1/+1)
Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") only zones with free memory are included in a built zonelist. This is problematic when e.g. all memory of a zone has been ballooned out when zonelists are being rebuilt.

The decision whether to rebuild the zonelists when onlining new memory is done based on populated_zone() returning 0 for the zone the memory will be added to. The new zone is added to the zonelists only, if it has free memory pages (managed_zone() returns a non-zero value) after the memory has been onlined. This implies, that onlining memory will always free the added pages to the allocator immediately, but this is not true in all cases: when e.g. running as a Xen guest the onlined new memory will be added only to the ballooned memory list, it will be freed only when the guest is being ballooned up afterwards.

Another problem with using managed_zone() for the decision whether a zone is being added to the zonelists is, that a zone with all memory used will in fact be removed from all zonelists in case the zonelists happen to be rebuilt.

Use populated_zone() when building a zonelist as it has been done before that commit.

There was a report that QubesOS (based on Xen) is hitting this problem. Xen has switched to use the zone device functionality in kernel 5.9 and QubesOS wants to use memory hotplugging for guests in order to be able to start a guest with minimal memory and expand it as needed. This was the report leading to the patch.

Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
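A hedged sketch of the function's shape after the fix (the zoneref helpers are the kernel's existing ones; treat the exact body as illustrative):

  static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
  {
          struct zone *zone;
          enum zone_type zone_type = MAX_NR_ZONES;
          int nr_zones = 0;

          do {
                  zone_type--;
                  zone = pgdat->node_zones + zone_type;
                  /* populated_zone(): the zone has present pages, even
                   * if they are currently ballooned out; managed_zone()
                   * would wrongly skip such a zone. */
                  if (populated_zone(zone)) {
                          zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                          check_highest_zone(zone_type);
                  }
          } while (zone_type);

          return nr_zones;
  }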
2022-04-15  mm, kfence: support kmem_dump_obj() for KFENCE objects  (Marco Elver, 9 files, -25/+105)
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been producing garbage data due to the object not actually being maintained by SLAB or SLUB. Fix this by implementing __kfence_obj_info() that copies relevant information to struct kmem_obj_info when the object was allocated by KFENCE; this is called by a common kmem_obj_info(), which also calls the slab/slub/slob specific variant now called __kmem_obj_info(). For completeness, kmem_dump_obj() now displays if the object was allocated by KFENCE. Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB") Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15  kasan: fix hw tags enablement when KUNIT tests are disabled  (Vincenzo Frascino, 2 files, -6/+9)
KASAN enables HW tags via kasan_enable_tagging(), which selects the correct hw backend based on the mode passed via the kernel command line. kasan_enable_tagging() is meant to be invoked indirectly via the cpu features framework of the architectures that support these backends.

Currently the invocation of this function is guarded by CONFIG_KASAN_KUNIT_TEST, which allows the enablement of the correct backend only when KUNIT tests are enabled in the kernel. This inconsistency was introduced in commit ed6d74446cbf ("kasan: test: support async (again) and asymm modes for HW_TAGS"), and prevents enabling MTE on arm64 when KUNIT tests for kasan hw_tags are disabled.

Fix the issue by making sure that the CONFIG_KASAN_KUNIT_TEST guard does not prevent the correct invocation of kasan_enable_tagging().

Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446cbf ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15  irq_work: use kasan_record_aux_stack_noalloc() to record callstack  (Zqiang, 1 file, -1/+1)
On a PREEMPT_RT kernel with KASAN enabled, kasan_record_aux_stack() may call alloc_pages(), which acquires an rt-spinlock. If this happens in atomic context, it triggers the warning below:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd
  Preemption disabled at:
  [<ffffffffbab1a531>] rt_mutex_slowunlock+0xa1/0x4e0
  CPU: 3 PID: 239 Comm: bootlogd Tainted: G W 5.17.1-rt17-yocto-preempt-rt+ #105
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  Call Trace:
   __might_resched.cold+0x13b/0x173
   rt_spin_lock+0x5b/0xf0
   get_page_from_freelist+0x20c/0x1610
   __alloc_pages+0x25e/0x5e0
   __stack_depot_save+0x3c0/0x4a0
   kasan_save_stack+0x3a/0x50
   __kasan_record_aux_stack+0xb6/0xc0
   kasan_record_aux_stack+0xe/0x10
   irq_work_queue_on+0x6a/0x1c0
   pull_rt_task+0x631/0x6b0
   do_balance_callbacks+0x56/0x80
   __balance_callbacks+0x63/0x90
   rt_mutex_setprio+0x349/0x880
   rt_mutex_slowunlock+0x22a/0x4e0
   rt_spin_unlock+0x49/0x80
   uart_write+0x186/0x2b0
   do_output_char+0x2e9/0x3a0
   n_tty_write+0x306/0x800
   file_tty_write.isra.0+0x2af/0x450
   tty_write+0x22/0x30
   new_sync_write+0x27c/0x3a0
   vfs_write+0x3f7/0x5d0
   ksys_write+0xd9/0x180
   __x64_sys_write+0x43/0x50
   do_syscall_64+0x44/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix it by using kasan_record_aux_stack_noalloc() to avoid the call to alloc_pages().

Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15  mm/secretmem: fix panic when growing a memfd_secret  (Axel Rasmussen, 1 file, -0/+17)
When one tries to grow an existing memfd_secret with ftruncate, one gets a panic [1]. For example, doing the following reliably induces the panic:

  fd = memfd_secret();
  ftruncate(fd, 10);
  ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  strcpy(ptr, "123456789");
  munmap(ptr, 10);
  ftruncate(fd, 20);

The basic reason for this is: when we grow with ftruncate, we call down into simple_setattr, and then truncate_inode_pages_range, and eventually we try to zero part of the memory. The normal truncation code does this via the direct map (i.e., it calls page_address() and hands that to memset()).

For memfd_secret though, we specifically don't map our pages via the direct map (i.e. we call set_direct_map_invalid_noflush() on every fault). So the address returned by page_address() isn't useful, and when we try to memset() with it we panic.

This patch avoids the panic by implementing a custom setattr for memfd_secret, which detects resizes specifically (setting the size for the first time works just fine, since there are no existing pages to try to zero), and rejects them with EINVAL.

One could argue growing should be supported, but I think that will require a significantly more lengthy change. So, I propose a minimal fix for the benefit of stable kernels, and then perhaps to extend memfd_secret to support growing in a separate patch.

[1]:
  BUG: unable to handle page fault for address: ffffa0a889277028
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD afa01067 P4D afa01067 PUD 83f909067 PMD 83f8bf067 PTE 800ffffef6d88060
  Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 0 PID: 281 Comm: repro Not tainted 5.17.0-dbg-DEV #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:memset_erms+0x9/0x10
  Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
  RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
  RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
  RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
  R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
  R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
  FS: 00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? zero_user_segments+0x82/0x190
   truncate_inode_partial_folio+0xd4/0x2a0
   truncate_inode_pages_range+0x380/0x830
   truncate_setsize+0x63/0x80
   simple_setattr+0x37/0x60
   notify_change+0x3d8/0x4d0
   do_sys_ftruncate+0x162/0x1d0
   __x64_sys_ftruncate+0x1c/0x20
   do_syscall_64+0x44/0xa0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  Modules linked in: xhci_pci xhci_hcd virtio_net net_failover failover virtio_blk virtio_balloon uhci_hcd ohci_pci ohci_hcd evdev ehci_pci ehci_hcd 9pnet_virtio 9p netfs 9pnet
  CR2: ffffa0a889277028

[lkp@intel.com: secretmem_iops can be static]
Signed-off-by: kernel test robot <lkp@intel.com>
[axelrasmussen@google.com: return EINVAL]
Link: https://lkml.kernel.org/r/20220324210909.1843814-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220412193023.279320-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
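A hedged sketch of the minimal fix's shape: a custom setattr that permits the initial sizing but rejects resizes (signature per the ~5.18 VFS; details illustrative):

  static int secretmem_setattr(struct user_namespace *mnt_userns,
                               struct dentry *dentry, struct iattr *iattr)
  {
          struct inode *inode = d_inode(dentry);

          /* Setting the size of an empty file is fine: nothing to
           * zero. Any later size change would have to zero pages that
           * are deliberately absent from the direct map, so reject. */
          if ((iattr->ia_valid & ATTR_SIZE) && inode->i_size)
                  return -EINVAL;

          return simple_setattr(mnt_userns, dentry, iattr);
  }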
2022-04-15  tmpfs: fix regressions from wider use of ZERO_PAGE  (Hugh Dickins, 2 files, -17/+20)
Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change. Whilst nfsd_splice_action() does contain some questionable handling of repeated pages, and Chuck was able to work around there, history from Mark Hemment makes clear that there might be similar dangers elsewhere: it was not a good idea for me to pass ZERO_PAGE down to unknown actors. Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when iter_is_iovec(); in other cases, use the more natural iov_iter_zero() instead of copy_page_to_iter(). We would use iov_iter_zero() throughout, but the x86 clear_user() is not nearly so well optimized as copy to user (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds). And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)): which had caused boot failure on arm noMMU STM32F7 and STM32H7 boards Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com Fixes: 56a8c8eb1eaf ("tmpfs: do not allocate pages on read") Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com> Reported-by: Chuck Lever III <chuck.lever@oracle.com> Tested-by: Chuck Lever III <chuck.lever@oracle.com> Cc: Mark Hemment <markhemm@googlemail.com> Cc: Patrice CHOTARD <patrice.chotard@foss.st.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Lukas Czerner <lczerner@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
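A hedged sketch of the resulting hole-handling branch in shmem_file_read_iter() (loop state such as offset, nr and to is elided; illustrative, not the literal diff):

  /* Hand ZERO_PAGE to copy_page_to_iter() only for plain iovec
   * iterators; any other destination gets iov_iter_zero(), which
   * makes no assumptions about the source page. */
  if (iter_is_iovec(to))
          ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
  else
          ret = iov_iter_zero(nr, to);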
2022-04-15  MAINTAINERS: Broadcom internal lists aren't maintainers  (Joe Perches, 1 file, -32/+32)
Convert the broadcom internal list M: and L: entries to R: as exploder email addresses are neither maintainers nor mailing lists. Reorder the entries as necessary. Link: https://lkml.kernel.org/r/04eb301f5b3adbefdd78e76657eff0acb3e3d87f.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15  i2c: ismt: Fix undefined behavior due to shift overflowing the constant  (Borislav Petkov, 1 file, -2/+2)
Fix:

  drivers/i2c/busses/i2c-ismt.c: In function 'ismt_hw_init':
  drivers/i2c/busses/i2c-ismt.c:770:2: error: case label does not reduce to an integer constant
    case ISMT_SPGT_SPD_400K:
    ^~~~
  drivers/i2c/busses/i2c-ismt.c:773:2: error: case label does not reduce to an integer constant
    case ISMT_SPGT_SPD_1M:
    ^~~~

See https://lore.kernel.org/r/YkwQ6%2BtIH8GQpuct@zn.tnic for the gory details as to why it triggers with older gccs only.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Seth Heasley <seth.heasley@intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
2022-04-15  i2c: dev: Force cast user pointers in compat_i2cdev_ioctl()  (Andy Shevchenko, 1 file, -1/+1)
Sparse has warned us about a wrong address space for user pointers:

  i2c-dev.c:561:50: warning: incorrect type in initializer (different address spaces)
  i2c-dev.c:561:50:    expected unsigned char [usertype] *buf
  i2c-dev.c:561:50:    got void [noderef] __user *

Force cast the pointer to (__u8 *), which is what the I²C core code uses. Note, this is an additional fix to the previously addressed similar issue in the I2C_RDWR case in the same function.

Fixes: 3265a7e6b41b ("i2c: dev: Add __user annotation")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
2022-04-15  i2c: dev: check return value when calling dev_set_name()  (Andy Shevchenko, 1 file, -5/+10)
If dev_set_name() fails, dev_name() is NULL; check the return value of dev_set_name() to avoid a NULL pointer dereference.

Fixes: 1413ef638aba ("i2c: dev: Fix the race between the release of i2c_dev and cdev")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
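A hedged sketch of the pattern (the name format matches what i2c-dev registers; the error label is illustrative):

  res = dev_set_name(&i2c_dev->dev, "i2c-%d", adap->nr);
  if (res)        /* don't register a device with a NULL name */
          goto err_put_i2c_dev;

  res = cdev_device_add(&i2c_dev->cdev, &i2c_dev->dev);
  if (res)
          goto err_put_i2c_dev;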
2022-04-15  i2c: qcom-geni: Use dev_err_probe() for GPI DMA error  (Bjorn Andersson, 1 file, -4/+2)
The GPI DMA engine driver can be compiled as a module, in which case the likely probe deferral "error" shows up in the kernel log. Switch to using dev_err_probe() to silence this warning and to ensure that "devices_deferred" in debugfs carries this information. Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Reviewed-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Wolfram Sang <wsa@kernel.org>
2022-04-15  i2c: imx: Implement errata ERR007805 or e7805 bus frequency limit  (Marek Vasut, 1 file, -0/+33)
The i.MX8MP Mask Set Errata for Mask 1P33A, Rev. 2.0, describes erratum ERR007805 as quoted below. This erratum is found on all MX8M{M,N,P,Q}, MX7{S,D}, MX6{UL{,L,Z},S{,LL,X},S,D,DL,Q,DP,QP}. MX7ULP, MX8Q, MX8X are not affected. MX53 and older status is unknown, as the erratum first appears in MX6 errata sheets from 2016 and the latest errata sheet for MX53 is from 2015; older SoC errata sheets predate the MX53 errata sheet. MX8ULP and MX9 status is unknown as the errata sheet is not available yet.

"
ERR007805 I2C: When the I2C clock speed is configured for 400 kHz, the SCL low period violates the I2C spec of 1.3 uS min

Description: When the I2C module is programmed to operate at the maximum clock speed of 400 kHz (as defined by the I2C spec), the SCL clock low period violates the I2C spec of 1.3 uS min. The user must reduce the clock speed to obtain the SCL low time to meet the 1.3us I2C minimum required. This behavior means the SoC is not compliant to the I2C spec at 400kHz.

Workaround: To meet the clock low period requirement in fast speed mode, SCL must be configured to 384KHz or less.
"

Implement the workaround by matching on the affected SoC specific compatible strings and by limiting the maximum bus frequency in case the SoC is affected.

Signed-off-by: Marek Vasut <marex@denx.de>
To: linux-i2c@vger.kernel.org
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
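A hedged sketch of how such a limit can be applied (the macro and helper names are illustrative, not the driver's actual identifiers):

  #define IMX_I2C_ERR007805_MAX_BITRATE   384000  /* Hz */

  /* On SoCs carrying ERR007805, clamp the requested bus frequency so
   * the SCL low period meets the 1.3 us fast-mode minimum. */
  static u32 imx_i2c_clamp_bitrate(bool has_err007805, u32 requested)
  {
          if (has_err007805 && requested > IMX_I2C_ERR007805_MAX_BITRATE)
                  return IMX_I2C_ERR007805_MAX_BITRATE;
          return requested;
  }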
2022-04-15  i2c: pasemi: Wait for write xfers to finish  (Martin Povišer, 1 file, -0/+6)
Wait for completion of write transfers before returning from the driver. At first sight it may seem advantageous to leave write transfers queued for the controller to carry out on its own time, but there's a couple of issues with it:

 * Driver doesn't check for FIFO space.
 * The queued writes can complete while the driver is in its I2C read transfer path, which means it will get confused by the raising of XEN (the 'transaction ended' signal). This can cause a spurious ENODATA error due to premature reading of the MRXFIFO register.

Adding the wait fixes some unreliability issues with the driver. There's some efficiency cost to it (especially with pasemi_smb_waitready doing its polling), but that will be alleviated once the driver receives interrupt support.

Fixes: beb58aa39e6e ("i2c: PA Semi SMBus driver")
Signed-off-by: Martin Povišer <povik+lin@cutebit.org>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
2022-04-15  dm: fix bio length of empty flush  (Shin'ichiro Kawasaki, 1 file, -0/+1)
The commit 92986f6b4c8a ("dm: use bio_clone_fast in alloc_io/alloc_tio") removed bio_clone_fast() call from alloc_tio() when ci->io->tio is available. In this case, ci->bio is not copied to ci->io->tio.clone. This is fine since init_clone_info() sets same values to ci->bio and ci->io->tio.clone. However, when incoming bios have REQ_PREFLUSH flag, __send_empty_flush() prepares a zero length bio on stack and set it to ci->bio. At this time, ci->io->tio.clone still keeps non-zero length. When alloc_tio() chooses this ci->io->tio.clone as the bio to map, it is passed to targets as non-empty flush bio. It causes bio length check failure in dm-zoned and unexpected operation such as dm_accept_partial_bio() call. To avoid the non-empty flush bio, set zero length to ci->io->tio.clone in __send_empty_flush(). Fixes: 92986f6b4c8a ("dm: use bio_clone_fast in alloc_io/alloc_tio") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-04-15  block: don't print I/O error warning for dead disks  (Christoph Hellwig, 1 file, -1/+2)
When a disk has been marked dead, don't print warnings for I/O errors as they are very much expected. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220323163815.1526998-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-15  block/compat_ioctl: fix range check in BLKGETSIZE  (Khazhismel Kumykov, 1 file, -1/+1)
The kernel ulong and compat_ulong_t may not be the same width. Use the type directly to eliminate mismatches. The bug would result in truncation rather than EFBIG for 32-bit mode on large disks.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220414224056.2875681-1-khazhy@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
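A hedged sketch of the corrected check inside the compat ioctl switch (simplified; the point is comparing against compat_ulong_t's own range):

  case BLKGETSIZE:
          size = i_size_read(bdev->bd_inode);
          /* was: (size >> 9) > ~0UL -- never true on a 64-bit kernel,
           * so 32-bit callers got a silently truncated sector count */
          if ((size >> 9) > ~(compat_ulong_t)0)
                  return -EFBIG;
          return put_user((compat_ulong_t)(size >> 9),
                          (compat_ulong_t __user *)argp);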
2022-04-15  nvme-pci: disable namespace identifiers for Qemu controllers  (Christoph Hellwig, 1 file, -1/+4)
Qemu unconditionally reports a UUID, which depending on the qemu version is either all-null (which is incorrect but harmless) or contains a single bit set for all controllers. In addition it can also optionally report an eui64, which needs to be manually set. Disable namespace identifiers for Qemu controllers entirely, even if in some cases they could be set correctly through manual intervention.

Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-04-15  nvme-pci: disable namespace identifiers for the MAXIO MAP1002/1202  (Christoph Hellwig, 1 file, -0/+4)
The MAXIO MAP1002/1202 controllers report completely bogus namespace identifiers that even change after suspend cycles. Disable using the identifiers entirely.

Reported-by: 金韬 <me@kingtous.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: 金韬 <me@kingtous.cn>
2022-04-15  nvme: add a quirk to disable namespace identifiers  (Christoph Hellwig, 2 files, -6/+23)
Add a quirk to disable using and exporting namespace identifiers for controllers where they are broken beyond repair. The most directly visible problem with non-unique namespace identifiers is that they break the /dev/disk/by-id/ links, with the link for a supposedly unique identifier now pointing to one of multiple possible namespaces that share the same ID, and a somewhat random selection of which one actually shows up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
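A hedged sketch of how such a quirk is typically wired up (the flag name follows the description; the bit position, device ids, and call site are assumptions, not the exact upstream diff):

  /* controller reports unusable namespace identifiers */
  #define NVME_QUIRK_BOGUS_NID    (1 << 18)

  /* PCI id table entry opting a known-broken controller out: */
  { PCI_DEVICE(0x1e4b, 0x1002),   /* MAXIO MAP1002 */
          .driver_data = NVME_QUIRK_BOGUS_NID, },

  /* and when parsing Identify data, skip NGUID/EUI-64/UUID: */
  if (ctrl->quirks & NVME_QUIRK_BOGUS_NID)
          return 0;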