path: root/kernel (follow)
Commit message (Author, Age, Files, Lines)
* Merge tag 'trace-v5.14-2' of ↵ (Linus Torvalds, 2021-07-09, 2 files, -2/+8)
    git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

    Pull tracing fix and cleanup from Steven Rostedt:
     "Tracing fix for histograms and a clean up in ftrace:

       - Fixed a bug that broke the .sym-offset modifier and added a test
         to make sure nothing breaks it again.

       - Replace a list_del/list_add() with a list_move()"

    * tag 'trace-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
      ftrace: Use list_move instead of list_del/list_add
      tracing/selftests: Add tests to test histogram sym and sym-offset modifiers
      tracing/histograms: Fix parsing of "sym-offset" modifier
| * ftrace: Use list_move instead of list_del/list_add (Baokun Li, 2021-07-08, 1 file, -2/+1)
    Using list_move() instead of list_del() + list_add().

    Link: https://lkml.kernel.org/r/20210608031108.2820996-1-libaokun1@huawei.com
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Baokun Li <libaokun1@huawei.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
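To illustrate why the two-call sequence and the single call are interchangeable, here is a minimal userspace sketch. It is a simplified stand-in for the kernel's list helpers, not the code from `<linux/list.h>`; `init_list_head()` and `list_len()` are names chosen for this example only.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Make an empty list: the head points at itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert entry directly after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink entry from the list it is currently on. */
static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = NULL;
}

/* Unlink entry and re-insert it after head: one call instead of two. */
static void list_move(struct list_head *entry, struct list_head *head)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    list_add(entry, head);
}

/* Count the entries on a list, excluding the head (example helper). */
static int list_len(const struct list_head *head)
{
    int n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

Calling `list_move(&item, &b)` leaves both lists in the same state as `list_del(&item); list_add(&item, &b);`, which is why the commit can be a pure cleanup.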
| * tracing/histograms: Fix parsing of "sym-offset" modifier (Steven Rostedt (VMware), 2021-07-07, 1 file, -0/+7)
    With the addition of simple mathematical operations (plus and minus),
    the parsing of the "sym-offset" modifier broke, as it took the '-' part
    of the "sym-offset" as a minus, and tried to break it up into a
    mathematical operation of "field.sym - offset", in which case it failed
    to parse (unless the event had a field called "offset").

    Both .sym and .sym-offset modifiers should not be entered into
    mathematical calculations anyway. If ".sym-offset" is found in the
    modifier, then simply make it not an operation that can be calculated
    on.

    Link: https://lkml.kernel.org/r/20210707110821.188ae255@oasis.local.home
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: stable@vger.kernel.org
    Fixes: 100719dcef447 ("tracing: Add simple expression support to hist triggers")
    Reviewed-by: Tom Zanussi <zanussi@kernel.org>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
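The parsing rule described above can be sketched in a few lines of userspace C. This is only an illustration of the idea, not the kernel's parser: `classify_expr()` and the `field_op` enum are hypothetical names introduced for this example.

```c
#include <string.h>

/* Hypothetical result type for this sketch. */
enum field_op { OP_NONE, OP_PLUS, OP_MINUS };

/*
 * Classify a histogram field expression.
 *
 * The '-' inside ".sym-offset" belongs to the modifier name, so an
 * expression carrying that modifier is never treated as arithmetic.
 */
static enum field_op classify_expr(const char *expr)
{
    const char *op;

    /* ".sym-offset" found: not an operation that can be calculated on. */
    if (strstr(expr, ".sym-offset"))
        return OP_NONE;

    /* Otherwise the first '+' or '-' decides the operation. */
    op = strpbrk(expr, "+-");
    if (!op)
        return OP_NONE;

    return *op == '+' ? OP_PLUS : OP_MINUS;
}
```

With this check in place, `call_site.sym-offset` is classified as a plain field, while `bytes_alloc-bytes_req` is still recognized as a subtraction.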
* | Merge branch 'akpm' (patches from Andrew) (Linus Torvalds, 2021-07-09, 5 files, -74/+129)
    Pull yet more updates from Andrew Morton:
     "54 patches.

      Subsystems affected by this patch series: lib, mm (slub, secretmem,
      cleanups, init, pagemap, and mremap), and debug"

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (54 commits)
      powerpc/mm: enable HAVE_MOVE_PMD support
      powerpc/book3s64/mm: update flush_tlb_range to flush page walk cache
      mm/mremap: allow arch runtime override
      mm/mremap: hold the rmap lock in write mode when moving page table entries.
      mm/mremap: use pmd/pud_poplulate to update page table entries
      mm/mremap: don't enable optimized PUD move if page table levels is 2
      mm/mremap: convert huge PUD move to separate helper
      selftest/mremap_test: avoid crash with static build
      selftest/mremap_test: update the test to handle pagesize other than 4K
      mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *
      mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *
      kdump: use vmlinux_build_id to simplify
      buildid: fix kernel-doc notation
      buildid: mark some arguments const
      scripts/decode_stacktrace.sh: indicate 'auto' can be used for base path
      scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm
      scripts/decode_stacktrace.sh: support debuginfod
      x86/dumpstack: use %pSb/%pBb for backtrace printing
      arm64: stacktrace: use %pSb for backtrace printing
      module: add printk formats to add module build ID to stacktraces
      ...
| * | kdump: use vmlinux_build_id to simplify (Stephen Boyd, 2021-07-08, 1 file, -48/+2)
    We can use the vmlinux_build_id array here now instead of open coding
    it. This mostly consolidates code.

    Link: https://lkml.kernel.org/r/20210511003845.2429846-14-swboyd@chromium.org
    Signed-off-by: Stephen Boyd <swboyd@chromium.org>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Jessica Yu <jeyu@kernel.org>
    Cc: Evan Green <evgreen@chromium.org>
    Cc: Hsin-Yi Wang <hsinyi@chromium.org>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | module: add printk formats to add module build ID to stacktraces (Stephen Boyd, 2021-07-08, 2 files, -25/+121)
    Let's make kernel stacktraces easier to identify by including the build
    ID[1] of a module if the stacktrace is printing a symbol from a module.
    This makes it simpler for developers to locate a kernel module's full
    debuginfo for a particular stacktrace. Combined with
    scripts/decode_stacktrace.sh, a developer can download the matching
    debuginfo from a debuginfod[2] server and find the exact file and line
    number for the functions plus offsets in a stacktrace that match the
    module. This is especially useful for pstore crash debugging where the
    kernel crashes are recorded in something like console-ramoops and the
    recovery kernel/modules are different or the debuginfo doesn't exist on
    the device due to space concerns (the debuginfo can be too large for
    space limited devices).

    Originally, I put this on the %pS format, but that was quickly rejected
    given that %pS is used in other places such as ftrace where build IDs
    aren't meaningful. There were some discussions on the list to put every
    module build ID into the "Modules linked in:" section of the stacktrace
    message but that quickly becomes very hard to read once you have more
    than three or four modules linked in. It also provides too much
    information when we don't expect each module to be traversed in a
    stacktrace. Having the build ID for modules that aren't important just
    makes things messy. Splitting it to multiple lines for each module
    quickly explodes the number of lines printed in an oops too, possibly
    wrapping the warning off the console. And finally, trying to stash away
    each module used in a callstack to provide the ID of each symbol printed
    is cumbersome and would require changes to each architecture to stash
    away modules and return their build IDs once unwinding has completed.

    Instead, we opt for the simpler approach of introducing new printk
    formats '%pS[R]b' for "pointer symbolic backtrace with module build ID"
    and '%pBb' for "pointer backtrace with module build ID" and then
    updating the few places in the architecture layer where the stacktrace
    is printed to use this new format.

    Before:

     Call trace:
      lkdtm_WARNING+0x28/0x30 [lkdtm]
      direct_entry+0x16c/0x1b4 [lkdtm]
      full_proxy_write+0x74/0xa4
      vfs_write+0xec/0x2e8

    After:

     Call trace:
      lkdtm_WARNING+0x28/0x30 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
      direct_entry+0x16c/0x1b4 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
      full_proxy_write+0x74/0xa4
      vfs_write+0xec/0x2e8

    [akpm@linux-foundation.org: fix build with CONFIG_MODULES=n, tweak code layout]
    [rdunlap@infradead.org: fix build when CONFIG_MODULES is not set]
      Link: https://lkml.kernel.org/r/20210513171510.20328-1-rdunlap@infradead.org
    [akpm@linux-foundation.org: make kallsyms_lookup_buildid() static]
    [cuibixuan@huawei.com: fix build error when CONFIG_SYSFS is disabled]
      Link: https://lkml.kernel.org/r/20210525105049.34804-1-cuibixuan@huawei.com

    Link: https://lkml.kernel.org/r/20210511003845.2429846-6-swboyd@chromium.org
    Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
    Link: https://sourceware.org/elfutils/Debuginfod.html [2]
    Signed-off-by: Stephen Boyd <swboyd@chromium.org>
    Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Jessica Yu <jeyu@kernel.org>
    Cc: Evan Green <evgreen@chromium.org>
    Cc: Hsin-Yi Wang <hsinyi@chromium.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
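The "Before"/"After" output shown in the commit message above can be reproduced with a small userspace formatter. This is only a sketch of the output layout, not the kernel's printk implementation; `format_frame()` and its parameters are hypothetical names for this example.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format one backtrace line as "sym+0xoff/0xsize [module buildid]".
 * module may be NULL for built-in symbols; build_id may be NULL to get
 * the old "[module]" form. Output is truncated to fit in buf.
 */
static void format_frame(char *buf, size_t len, const char *sym,
                         unsigned long off, unsigned long size,
                         const char *module,
                         const unsigned char *build_id, size_t id_len)
{
    size_t pos;
    int n = snprintf(buf, len, "%s+0x%lx/0x%lx", sym, off, size);

    if (n < 0 || (size_t)n >= len || !module)
        return;
    pos = (size_t)n;

    n = snprintf(buf + pos, len - pos, " [%s", module);
    if (n < 0 || (size_t)n >= len - pos)
        return;
    pos += (size_t)n;

    if (build_id && id_len) {
        if (pos + 1 < len) {
            buf[pos++] = ' ';
            buf[pos] = '\0';
        }
        /* Two hex digits per build ID byte, as in the example output. */
        for (size_t i = 0; i < id_len && pos + 2 < len; i++) {
            snprintf(buf + pos, len - pos, "%02x", build_id[i]);
            pos += 2;
        }
    }

    if (pos + 1 < len) {
        buf[pos++] = ']';
        buf[pos] = '\0';
    }
}
```

Passing the 20-byte build ID from the example together with the module name `lkdtm` yields the "After" form of the first frame; passing a NULL module yields a plain `vfs_write+0xec/0x2e8` line.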
| * | PM: hibernate: disable when there are active secretmem users (Mike Rapoport, 2021-07-08, 1 file, -1/+4)
    It is unsafe to allow saving of secretmem areas to the hibernation
    snapshot as they would be visible after the resume and this essentially
    will defeat the purpose of secret memory mappings.

    Prevent hibernation whenever there are active secret memory users.

    Link: https://lkml.kernel.org/r/20210518072034.31572-6-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christopher Lameter <cl@linux.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Elena Reshetova <elena.reshetova@intel.com>
    Cc: Hagen Paul Pfeifer <hagen@jauu.net>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: James Bottomley <jejb@linux.ibm.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Palmer Dabbelt <palmerdabbelt@google.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tycho Andersen <tycho@tycho.ws>
    Cc: Will Deacon <will@kernel.org>
    Cc: kernel test robot <lkp@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | mm: introduce memfd_secret system call to create "secret" memory areas (Mike Rapoport, 2021-07-08, 1 file, -0/+2)
    Introduce the "memfd_secret" system call with the ability to create
    memory areas that are visible only in the context of the owning process
    and are mapped neither into other processes nor into the kernel page
    tables.

    The secretmem feature is off by default and the user must explicitly
    enable it at boot time.

    Once secretmem is enabled, the user will be able to create a file
    descriptor using the memfd_secret() system call. The memory areas
    created by mmap() calls from this file descriptor will be unmapped from
    the kernel direct map and they will be only mapped in the page table of
    the processes that have access to the file descriptor.

    Secretmem is designed to provide the following protections:

    * Enhanced protection (in conjunction with all the other in-kernel
      attack prevention systems) against ROP attacks. Secretmem makes
      "simple" ROP insufficient to perform exfiltration, which increases the
      required complexity of the attack. Along with other protections like
      the kernel stack size limit and address space layout randomization,
      which make finding gadgets really hard, absence of any in-kernel
      primitive for accessing secret memory means the one gadget ROP attack
      can't work. Since the only way to access secret memory is to
      reconstruct the missing mapping entry, the attacker has to recover the
      physical page and insert a PTE pointing to it in the kernel and then
      retrieve the contents. That takes at least three gadgets, which is a
      level of difficulty beyond most standard attacks.

    * Prevent cross-process secret userspace memory exposures. Once the
      secret memory is allocated, the user can't accidentally pass it into
      the kernel to be transmitted somewhere. The secretmem pages cannot be
      accessed via the direct map and they are disallowed in GUP.

    * Harden against exploited kernel flaws. In order to access secretmem,
      a kernel-side attack would need to either walk the page tables and
      create new ones, or spawn a new privileged userspace process to
      perform secrets exfiltration using ptrace.

    The file descriptor based memory has several advantages over the
    "traditional" mm interfaces, such as mlock(), mprotect(), madvise().
    The file descriptor approach allows explicit and controlled sharing of
    the memory areas, and it allows sealing the operations. Besides, file
    descriptor based memory paves the way for VMMs to remove the secret
    memory range from the userspace hypervisor process, for instance QEMU.
    Andy Lutomirski says:

      "Getting fd-backed memory into a guest will take some possibly major
      work in the kernel, but getting vma-backed memory into a guest without
      mapping it in the host user address space seems much, much worse."

    memfd_secret() is made a dedicated system call rather than an extension
    to memfd_create() because its purpose is to allow the user to create
    more secure memory mappings rather than to simply allow file based
    access to the memory. Nowadays the cost of a new system call is
    negligible, while it is way simpler for userspace to deal with clear-cut
    system calls than with a multiplexer or an overloaded syscall.
    Moreover, the initial implementation of memfd_secret() is completely
    distinct from memfd_create(), so there is not much sense in overloading
    memfd_create() to begin with. If there is a need for code sharing
    between these implementations, it can be easily achieved without a need
    to adjust user visible APIs.

    The secret memory remains accessible in the process context using
    uaccess primitives, but it is not exposed to the kernel otherwise;
    secret memory areas are removed from the direct map and functions in the
    follow_page()/get_user_page() family will refuse to return a page that
    belongs to the secret memory area.

    Once there is a use case that requires exposing secretmem to the
    kernel, it will be an opt-in request in the system call flags so that
    the user would have to decide what data can be exposed to the kernel.

    Removing pages from the direct map may cause its fragmentation on
    architectures that use large pages to map the physical memory, which
    affects the system performance. However, the original Kconfig text for
    CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...
    can improve the kernel's performance a tiny bit ..." (commit
    00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1]
    showed that "... although 1G mappings are a good default choice, there
    is no compelling evidence that it must be the only choice". Hence, it
    is sufficient to have secretmem disabled by default with the ability of
    a system administrator to enable it at boot time.

    Pages in the secretmem regions are unevictable and unmovable to avoid
    accidental exposure of the sensitive data via swap or during page
    migration.

    Since the secretmem mappings are locked in memory they cannot exceed
    RLIMIT_MEMLOCK. Since these mappings are already locked independently
    from mlock(), an attempt to mlock()/munlock() secretmem range would fail
    and mlockall()/munlockall() will ignore secretmem mappings.

    However, unlike mlock()ed memory, secretmem currently behaves more like
    long-term GUP: secretmem mappings are unmovable mappings directly
    consumed by user space. With default limits, there is no excessive use
    of secretmem and it poses no real problem in combination with
    ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
    balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

    A page that was a part of the secret memory area is cleared when it is
    freed to ensure the data is not exposed to the next user of that page.

    The following example demonstrates creation of a secret mapping (error
    handling is omitted):

      fd = memfd_secret(0);
      ftruncate(fd, MAP_SIZE);
      ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);

    [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

    [akpm@linux-foundation.org: suppress Kconfig whine]

    Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
    Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christopher Lameter <cl@linux.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Elena Reshetova <elena.reshetova@intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: James Bottomley <jejb@linux.ibm.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Palmer Dabbelt <palmerdabbelt@google.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tycho Andersen <tycho@tycho.ws>
    Cc: Will Deacon <will@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
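The three-call example in the commit message above omits error handling. A possible userspace sketch with the checks added is shown here. It makes several assumptions that are not part of the commit: the C library may not provide a memfd_secret() wrapper, so the raw syscall() interface is used; the fallback number 447 is the value used on x86-64 and in the generic syscall table and should be checked for other architectures; `secret_roundtrip()` is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
/* Assumption: x86-64/generic syscall number; verify for your architecture. */
#define SYS_memfd_secret 447
#endif

/*
 * Create a secret mapping of map_size bytes, write to it, read it back.
 * Returns 0 on success or a negative errno value, for example -ENOSYS
 * when the running kernel has no secretmem support or it is not enabled.
 */
static int secret_roundtrip(size_t map_size)
{
    static const char msg[] = "secret";
    char *ptr;
    int err, ok;
    int fd = (int)syscall(SYS_memfd_secret, 0);

    if (fd < 0)
        return -errno;

    if (ftruncate(fd, (off_t)map_size) < 0) {
        err = errno;
        close(fd);
        return -err;
    }

    ptr = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        err = errno;
        close(fd);
        return -err;
    }

    /* The owning process can use the mapping like ordinary memory. */
    memcpy(ptr, msg, sizeof(msg));
    ok = memcmp(ptr, msg, sizeof(msg)) == 0;

    munmap(ptr, map_size);
    close(fd);
    return ok ? 0 : -EIO;
}
```

Calling `secret_roundtrip(4096)` on a kernel without the feature is expected to fail cleanly with a negative errno rather than crash, which makes the helper usable as a runtime probe.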
* | | Fix UCOUNT_RLIMIT_SIGPENDING counter leak (Alexey Gladkov, 2021-07-08, 1 file, -4/+16)
    We must properly handle errors when we increase the rlimit counter and
    the ucounts reference counter. We have to do this with RCU protection
    to prevent a possible use-after-free that could occur due to a
    concurrent put_cred_rcu().

    The following reproducer triggers the problem:

      $ cat testcase.sh
      case "${STEP:-0}" in
      0)
        ulimit -Si 1
        ulimit -Hi 1
        STEP=1 unshare -rU "$0"
        killall sleep
        ;;
      1)
        for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
        ;;
      esac

    with the KASAN report being along the lines of

      BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
      Write of size 4 at addr ffff8880045f031c by task swapper/2/0

      CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
      Call Trace:
       <IRQ>
       put_ucounts+0x17/0xa0
       put_cred_rcu+0xd5/0x190
       rcu_core+0x3bf/0xcb0
       __do_softirq+0xe3/0x341
       irq_exit_rcu+0xbe/0xe0
       sysvec_apic_timer_interrupt+0x6a/0x90
       </IRQ>
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       default_idle_call+0x53/0x130
       do_idle+0x311/0x3c0
       cpu_startup_entry+0x14/0x20
       secondary_startup_64_no_verify+0xc2/0xcb

      Allocated by task 127:
       kasan_save_stack+0x1b/0x40
       __kasan_kmalloc+0x7c/0x90
       alloc_ucounts+0x169/0x2b0
       set_cred_ucounts+0xbb/0x170
       ksys_unshare+0x24c/0x4e0
       __x64_sys_unshare+0x16/0x20
       do_syscall_64+0x37/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae

      Freed by task 0:
       kasan_save_stack+0x1b/0x40
       kasan_set_track+0x1c/0x30
       kasan_set_free_info+0x20/0x30
       __kasan_slab_free+0xeb/0x120
       kfree+0xaa/0x460
       put_cred_rcu+0xd5/0x190
       rcu_core+0x3bf/0xcb0
       __do_softirq+0xe3/0x341

      The buggy address belongs to the object at ffff8880045f0300
       which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 28 bytes inside of
       192-byte region [ffff8880045f0300, ffff8880045f03c0)
      The buggy address belongs to the page:
      page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
      flags: 0x100000000000200(slab|node=0|zone=1)
      raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
      raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected

      Memory state around the buggy address:
       ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
       ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ==================================================================
      Disabling lock debugging due to kernel taint

    Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Alexey Gladkov <legion@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'pm-5.14-rc1-2' of ↵ (Linus Torvalds, 2021-07-07, 1 file, -0/+1)
    git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

    Pull more power management updates from Rafael Wysocki:
     "These include cpufreq core simplifications and fixes, cpufreq driver
      updates, cpuidle driver update, a generic power domains (genpd)
      locking fix and a debug-related simplification of the PM core.

      Specifics:

       - Drop the ->stop_cpu() (not really useful) and ->resolve_freq()
         (unused) cpufreq driver callbacks and modify the users of the
         former accordingly (Viresh Kumar, Rafael Wysocki).

       - Add frequency invariance support to the ACPI CPPC cpufreq driver
         again along with the related fixes and cleanups (Viresh Kumar).

       - Update the MediaTek, qcom and SCMI ARM cpufreq drivers (Fabien
         Parent, Seiya Wang, Sibi Sankar, Christophe JAILLET).

       - Rename black/white-lists in the DT cpufreq driver (Viresh Kumar).

       - Add generic performance domains support to the dvfs DT bindings
         (Sudeep Holla).

       - Refine locking in the generic power domains (genpd) support code
         to avoid lock dependency issues (Stephen Boyd).

       - Update the MSM and qcom ARM cpuidle drivers (Bartosz Dudziak).

       - Simplify the PM core debug code by using ktime_us_delta() to
         compute time interval lengths (Mark-PK Tsai)"

    * tag 'pm-5.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (21 commits)
      PM: domains: Shrink locking area of the gpd_list_lock
      PM: sleep: Use ktime_us_delta() in initcall_debug_report()
      cpufreq: CPPC: Add support for frequency invariance
      arch_topology: Avoid use-after-free for scale_freq_data
      cpufreq: CPPC: Pass structure instance by reference
      cpufreq: CPPC: Fix potential memleak in cppc_cpufreq_cpu_init
      cpufreq: Remove ->resolve_freq()
      cpufreq: Reuse cpufreq_driver_resolve_freq() in __cpufreq_driver_target()
      cpufreq: Remove the ->stop_cpu() driver callback
      cpufreq: powernv: Migrate to ->exit() callback instead of ->stop_cpu()
      cpufreq: CPPC: Migrate to ->exit() callback instead of ->stop_cpu()
      cpufreq: intel_pstate: Combine ->stop_cpu() and ->offline()
      cpuidle: qcom: Add SPM register data for MSM8226
      dt-bindings: arm: msm: Add SAW2 for MSM8226
      dt-bindings: cpufreq: update cpu type and clock name for MT8173 SoC
      clk: mediatek: remove deprecated CLK_INFRA_CA57SEL for MT8173 SoC
      cpufreq: dt: Rename black/white-lists
      cpufreq: scmi: Fix an error message
      cpufreq: mediatek: add support for mt8365
      dt-bindings: dvfs: Add support for generic performance domains
      ...
| * | cpufreq: CPPC: Add support for frequency invariance (Viresh Kumar, 2021-07-01, 1 file, -0/+1)
    The Frequency Invariance Engine (FIE) is providing a frequency scaling
    correction factor that helps achieve more accurate load-tracking.

    Normally, this scaling factor can be obtained directly with the help of
    the cpufreq drivers as they know the exact frequency the hardware is
    running at. But that isn't the case for CPPC cpufreq driver.

    Another way of obtaining that is using the arch specific counter
    support, which is already present in kernel, but that hardware is
    optional for platforms.

    This patch updates the CPPC driver to register itself with the topology
    core to provide its own implementation (cppc_scale_freq_tick()) of
    topology_scale_freq_tick() which gets called by the scheduler on every
    tick. Note that the arch specific counters have higher priority than
    CPPC counters, if available, though the CPPC driver doesn't need to have
    any special handling for that.

    On an invocation of cppc_scale_freq_tick(), we schedule an irq work
    (since we reach here from hard-irq context), which then schedules a
    normal work item and cppc_scale_freq_workfn() updates the per_cpu
    arch_freq_scale variable based on the counter updates since the last
    tick.

    To allow platforms to disable this CPPC counter-based frequency
    invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
    which is enabled by default.

    This also exports sched_setattr_nocheck() as the CPPC driver can be
    built as a module.

    Cc: linux-acpi@vger.kernel.org
    Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
    Tested-by: Qian Cai <quic_qiancai@quicinc.com>
    Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
* | | Merge tag 'modules-for-v5.14' of ↵ (Linus Torvalds, 2021-07-07, 1 file, -3/+3)
    git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

    Pull module updates from Jessica Yu:

     - Fix incorrect logic in module_kallsyms_on_each_symbol()

     - Fix for a Coccinelle warning

    * tag 'modules-for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
      module: correctly exit module_kallsyms_on_each_symbol when fn() != 0
      kernel/module: Use BUG_ON instead of if condition followed by BUG
| * | | module: correctly exit module_kallsyms_on_each_symbol when fn() != 0 (Jon Mediero, 2021-05-26, 1 file, -1/+2)
    Commit 013c1667cf78 ("kallsyms: refactor
    {,module_}kallsyms_on_each_symbol") replaced the return inside the
    nested loop with a break, changing the semantics of the function: the
    break only exits the innermost loop, so the code continues iterating the
    symbols of the next module instead of exiting.

    Fixes: 013c1667cf78 ("kallsyms: refactor {,module_}kallsyms_on_each_symbol")
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Reviewed-by: Miroslav Benes <mbenes@suse.cz>
    Signed-off-by: Jon Mediero <jmdr@disroot.org>
    Signed-off-by: Jessica Yu <jeyu@kernel.org>
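The difference between leaving only the inner loop and leaving the whole iteration can be shown with a small userspace sketch. It mirrors the shape of the bug described above and is not the kernel function itself; `struct fake_module`, `each_symbol_buggy()`, `each_symbol_fixed()` and `stop_at_first()` are hypothetical names for this example.

```c
#include <stddef.h>

/* Callback type: a non-zero return value asks the iteration to stop. */
typedef int (*sym_fn)(void *data, const char *name);

/* Hypothetical stand-in for a module with a symbol table. */
struct fake_module {
    const char *const *syms;
    size_t nsyms;
};

/* Buggy shape: break leaves only the inner loop, the next module is still visited. */
static int each_symbol_buggy(const struct fake_module *mods, size_t nmods,
                             sym_fn fn, void *data)
{
    int ret = 0;

    for (size_t m = 0; m < nmods; m++) {
        for (size_t i = 0; i < mods[m].nsyms; i++) {
            ret = fn(data, mods[m].syms[i]);
            if (ret != 0)
                break;
        }
    }
    return ret;
}

/* Fixed shape: jump out of both loops as soon as fn() returns non-zero. */
static int each_symbol_fixed(const struct fake_module *mods, size_t nmods,
                             sym_fn fn, void *data)
{
    int ret = 0;

    for (size_t m = 0; m < nmods; m++) {
        for (size_t i = 0; i < mods[m].nsyms; i++) {
            ret = fn(data, mods[m].syms[i]);
            if (ret != 0)
                goto out;
        }
    }
out:
    return ret;
}

/* Example callback: count invocations and ask to stop immediately. */
static int stop_at_first(void *data, const char *name)
{
    (void)name;
    ++*(int *)data;
    return 1;
}
```

With two modules and a callback that always asks to stop, the buggy variant invokes the callback once per module, while the fixed variant invokes it exactly once.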
| * | | kernel/module: Use BUG_ON instead of if condition followed by BUG (zhouchuangao, 2021-05-14, 1 file, -2/+1)
    Fix the following coccinelle report:

      kernel/module.c:1018:2-5:
      WARNING: Use BUG_ON instead of if condition followed by BUG.

    BUG_ON uses unlikely in if(). Through disassembly, we can see that
    brk #0x800 is compiled to the end of the function. As you can see
    below:

      ......
      ffffff8008660bec:   d65f03c0    ret
      ffffff8008660bf0:   d4210000    brk #0x800

    Usually, the condition in if () is not satisfied. For the multi-stage
    pipeline, we do not need to perform fetch, decode and execute operations
    on the brk instruction.

    In my opinion, this can improve the efficiency of the multi-stage
    pipeline.

    Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
    Signed-off-by: Jessica Yu <jeyu@kernel.org>
* | | | Merge tag 'kgdb-5.14-rc1' of ↵ (Linus Torvalds, 2021-07-06, 3 files, -6/+7)
    git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

    Pull kgdb updates from Daniel Thompson:
     "This was an extremely quiet cycle for kgdb. This consists of two
      patches that between them address spelling errors and a switch
      fallthrough warning"

    * tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
      kgdb: Fix fall-through warning for Clang
      kgdb: Fix spelling mistakes
| * | | | kgdb: Fix fall-through warning for ClangGustavo A. R. Silva2021-06-011-0/+1
| | | | | In preparation to enable -Wimplicit-fallthrough for Clang, fix a fall-through warning by explicitly adding a goto statement instead of letting the code fall through to the next case.
| | | | |
| | | | | Link: https://github.com/KSPP/linux/issues/115
| | | | | Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
| | | | | Link: https://lore.kernel.org/r/20210528200222.GA39201@embeddedor
| | | | | Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
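A minimal sketch of this kind of fix follows, using a made-up switch rather than the kgdb code itself. The enum, the handle() function, and the default_cmd label are hypothetical. The point is only that the last statement of a case is an explicit jump, so the compiler no longer sees an implicit fall-through.

```c
#include <assert.h>

enum cmd { CMD_A, CMD_B, CMD_OTHER };	/* hypothetical commands */

static int handle(enum cmd c, int ready)
{
	int result = 0;

	switch (c) {
	case CMD_A:
		if (ready) {
			result = 1;
			break;
		}
		/* Previously control simply fell into the next label. */
		goto default_cmd;
	default:
default_cmd:
		result = -1;
		break;
	}
	return result;
}
```

The behaviour is unchanged: CMD_A with ready == 0 still ends up in the default handling, but now says so explicitly.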
| * | | | kgdb: Fix spelling mistakesZhen Lei2021-06-013-6/+6
| | | | | Fix some spelling mistakes in comments:
| | | | | initalization ==> initialization
| | | | | detatch ==> detach
| | | | | represntation ==> representation
| | | | | hexidecimal ==> hexadecimal
| | | | | delimeter ==> delimiter
| | | | | architecure ==> architecture
| | | | |
| | | | | Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
| | | | | Link: https://lore.kernel.org/r/20210529110305.9446-3-thunder.leizhen@huawei.com
| | | | | Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
* | | | | Merge branch 'core-rcu-2021.07.04' of ↵Linus Torvalds2021-07-0415-475/+734
|\ \ \ \ \  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
| | | | | |
| | | | | | Pull RCU updates from Paul McKenney:
| | | | | |
| | | | | |  - Bitmap parsing support for "all" as an alias for all bits
| | | | | |  - Documentation updates
| | | | | |  - Miscellaneous fixes, including some that overlap into mm and lockdep
| | | | | |  - kvfree_rcu() updates
| | | | | |  - mem_dump_obj() updates, with acks from one of the slab-allocator maintainers
| | | | | |  - RCU NOCB CPU updates, including limited deoffloading
| | | | | |  - SRCU updates
| | | | | |  - Tasks-RCU updates
| | | | | |  - Torture-test updates
| | | | | |
| | | | | | * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
| | | | | |   tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
| | | | | |   rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
| | | | | |   rcu: Add missing __releases() annotation
| | | | | |   rcu: Remove obsolete rcu_read_unlock() deadlock commentary
| | | | | |   rcu: Improve comments describing RCU read-side critical sections
| | | | | |   rcu: Create an unrcu_pointer() to remove __rcu from a pointer
| | | | | |   srcu: Early test SRCU polling start
| | | | | |   rcu: Fix various typos in comments
| | | | | |   rcu/nocb: Unify timers
| | | | | |   rcu/nocb: Prepare for fine-grained deferred wakeup
| | | | | |   rcu/nocb: Only cancel nocb timer if not polling
| | | | | |   rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
| | | | | |   rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
| | | | | |   rcu/nocb: Allow de-offloading rdp leader
| | | | | |   rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
| | | | | |   rcu: Don't penalize priority boosting when there is nothing to boost
| | | | | |   rcu: Point to documentation of ordering guarantees
| | | | | |   rcu: Make rcu_gp_cleanup() be noinline for tracing
| | | | | |   rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
| | | | | |   rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
| | | | | |   ...
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| *-----------. \ \ \ \ Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', ↵Paul E. McKenney2021-05-1815-469/+731
| |\ \ \ \ \ \ \ \ \ \ \  'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD
| | | |_|_|_|_|_|_|/ / /
| | |/| | | | | | | | |   bitmaprange.2021.05.10c: Allow "all" for bitmap ranges.
| | | | | | | | | | | | doc.2021.05.10c: Documentation updates.
| | | | | | | | | | | | fixes.2021.05.13a: Miscellaneous fixes.
| | | | | | | | | | | | kvfree_rcu.2021.05.10c: kvfree_rcu() updates.
| | | | | | | | | | | | mmdumpobj.2021.05.10c: mem_dump_obj() updates.
| | | | | | | | | | | | nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading.
| | | | | | | | | | | | srcu.2021.05.12a: SRCU updates.
| | | | | | | | | | | | tasks.2021.05.18a: Tasks-RCU updates.
| | | | | | | | | | | | torture.2021.05.10c: Torture-test updates.
| | | | | | | | * | | | rcu: Don't penalize priority boosting when there is nothing to boostPaul E. McKenney2021-05-111-3/+14
| | | | | | | | | | | | RCU priority boosting cannot do anything unless there is at least one task blocking the current RCU grace period that was preempted within the RCU read-side critical section that it still resides in. However, the current rcu_torture_boost_failed() code will count this as an RCU priority-boosting failure if there were no CPUs blocking the current grace period. This situation can happen (for example) if the last CPU blocking the current grace period was subjected to vCPU preemption, which is always a risk for rcutorture guest OSes.
| | | | | | | | | | | |
| | | | | | | | | | | | This commit therefore causes rcu_torture_boost_failed() to refrain from reporting failure unless there is at least one task blocking the current RCU grace period that was preempted within the RCU read-side critical section that it still resides in.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | * | | | rcutorture: Move mem_dump_obj() tests into separate functionPaul E. McKenney2021-05-111-39/+42
| | | | | | | | | | | | To make the purpose of the code more apparent, this commit moves the tests of mem_dump_obj() to a new rcu_torture_mem_dump_obj() function and calls it from rcu_torture_cleanup().
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | * | | | rcutorture: Don't count CPU-stalled time against priority boostingPaul E. McKenney2021-05-112-5/+18
| | | | | | | | | | | | It will frequently be the case that rcu_torture_boost() will get a ->start_gp_poll() cookie that needs almost all of the current grace period plus an additional grace period to elapse before ->poll_gp_state() will return true. It is quite possible that the current grace period will have (say) two seconds of stall by a CPU failing to pass through a quiescent state, followed by 300 milliseconds of delay due to a preempted reader. The next grace period might suffer only one second of stall by a CPU, followed by another 300 milliseconds of delay due to a preempted reader. This is an example of RCU priority boosting doing its job, but the full elapsed time of 3.6 seconds exceeds the 3.5-second limit. In addition, there is no CPU stall in force at the 3.5-second mark, so this would nevertheless currently be counted as an RCU priority boosting failure.
| | | | | | | | | | | |
| | | | | | | | | | | | This commit therefore avoids this sort of false positive by resetting the gp_state_time timestamp any time that the current grace period is being blocked by a CPU. This results in extremely frequent calls to the ->check_boost_failed() function, so this commit provides a lockless fastpath that is selected by supplying a NULL CPU-number pointer.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
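The arithmetic in this message (2000 + 300 + 1000 + 300 = 3600 ms against a 3500 ms limit) can be replayed with a toy model. This is a simplified illustration, not rcutorture's code: the struct phase type, the boost_failed() helper, and checking only at phase boundaries are assumptions of the sketch. Only the numbers and the idea of resetting gp_state_time while a CPU is blocking the grace period come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BOOST_LIMIT_MS 3500	/* the 3.5-second limit from the message */

/* One stretch of time during which the grace period is held up. */
struct phase {
	int duration_ms;
	bool cpu_stalled;	/* true: a CPU, not a preempted reader, blocks the GP */
};

/* The scenario described in the commit message. */
static const struct phase example[] = {
	{ 2000, true },		/* first GP: CPU stall */
	{  300, false },	/* first GP: preempted reader */
	{ 1000, true },		/* second GP: CPU stall */
	{  300, false },	/* second GP: preempted reader */
};

/* Would a boost failure be reported at the end of some phase? */
static bool boost_failed(const struct phase *p, size_t n, bool reset_on_cpu_stall)
{
	int now = 0;
	int gp_state_time = 0;	/* start of the interval being judged */

	for (size_t i = 0; i < n; i++) {
		now += p[i].duration_ms;
		if (reset_on_cpu_stall && p[i].cpu_stalled) {
			/* Boosting cannot help here, so restart the clock. */
			gp_state_time = now;
			continue;
		}
		if (now - gp_state_time > BOOST_LIMIT_MS)
			return true;
	}
	return false;
}
```

Without the reset, the check at the 3600 ms mark trips the limit. With the reset, only the 300 ms reader delays are ever charged against priority boosting.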
| | | | | | | | * | | | rcutorture: Forgive RCU boost failures when CPUs don't pass through QSPaul E. McKenney2021-05-113-26/+79
| | | | | | | | | | | | Currently, rcu_torture_boost() runs CPU-bound at real-time priority to force RCU priority inversions. It then checks that grace periods progress during this CPU-bound time. If grace periods fail to progress, it reports an RCU priority boosting failure. However, it is possible (and sometimes does happen) that the grace period fails to progress due to a CPU failing to pass through a quiescent state for an extended time period (3.5 seconds by default). This can happen due to vCPU preemption, long-running interrupts, and much else besides. There is nothing that RCU priority boosting can do about these situations, and so they should not be counted as RCU priority boosting failures.
| | | | | | | | | | | |
| | | | | | | | | | | | This commit therefore checks for CPUs (as opposed to preempted tasks) holding up a grace period, and flags the resulting RCU priority boosting failures, but does not splat nor count them as errors. It does rate-limit them to avoid flooding the console log.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | * | | | rcutorture: Make rcu_torture_boost_failed() check for GP endPaul E. McKenney2021-05-111-0/+6
| | | | | | | | | | | | It is possible that a delayed grace period that rcu_torture_boost() was polling for ended while rcu_torture_boost_failed() was printing the failure splat. It would be good to know when this happens. This commit therefore has rcu_torture_boost_failed() recheck the grace period after printing the splat, and print a message indicating whether or not the grace period has ended.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | * | | | rcutorture: Consolidate rcu_torture_boost() timing and statisticsPaul E. McKenney2021-05-111-10/+4
| | | | | | | | | | | | This commit consolidates two loops in rcu_torture_boost(), one of which counts the number of boost-test episodes and the other of which computes the start time of the next episode, into one loop that does both with but a single acquisition of boost_mutex. This means that the count of the number of boost-test episodes is incremented after an episode completes rather than before it starts, but it also avoids the over-counting that was possible previously.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | * | | | rcutorture: Delay-based false positives for RCU priority boosting testsPaul E. McKenney2021-05-111-2/+6
| | | | | | | | | | | | If an rcu_torture_boost() kthread determines that its grace period has not yet ended, it invokes rcu_torture_boost_failed() which checks whether enough time has elapsed for this to be considered a failure of RCU priority boosting, and, if so, flags the error. Unfortunately, that kthread might be preempted for some seconds between the time that it checks the grace period and the time that it checks the time. This delay can result in a false positive, featuring a complaint that a particular grace period has not ended, followed by a diagnostic dump featuring a much later grace period.
| | | | | | | | | | | |
| | | | | | | | | | | | This commit avoids these false positives by rechecking for the end of the grace period after the time check.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
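The race and its fix follow a general "check, get delayed, recheck" pattern, which the sketch below models with a fake clock. The struct gp_clock type and the gp_done() and report_failure() helpers are hypothetical names invented for this illustration, and the preemption is simulated by simply advancing the clock between the two checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Fake clock plus the time at which the polled grace period ends. */
struct gp_clock {
	long now_ms;
	long gp_end_ms;
};

static bool gp_done(const struct gp_clock *c)
{
	return c->now_ms >= c->gp_end_ms;
}

/*
 * Returns true if a boosting failure would be reported.
 * preempt_ms models the kthread being preempted between the
 * grace-period check and the time check.
 */
static bool report_failure(struct gp_clock *c, long deadline_ms,
			   long preempt_ms, bool recheck)
{
	if (gp_done(c))
		return false;		/* grace period already ended */
	c->now_ms += preempt_ms;	/* preempted right here */
	if (c->now_ms <= deadline_ms)
		return false;		/* not late yet */
	if (recheck && gp_done(c))
		return false;		/* it ended while we were away */
	return true;
}
```

A grace period that ends at 100 ms is reported as a failure when the checker is preempted for five seconds and does not recheck. With the recheck, only a grace period that is still pending is reported.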
| | | | | | | | * | | | rcutorture: Judge RCU priority boosting on grace periods, not callbacksPaul E. McKenney2021-05-111-60/+51
| | | | | | | | | | | | Currently, rcutorture's testing of RCU priority boosting insists not only that grace periods complete, but also that callbacks be invoked. Although this is in fact what the user would want, ensuring that there is sufficient CPU bandwidth devoted to callback execution is in fact the user's responsibility. One could argue that rcutorture can take on that responsibility, which is true in theory. But in practice, ensuring sufficient CPU bandwidth to ksoftirqd, any rcuc kthreads, and any rcuo kthreads is not particularly consistent with rcutorture's main job, that of stress-testing RCU. In addition, if the system administrator (say) makes very poor choices when pinning rcuo kthreads and then runs rcutorture, there really isn't much rcutorture can do.
| | | | | | | | | | | |
| | | | | | | | | | | | Besides, RCU priority boosting only boosts lagging readers, not all the machinery required to invoke callbacks in a timely fashion.
| | | | | | | | | | | |
| | | | | | | | | | | | This commit therefore switches rcutorture's evaluation of RCU priority boosting from callback execution to grace-period completion by using the new start_poll_synchronize_rcu() and poll_state_synchronize_rcu() functions. When rcutorture is built in (as in when there is no innocent workload to inconvenience), the ksoftirqd kthreads are boosted to real-time priority 2 in order to allow timeouts to work properly in the face of rcutorture's testing of RCU priority boosting. Indeed, it is not as easy as it looks to create a reliable test of RCU priority boosting without destroying the rest of the kernel!
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
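The cookie-style polling interface can be illustrated with a toy grace-period counter. This is a deliberately simplified model and not how the kernel encodes grace-period sequence numbers: struct gp_model and the functions gp_start(), gp_end(), start_poll_model(), and poll_state_model() are hypothetical, and counter wrap-around is ignored. It does show why a cookie taken while a grace period is already running needs the rest of that grace period plus one more.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy grace-period state; not the kernel's sequence-number encoding. */
struct gp_model {
	unsigned long completed;	/* grace periods that have ended */
	bool in_progress;		/* is one running right now? */
};

static void gp_start(struct gp_model *g)
{
	assert(!g->in_progress);
	g->in_progress = true;
}

static void gp_end(struct gp_model *g)
{
	assert(g->in_progress);
	g->in_progress = false;
	g->completed++;
}

/*
 * Take a cookie naming the grace period whose end satisfies the caller.
 * A grace period already in progress may have started before the
 * caller's update, so it does not count: wait for the one after it.
 */
static unsigned long start_poll_model(const struct gp_model *g)
{
	return g->completed + (g->in_progress ? 2 : 1);
}

/* Non-blocking check: has the grace period named by the cookie ended? */
static bool poll_state_model(const struct gp_model *g, unsigned long cookie)
{
	return g->completed >= cookie;
}
```

A caller takes a cookie once and then polls as often as it likes without blocking, which is what lets a test judge grace-period completion independently of callback invocation.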
| | | | | | | | * | | | rcutorture: Abstract read-lock-held checksPaul E. McKenney2021-05-111-10/+19
| | | | | | | | | | | | This commit adds a (*readlock_held)() function pointer to the rcu_torture_ops structure in order to make the rcu_torture_one_read() function's rcu_dereference_check() lockdep expression more appropriate for a given run.
| | | | | | | | | | | |
| | | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
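The general technique, a per-flavor hook in an operations structure, can be sketched as follows. Only the readlock_held member name comes from the commit message. The struct torture_ops_model type, the nesting counter, and the choice to treat a missing hook as "nothing to check" are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for an ops structure with a per-flavor hook. */
struct torture_ops_model {
	const char *name;
	bool (*readlock_held)(void);	/* may be NULL */
};

/* Hypothetical read-side nesting counter for one flavor. */
static int rcu_depth;

static bool rcu_held(void)
{
	return rcu_depth > 0;
}

/*
 * Condition a reader would hand to its debug check.
 * Assumption of this sketch: no hook means nothing to verify.
 */
static bool reader_check_ok(const struct torture_ops_model *ops)
{
	return !ops->readlock_held || ops->readlock_held();
}
```

Each flavor under test supplies the predicate that matches its own read-side primitive, so the reader path needs no per-flavor special cases.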
| | | | | | | | * | | | refscale: Add acqrel, lock, and lock-irqPaul E. McKenney2021-05-111-2/+107
| | | |_|_|_|_|/ / / /  This commit adds scale_type of acqrel, lock, and lock-irq to test acquisition and release. Note that the refscale.nreaders=1 module parameter is required if you wish to test uncontended locking. In contrast, acqrel uses a per-CPU variable, so should be just fine with large values of the refscale.nreaders module parameter.
| | |/| | | | | | | |
| | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | * | | | tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inlinePaul E. McKenney2021-05-182-1/+4
| | | | | | | | | | | In some architectures, the no-op variant of show_rcu_tasks_gp_kthreads() gets "no previous prototype" compiler warnings. These are false positives given that kernel/rcu/tasks.h is included only once. But why put up with the compiler noise? This commit therefore adds "static inline" to this definition to force the compiler to accept this situation, while also moving it to its proper place in kernel/rcu/rcu.h.
| | | | | | | | | | |
| | | | | | | | | | | Reported-by: kernel test robot <lkp@intel.com>
| | | | | | | | | | | [ paulmck: Update per Stephen Rothwell feedback. ]
| | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | * | | | rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent statesPaul E. McKenney2021-05-181-0/+1
| | | | | | | | | | | Heavy networking load can cause a CPU to execute continuously and indefinitely within ksoftirqd, in which case there will be no voluntary task switches and thus no RCU-tasks quiescent states. This commit therefore causes the existing rcu_softirq_qs() to provide an RCU-tasks quiescent state.
| | | | | | | | | | |
| | | | | | | | | | | This of course means that __do_softirq() and its callers cannot be invoked from within a tracing trampoline.
| | | | | | | | | | |
| | | | | | | | | | | Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
| | | | | | | | | | | Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
| | | | | | | | | | | Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
| | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | | | | Cc: Steven Rostedt <rostedt@goodmis.org>
| | | | | | | | | | | Cc: Masami Hiramatsu <mhiramat@kernel.org>
| | | | | | | * | | | rcu-tasks: Add block comment laying out RCU Rude designPaul E. McKenney2021-05-111-2/+7
| | | | | | | | | | | This commit adds a block comment that gives a high-level overview of how RCU Rude grace periods progress. It also gives an overview of the memory ordering.
| | | | | | | | | | |
| | | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | * | | | rcu-tasks: Add block comment laying out RCU Tasks designPaul E. McKenney2021-05-111-0/+40
| | | |_|_|_|/ / / /  This commit adds a block comment that gives a high-level overview of how RCU tasks grace periods progress. It also adds a note about how exiting tasks are handled, plus it gives an overview of the memory ordering.
| | |/| | | | | | |
| | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | * | | | srcu: Early test SRCU polling startFrederic Weisbecker2021-05-121-1/+5
| | | | | | | | | | Place an early call to start_poll_synchronize_srcu() before the invocation of call_srcu() on the same srcu_struct structure. After the later call to srcu_barrier(), the completion of the first grace period should be visible to a subsequent invocation of poll_state_synchronize_srcu(), and if not, warn.
| | | | | | | | | |
| | | | | | | | | | Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
| | | | | | | | | | Cc: Boqun Feng <boqun.feng@gmail.com>
| | | | | | | | | | Cc: Lai Jiangshan <jiangshanlai@gmail.com>
| | | | | | | | | | Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
| | | | | | | | | | Cc: Josh Triplett <josh@joshtriplett.org>
| | | | | | | | | | Cc: Joel Fernandes <joel@joelfernandes.org>
| | | | | | | | | | Cc: Uladzislau Rezki <urezki@gmail.com>
| | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | * | | | srcu: Fix broken node geometry after early ssp initFrederic Weisbecker2021-05-113-1/+20
| | | | | | | | | | An srcu_struct structure that is initialized before rcu_init_geometry() will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once rcu_init_geometry() is called, this hierarchy is compressed as needed for the actual maximum number of CPUs for this system.
| | | | | | | | | |
| | | | | | | | | | Later on, that srcu_struct structure is confused, sometimes referring to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead to the new num_possible_cpus() hierarchy. For example, each of its ->mynode fields continues to reference the original leaf rcu_node structures, some of which might no longer exist. On the other hand, srcu_for_each_node_breadth_first() traverses the new node hierarchy.
| | | | | | | | | |
| | | | | | | | | | There are at least two bad possible outcomes to this:
| | | | | | | | | |
| | | | | | | | | | 1) a) A callback enqueued early on an srcu_data structure (call it *sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf (say 3 levels).
| | | | | | | | | |
| | | | | | | | | |    b) The grace period ends after rcu_init_geometry() shrinks the nodes level to a single one. srcu_gp_end() walks through the new srcu_node hierarchy without ever reaching the old leaves so the callback is never executed.
| | | | | | | | | |
| | | | | | | | | |    This is easily reproduced on an 8-CPU machine with CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests verification never completes and the boot hangs:
| | | | | | | | | |
| | | | | | | | | |    [ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
| | | | | | | | | |    [ 5413.147564] Not tainted 5.12.0-rc4+ #28
| | | | | | | | | |    [ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
| | | | | | | | | |    [ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
| | | | | | | | | |    [ 5413.168099] Call Trace:
| | | | | | | | | |    [ 5413.170555] __schedule+0x36c/0x930
| | | | | | | | | |    [ 5413.174057] ? wait_for_completion+0x88/0x110
| | | | | | | | | |    [ 5413.178423] schedule+0x46/0xf0
| | | | | | | | | |    [ 5413.181575] schedule_timeout+0x284/0x380
| | | | | | | | | |    [ 5413.185591] ? wait_for_completion+0x88/0x110
| | | | | | | | | |    [ 5413.189957] ? mark_held_locks+0x61/0x80
| | | | | | | | | |    [ 5413.193882] ? mark_held_locks+0x61/0x80
| | | | | | | | | |    [ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50
| | | | | | | | | |    [ 5413.202173] ? wait_for_completion+0x88/0x110
| | | | | | | | | |    [ 5413.206535] wait_for_completion+0xb4/0x110
| | | | | | | | | |    [ 5413.210724] ? srcu_torture_stats_print+0x110/0x110
| | | | | | | | | |    [ 5413.215610] srcu_barrier+0x187/0x200
| | | | | | | | | |    [ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50
| | | | | | | | | |    [ 5413.224244] ? rdinit_setup+0x2b/0x2b
| | | | | | | | | |    [ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40
| | | | | | | | | |    [ 5413.232700] do_one_initcall+0x63/0x310
| | | | | | | | | |    [ 5413.236541] ? rdinit_setup+0x2b/0x2b
| | | | | | | | | |    [ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80
| | | | | | | | | |    [ 5413.244912] kernel_init_freeable+0x253/0x28f
| | | | | | | | | |    [ 5413.249273] ? rest_init+0x250/0x250
| | | | | | | | | |    [ 5413.252846] kernel_init+0xa/0x110
| | | | | | | | | |    [ 5413.256257] ret_from_fork+0x22/0x30
| | | | | | | | | |
| | | | | | | | | | 2) An srcu_struct structure that is initialized before rcu_init_geometry() and used afterward will always have stale sdp->mynode references, resulting in callbacks being missed in srcu_gp_end(), just like in the previous scenario.
| | | | | | | | | |
| | | | | | | | | | This commit therefore causes init_srcu_struct_nodes() to initialize the geometry, if needed. This ensures that the srcu_node hierarchy is properly built and distributed from the get-go.
| | | | | | | | | |
| | | | | | | | | | Suggested-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | | | | | Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
| | | | | | | | | | Cc: Boqun Feng <boqun.feng@gmail.com>
| | | | | | | | | | Cc: Lai Jiangshan <jiangshanlai@gmail.com>
| | | | | | | | | | Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
| | | | | | | | | | Cc: Josh Triplett <josh@joshtriplett.org>
| | | | | | | | | | Cc: Joel Fernandes <joel@joelfernandes.org>
| | | | | | | | | | Cc: Uladzislau Rezki <urezki@gmail.com>
| | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | * | | | srcu: Initialize SRCU after timersFrederic Weisbecker2021-05-114-8/+5
| | | | | | | | | | Once srcu_init() is called, the SRCU core will make use of delayed workqueues, which rely on timers. However init_timers() is called several steps after rcu_init(). This means that a call_srcu() after rcu_init() but before init_timers() would find itself within a dangerously uninitialized timer core.
| | | | | | | | | |
| | | | | | | | | | This commit therefore creates a separate call to srcu_init() after init_timers() completes, which ensures that we stay in early SRCU mode until timers are safe(r).
| | | | | | | | | |
| | | | | | | | | | Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
| | | | | | | | | | Cc: Uladzislau Rezki <urezki@gmail.com>
| | | | | | | | | | Cc: Boqun Feng <boqun.feng@gmail.com>
| | | | | | | | | | Cc: Lai Jiangshan <jiangshanlai@gmail.com>
| | | | | | | | | | Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
| | | | | | | | | | Cc: Josh Triplett <josh@joshtriplett.org>
| | | | | | | | | | Cc: Joel Fernandes <joel@joelfernandes.org>
| | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | * | | | srcu: Remove superfluous ssp initialization for early callbacksFrederic Weisbecker2021-05-111-1/+0
| | | | | | | | | | Pre-srcu_init() invocations of call_srcu() initialize the srcu_struct structure in question, so there is no need to check this initialization in srcu_init() when initiating grace periods for srcu_struct structures that had early call_srcu() invocations. This commit therefore drops the calls to check_init_srcu_struct() in srcu_init().
| | | | | | | | | |
| | | | | | | | | | Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
| | | | | | | | | | Cc: Boqun Feng <boqun.feng@gmail.com>
| | | | | | | | | | Cc: Lai Jiangshan <jiangshanlai@gmail.com>
| | | | | | | | | | Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
| | | | | | | | | | Cc: Josh Triplett <josh@joshtriplett.org>
| | | | | | | | | | Cc: Joel Fernandes <joel@joelfernandes.org>
| | | | | | | | | | Cc: Uladzislau Rezki <urezki@gmail.com>
| | | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | | * | | | srcu: Remove superfluous sdp->srcu_lock_count zero fillingFrederic Weisbecker2021-05-111-10/+2
| | | |_|_|/ / / /  Because alloc_percpu() zeroes out the allocated memory, there is no need to zero-fill newly allocated per-CPU memory. This commit therefore removes the loop zeroing the ->srcu_lock_count and ->srcu_unlock_count arrays from init_srcu_struct_nodes(). This is the only use of that function's is_static parameter, which this commit also removes.
| | |/| | | | | |
| | | | | | | | | Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
| | | | | | | | | Cc: Boqun Feng <boqun.feng@gmail.com>
| | | | | | | | | Cc: Lai Jiangshan <jiangshanlai@gmail.com>
| | | | | | | | | Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
| | | | | | | | | Cc: Josh Triplett <josh@joshtriplett.org>
| | | | | | | | | Cc: Joel Fernandes <joel@joelfernandes.org>
| | | | | | | | | Cc: Uladzislau Rezki <urezki@gmail.com>
| | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu: Fix various typos in commentsIngo Molnar2021-05-126-13/+13
| | | | | | | | | Fix ~12 single-word typos in RCU code comments.
| | | | | | | | |
| | | | | | | | | [ paulmck: Apply feedback from Randy Dunlap. ]
| | | | | | | | | Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
| | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Unify timersFrederic Weisbecker2021-05-122-56/+42
| | | | | | | | | Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar, this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced. As a result, timers perform all kinds of deferred wakeups but other deferred wakeup callsites only handle non-bypass wakeups in order not to wake up rcuo too early.
| | | | | | | | |
| | | | | | | | | The timer also unconditionally executes a full barrier so as to order timer_pending() and callback enqueue although the path performing RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also test against the rdp leader instead of the current rdp.
| | | | | | | | |
| | | | | | | | | This unconditional full barrier shouldn't bring visible overhead since these timers almost never fire.
| | | | | | | | |
| | | | | | | | | Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
| | | | | | | | | Cc: Josh Triplett <josh@joshtriplett.org>
| | | | | | | | | Cc: Lai Jiangshan <jiangshanlai@gmail.com>
| | | | | | | | | Cc: Joel Fernandes <joel@joelfernandes.org>
| | | | | | | | | Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
| | | | | | | | | Cc: Boqun Feng <boqun.feng@gmail.com>
| | | | | | | | | Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Prepare for fine-grained deferred wakeupFrederic Weisbecker2021-05-123-10/+11
| | | | | | | | | Tuning the deferred wakeup level must be done from a safe wakeup point. Currently those sites are:
| | | | | | | | | * ->nocb_timer
| | | | | | | | | * user/idle/guest entry
| | | | | | | | | * CPU down
| | | | | | | | | * softirq/rcuc
| | | | | | | | | All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE. In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early. To prepare for that, this commit specifies the per-callsite wakeup level/limit. Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> [ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Only cancel nocb timer if not pollingFrederic Weisbecker2021-05-121-7/+7
| | | | | | | | | This commit refrains from deleting the ->nocb_timer if rcu_nocb is polling because it should not ever have been queued in the polling case. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Delete bypass_timer upon nocb_gp wakeupFrederic Weisbecker2021-05-121-0/+2
| | | | | | | | | A NOCB-gp wakeup can safely delete the ->nocb_bypass_timer because nocb_gp_wait() will recheck the bypass state and rearm the bypass timer if necessary. This commit therefore deletes this timer. Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Cancel nocb_timer upon nocb_gp wakeupFrederic Weisbecker2021-05-121-0/+4
| | | | | | | | | When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer around because this function will traverse the whole rdp list. Any update performed before the timer was armed will now be visible after the ->nocb_gp_lock acquire. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Allow de-offloading rdp leaderFrederic Weisbecker2021-05-121-4/+0
| | | | | | | | | The only thing that prevented an rdp leader from being de-offloaded was the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader. If an rdp gets de-offloaded, it will subtly ignore rcu_nocb_lock() calls and do its job in the timer unsafely. Worse yet: If it gets re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to unlock, leaving it imbalanced. Now that the nocb_bypass_timer doesn't use the nocb_lock anymore, de-offloading the rdp leader is safe. This commit therefore allows the rdp leader to be de-offloaded. Reported-by: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Directly call __wake_nocb_gp() from bypass timerFrederic Weisbecker2021-05-121-2/+3
| | | | | | | | | The bypass timer calls __call_rcu_nocb_wake() instead of directly calling __wake_nocb_gp(). The only difference here is that rdp->qlen_last_fqs_check gets overridden. But resetting the deferred force quiescent state base shouldn't be relevant for that timer. In fact the bypass queue in question can be for any rdp from the group and not necessarily the rdp leader on which the bypass timer is attached. This commit therefore calls __wake_nocb_gp() directly. This way we don't even need to lock the ->nocb_lock. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | timer: Revert "timer: Add timer_curr_running()"Frederic Weisbecker2021-05-111-14/+0
| | | | | | | | | This reverts commit dcd42591ebb8a25895b551a5297ea9c24414ba54. The only user was RCU/nocb. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | | * | | | rcu/nocb: Use the rcuog CPU's ->nocb_timerFrederic Weisbecker2021-05-112-64/+77
| | | |_|/ / / /
| | |/| | | | |
| | | | | | | | Currently each CPU has its own ->nocb_timer queued when the nocb_gp wakeup must be deferred. This approach has many drawbacks, compared to a solution based on a single timer per NOCB group:
| | | | | | | | * There are a lot of timers to maintain.
| | | | | | | | * The per-rdp ->nocb_lock must be held to queue and cancel the timer and this lock can already be heavily contended.
| | | | | | | | * One timer firing doesn't cancel the other timers in the same group:
| | | | | | | |   - These other timers can thus cause spurious wakeups
| | | | | | | |   - Each rdp that queued a timer must lock both ->nocb_lock and then ->nocb_gp_lock upon exit from the kernel to idle/user/guest mode.
| | | | | | | | * We can't cancel all of them if we detect an unflushed bypass in nocb_gp_wait(). In fact currently we only ever cancel the ->nocb_timer of the leader group.
| | | | | | | | * The leader group's nocb_timer is cancelled without locking ->nocb_lock in nocb_gp_wait(). This currently appears to be safe but is an accident waiting to happen.
| | | | | | | | * Since the timer acquires ->nocb_lock, it requires extra care in the NOCB (de-)offloading process, requiring that it be either enabled or disabled and then flushed.
| | | | | | | | This commit instead uses the rcuog kthread's CPU's ->nocb_timer. It is protected by nocb_gp_lock, which is _way_ less contended and remains so even after this change. As a matter of fact, the nocb_timer almost never fires and the deferred wakeup is mostly carried out upon idle/user/guest entry. Now the early check performed at this point in do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which is of course racy. However, this raciness is harmless because we only need the guarantee that the timer is queued if we were the last one to queue it. Any other situation (another CPU has queued it and we either see it or not) is fine. This solves all the issues listed above. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
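The move from one timer per CPU to one timer per NOCB group, as described in the commit message above, can be illustrated with a small userspace model. This is a sketch under assumptions, not kernel code: the struct and function names are hypothetical, and the locking that the message attributes to ->nocb_gp_lock is omitted because the model is single-threaded. It shows that several CPUs deferring a wakeup share one armed timer, and that a single firing serves the whole group.

```c
#include <stdbool.h>

/* Hypothetical model of a NOCB group that owns the only deferred-wakeup timer. */
struct nocb_group_model {
    bool timer_armed;       /* is the group's single timer queued? */
    int timers_armed_total; /* how many times the timer was actually queued */
    int wakeups;            /* wakeups delivered to the group kthread */
};

/* Hypothetical per-CPU state: it only points at its group. */
struct nocb_cpu_model {
    struct nocb_group_model *group;
};

/* A CPU asks for a deferred wakeup. Only the first requester arms the timer;
 * later requesters find it already queued and do nothing more. */
static void cpu_defer_wakeup(struct nocb_cpu_model *cpu)
{
    struct nocb_group_model *g = cpu->group;

    if (!g->timer_armed) {
        g->timer_armed = true;
        g->timers_armed_total++;
    }
}

/* The group timer fires (or an idle/user/guest entry path runs the same logic):
 * one wakeup covers every CPU that deferred, and nothing is left to fire spuriously. */
static void group_timer_fire(struct nocb_group_model *g)
{
    if (!g->timer_armed)
        return;
    g->timer_armed = false;
    g->wakeups++;
}
```

With per-CPU timers, two CPUs deferring would queue two timers and could produce two wakeups; in this model they queue one timer and produce one wakeup.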
| | | | * | | | kvfree_rcu: Refactor kfree_rcu_monitor()Uladzislau Rezki (Sony)2021-05-111-58/+26
| | | | | | | | Currently we have three functions which depend on each other. Two of them are quite tiny, and the last one is where most of the work is done. All of them are related to queuing RCU batches to reclaim objects after a GP.
| | | | | | | | 1. kfree_rcu_monitor(). It consists of a few lines. It acquires a spin-lock and calls kfree_rcu_drain_unlock().
| | | | | | | | 2. kfree_rcu_drain_unlock(). It also consists of a few lines of code. It calls queue_kfree_rcu_work() to queue the batch. If this fails, it rearms the monitor work to try again later.
| | | | | | | | 3. queue_kfree_rcu_work(). This provides the bulk of the functionality, attempting to start a new batch to free objects after a GP.
| | | | | | | | Since there are no external users of functions [2] and [3], both can be eliminated by moving all logic directly into [1], which both shrinks and simplifies the code. Also convert comments that start with "/*" to the "//" format so that the style is uniform across the file. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
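The shape of the refactoring described above, three chained helpers folded into one monitor function, can be sketched in userspace C. Everything here is an illustrative assumption: the struct, its fields, and the function name are hypothetical, the spin-lock is reduced to a flag, and the "batch" is just a counter. The sketch only shows the merged control flow: take the lock, try to hand the pending objects to the worker, and rearm the monitor if that is not possible yet.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical model of the per-CPU reclaim state.
struct krc_model {
    bool locked;          // stand-in for the spin-lock
    bool worker_busy;     // previous batch still waiting for its grace period
    size_t pending;       // objects queued for reclaim, not yet batched
    size_t in_flight;     // objects in the batch handed to the worker
    bool monitor_rearmed; // monitor work scheduled to try again later
};

// Single merged monitor: lock, try to queue the batch, rearm on failure, unlock.
static void krc_monitor(struct krc_model *k)
{
    k->locked = true;

    if (k->pending > 0 && !k->worker_busy) {
        // Start a new batch: everything pending goes to the worker.
        k->in_flight = k->pending;
        k->pending = 0;
        k->worker_busy = true;
        k->monitor_rearmed = false;
    } else if (k->pending > 0) {
        // The previous batch is still in progress: try again later.
        k->monitor_rearmed = true;
    }

    k->locked = false;
}
```

Keeping the three steps in one function means the lock is taken and released in the same scope, which is the simplification the commit message points to.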
| | | | * | | | kvfree_rcu: Fix comments according to current codeUladzislau Rezki (Sony)2021-05-111-6/+9
| | | | | | | | The kvfree_rcu() function now defers allocations in the common case due to the fact that there is no lockless access to the memory-allocator caches/pools. In addition, in CONFIG_PREEMPT_NONE=y and in CONFIG_PREEMPT_VOLUNTARY=y kernels, there is no reliable way to determine if spinlocks are held. As a result, allocation is deferred in the common case, and the two-argument form of kvfree_rcu() thus uses the "channel 3" queue through all the rcu_head structures. This channel is referred to as the emergency case in comments, and these comments are now obsolete. This commit therefore updates these comments to reflect the new common-case nature of such emergencies. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>