path: root/arch/x86/include/asm/percpu.h
Commit log, newest first (each entry: subject, author, date, files changed, lines -/+):
* x86/percpu: Clean up <asm/percpu.h> vertical alignment details (Ingo Molnar, 2024-05-20; 1 file, -150/+171)

  - Fix/unify misc vertical alignment inconsistencies
  - Make CPP macros look a bit more like C code by adding an empty line
    after local variable declaration blocks, and before final rvalue
    statements.

  No change in code.

  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Uros Bizjak <ubizjak@gmail.com>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: linux-kernel@vger.kernel.org
* x86/percpu: Clean up <asm/percpu.h> a bit (Ingo Molnar, 2024-05-20; 1 file, -41/+50)

  - Fix misc typos.
  - There are four variants of the same spelling right now: 'per-CPU',
    'per CPU', 'percpu' and 'per-cpu'. Standardize on 'per-CPU' only.
  - s/makes gcc load /makes the compiler load /
  - Instead of:

        #ifdef CONFIG_XXXX
        #define YYYY FOO
        #else
        #define YYYY BAR
        #endif

    use the slightly more readable form of:

        #ifdef CONFIG_XXXX
        # define YYYY FOO
        #else
        # define YYYY BAR
        #endif

  - Standardize & expand '#else' and '#endif' comments.
  - Fix comment style.
  - Capitalize x86 instruction names in comments.

  No change in code.

  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Uros Bizjak <ubizjak@gmail.com>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: linux-kernel@vger.kernel.org
* x86/percpu: Move some percpu accessors around to reduce ifdeffery (Uros Bizjak, 2024-05-20; 1 file, -21/+19)

  Move some percpu accessors around, mainly to reduce ifdeffery and
  improve readability by following dependencies between accessors.

  No functional change intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240520080951.121049-2-ubizjak@gmail.com
* x86/percpu: Rename percpu_stable_op() to __raw_cpu_read_stable() (Uros Bizjak, 2024-05-20; 1 file, -6/+6)

  Rename percpu_stable_op() to __raw_cpu_read_stable() to be in line
  with the other read/write percpu accessors.

  No functional change intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Uros Bizjak <ubizjak@gmail.com>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240520080951.121049-1-ubizjak@gmail.com
* x86/percpu: Fix operand constraint modifier in __raw_cpu_write() (Uros Bizjak, 2024-05-18; 1 file, -1/+1)

  __raw_cpu_write() with the !USE_X86_SEG_SUPPORT config uses the
  read/write operand constraint modifier "+" for its memory location.
  This signals the compiler that the location is both read and written
  by the asm. This is not true, because the MOV insn only writes to the
  output. Correct the modifier to "=" to inform the compiler that the
  memory location is only written to. This also prevents the compiler
  from value-tracking the undefined value from the uninitialized memory.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240430091833.196482-5-ubizjak@gmail.com
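  A minimal sketch of the constraint change (illustrative only; the real
  macro dispatches on operand size and wraps the variable in the
  accessor helpers):

      /* Before: "+m" wrongly tells the compiler the slot is also read */
      asm("mov %[val], " __percpu_arg([var])
          : [var] "+m" (__my_cpu_var(pcp))
          : [val] "ri" (pto_val__));

      /* After: "=m" marks the location as write-only, matching MOV */
      asm("mov %[val], " __percpu_arg([var])
          : [var] "=m" (__my_cpu_var(pcp))
          : [val] "ri" (pto_val__));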
* x86/percpu: Introduce the __raw_cpu_read_const() macro (Uros Bizjak, 2024-05-18; 1 file, -10/+9)

  Introduce the __raw_cpu_read_const() macro to further reduce ifdeffery
  and differences between configs w/ and w/o USE_X86_SEG_SUPPORT.

  No functional change intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240430091833.196482-4-ubizjak@gmail.com
* x86/percpu: Unify percpu read-write accessors (Uros Bizjak, 2024-05-18; 1 file, -47/+25)

  Redefine percpu_from_op() and percpu_to_op() as __raw_cpu_read() and
  __raw_cpu_write(). Unify the __raw_cpu_{read,write}() macros between
  configs w/ and w/o USE_X86_SEG_SUPPORT in order to unify the
  {raw,this}_cpu_{read,write}_N() accessors between configs.

  No functional change intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240430091833.196482-3-ubizjak@gmail.com
* x86/percpu: Move some percpu macros around for readability (Uros Bizjak, 2024-05-18; 1 file, -29/+34)

  Move some percpu macros around to make a follow-up patch more
  readable.

  No functional change intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240430091833.196482-2-ubizjak@gmail.com
* x86/percpu: Introduce the pcpu_binary_op() macro (Uros Bizjak, 2024-05-18; 1 file, -17/+30)

  Introduce the pcpu_binary_op() macro, a copy of the percpu_to_op()
  macro. Update percpu binary operators to use the new macro, since
  percpu_to_op() will be re-purposed as a raw percpu write accessor
  in a follow-up patch.

  No functional change intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240430091833.196482-1-ubizjak@gmail.com
* x86/percpu: Introduce raw_cpu_read_long() to reduce ifdeffery (Uros Bizjak, 2024-04-06; 1 file, -8/+6)

  Introduce the raw_cpu_read_long() macro to slightly reduce ifdeffery
  in <asm/percpu.h>.

  No functional changes intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240404094218.448963-3-ubizjak@gmail.com
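  The resulting macro is a thin word-size dispatch; roughly (a sketch
  matching the shape described above):

      #ifdef CONFIG_X86_64
      #define raw_cpu_read_long(pcp)	raw_cpu_read_8(pcp)
      #else
      #define raw_cpu_read_long(pcp)	raw_cpu_read_4(pcp)
      #endif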
* x86/percpu: Rewrite x86_this_cpu_test_bit() and friends as macros (Uros Bizjak, 2024-04-06; 1 file, -31/+23)

  Rewrite the whole family of x86_this_cpu_test_bit() functions as
  macros, so the standard __my_cpu_var() and raw_cpu_read() macros can
  be used on percpu variables. This approach considerably simplifies
  the implementation of the functions and also introduces standard
  checks on the accessed percpu variables.

  No functional changes intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240404094218.448963-2-ubizjak@gmail.com
* x86/percpu: Fix x86_this_cpu_variable_test_bit() asm template (Uros Bizjak, 2024-04-06; 1 file, -2/+3)

  Fix x86_this_cpu_variable_test_bit(), which was implemented with an
  incorrect asm template in which argument 2 (the count argument) was
  treated as a percpu variable. However, x86_this_cpu_test_bit() is
  currently used exclusively with a constant bit-number argument, so the
  called x86_this_cpu_variable_test_bit() function is never
  instantiated. The fix introduces named assembler operands to prevent
  this kind of error.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: "H. Peter Anvin" <hpa@zytor.com>
  Link: https://lore.kernel.org/r/20240404094218.448963-1-ubizjak@gmail.com
* x86/percpu: Use __force to cast from __percpu address space (Uros Bizjak, 2024-04-03; 1 file, -3/+3)

  Fix a Sparse warning when casting from the __percpu address space by
  using __force in the cast. x86 named address spaces are not considered
  to be subspaces of the generic (flat) address space, so explicit casts
  are required to convert pointers between these address spaces and the
  generic address space (the application should cast to uintptr_t and
  apply the segment base offset). The cast to uintptr_t removes the
  __percpu address space tag and Sparse reports:

    warning: cast removes address space '__percpu' of expression

  Use __force to inform Sparse that the cast is intentional.

  Fixes: 9a462b9eafa6 ("x86/percpu: Use compiler segment prefix qualifier")
  Reported-by: Charlemagne Lasse <charlemagnelasse@gmail.com>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240402175058.52649-1-ubizjak@gmail.com
  Closes: https://lore.kernel.org/lkml/CAFGhKbzev7W4aHwhFPWwMZQEHenVgZUj7=aunFieVqZg3mt14A@mail.gmail.com/
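  The casting helpers end up looking roughly like this (a sketch;
  __my_cpu_type()/__my_cpu_ptr() are the named-address-space access
  helpers this file uses):

      #define __my_cpu_type(var)	typeof(var) __percpu_seg_override
      #define __my_cpu_ptr(ptr)	(__my_cpu_type(*(ptr)) *)(__force uintptr_t)(ptr)
      #define __my_cpu_var(var)	(*__my_cpu_ptr(&(var)))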
* x86/percpu: Do not use this_cpu_read_stable_8() for 32-bit targets (Uros Bizjak, 2024-03-25; 1 file, -4/+9)

  The this_cpu_read_stable() macro uses __pcpu_size_call_return(), which
  unconditionally calls this_cpu_read_stable_8() also for 32-bit
  targets. This usage is invalid, as it results in the generation of a
  64-bit MOVQ instruction on 32-bit targets via the percpu_stable_op()
  macro.

  Since there is no generic support for this_cpu_read_stable_8() on
  32-bit targets, the patch defines this_cpu_read_stable_8() to
  BUILD_BUG() when CONFIG_X86_64 is not defined. This way, we are sure
  that this_cpu_read_stable_8() won't actually be used for 32-bit
  targets, but it is still defined to prevent build failure.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Ard Biesheuvel <ardb@kernel.org>
  Link: https://lore.kernel.org/r/20240324212014.310189-1-ubizjak@gmail.com
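  The guard pattern, roughly (sketch):

      #ifdef CONFIG_X86_64
      #define this_cpu_read_stable_8(pcp)	percpu_stable_op(8, "mov", pcp)
      #else
      /* Never instantiated on 32-bit; defined only to keep the build happy */
      #define this_cpu_read_stable_8(pcp)	({ BUILD_BUG(); (typeof(pcp))0; })
      #endif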
* x86/percpu: Unify arch_raw_cpu_ptr() defines (Uros Bizjak, 2024-03-22; 1 file, -24/+18)

  When building a 32-bit vDSO for a 64-bit kernel, games are played with
  CONFIG_X86_64. The {this,raw}_cpu_read_8() macros are conditionally
  defined on CONFIG_X86_64, and when CONFIG_X86_64 is undefined in
  fake_32bit_build.h, various build failures in generic percpu header
  files can happen. To make things worse, the build of the 32-bit vDSO
  for a 64-bit kernel grew a dependency on the arch_raw_cpu_ptr() macro,
  and the build fails if arch_raw_cpu_ptr() is not defined.

  To mitigate these issues, x86 carefully defines arch_raw_cpu_ptr() to
  avoid any dependency on raw_cpu_read_8() and thus CONFIG_X86_64. W/o
  segment register support, the definition uses the size-agnostic MOV
  asm mnemonic and hopes that the _ptr argument won't ever be 64-bit
  sized on 32-bit targets (although newer GCCs warn for this situation
  with "unsupported size for integer register"), and w/ segment register
  support the definition uses the size-agnostic __raw_cpu_read() macro.

  Fortunately, raw_cpu_read() is not used in the 32-bit vDSO for a
  64-bit kernel. However, we can't simply omit the definition of
  arch_raw_cpu_ptr(), since the build will fail when building
  vdso/vdso32/vclock_gettime.o. The patch defines arch_raw_cpu_ptr() to
  BUILD_BUG() when the BUILD_VDSO32_64 macro is defined. This way, we
  are sure that arch_raw_cpu_ptr() won't actually be used in the 32-bit
  vDSO for a 64-bit kernel, but it is still defined to prevent build
  failure.

  Finally, we can unify arch_raw_cpu_ptr() between builds w/ and w/o
  x86 segment register support, substituting two tricky macro
  definitions with a straightforward implementation.

  There is no size difference and no difference in the number of
  this_cpu_off accesses between the patched and unpatched kernel when
  the kernel is built either w/ or w/o segment register support.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240322102730.209141-1-ubizjak@gmail.com
* x86/percpu: Move raw_percpu_xchg_op() to a better place (Uros Bizjak, 2024-03-20; 1 file, -12/+11)

  Move raw_percpu_xchg_op() together with this_percpu_xchg_op() and
  trivially rename some internal variables to harmonize them between
  macro implementations.

  No functional changes intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: "H. Peter Anvin" <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240320083127.493250-2-ubizjak@gmail.com
* x86/percpu: Convert this_percpu_xchg_op() from asm() to C code, to generate better code (Uros Bizjak, 2024-03-20; 1 file, -21/+11)

  Rewrite percpu_xchg_op() using generic percpu primitives instead of
  asm. The new implementation is similar to local_xchg() and allows the
  compiler to perform various optimizations: e.g. the compiler is able
  to create a fast path through the loop, according to likely/unlikely
  annotations in percpu_try_cmpxchg_op().

  No functional changes intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: "H. Peter Anvin" <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: https://lore.kernel.org/r/20240320083127.493250-1-ubizjak@gmail.com
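  The C form is a read followed by a try-cmpxchg loop; roughly (sketch):

      #define this_percpu_xchg_op(_var, _nval)				\
      ({								\
          typeof(_var) pxo_old__ = this_cpu_read(_var);			\
          do {								\
          } while (!this_cpu_try_cmpxchg(_var, &pxo_old__, _nval));	\
          pxo_old__;							\
      })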
* Merge tag 'v6.8-rc4' into x86/percpu, to resolve conflicts and refresh the branch (Ingo Molnar, 2024-02-14; 1 file, -1/+1)

  Conflicts:
    arch/x86/include/asm/percpu.h
    arch/x86/include/asm/text-patching.h

  Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Kill unnecessary kernel.h include (Kent Overstreet, 2023-12-27; 1 file, -1/+1)

  More trimming down of unnecessary includes.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* x86/percpu: Avoid sparse warning with cast to named address space (Uros Bizjak, 2023-12-11; 1 file, -0/+5)

  Teach sparse about the __seg_fs and __seg_gs named address space
  qualifiers, to avoid warnings about an unexpected keyword at the end
  of a cast operator.

  Reported-by: kernel test robot <lkp@intel.com>
  Acked-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Link: https://lore.kernel.org/r/20231204210320.114429-3-ubizjak@gmail.com
  Closes: https://lore.kernel.org/oe-kbuild-all/202310080853.UhMe5iWa-lkp@intel.com/
* | x86/percpu: Fix "const_pcpu_hot" version generation failureUros Bizjak2023-12-111-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Version generation for "const_pcpu_hot" symbol failed because genksyms doesn't know the __seg_gs keyword. Effectively revert commit 4604c052b84d ("x86/percpu: Declare const_pcpu_hot as extern const variable") and use this_cpu_read_const() instead to avoid "sparse: dereference of noderef expression" warning when reading const_pcpu_hot. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231204210320.114429-1-ubizjak@gmail.com
* x86/percpu: Define PER_CPU_VAR macro also for !__ASSEMBLY__ (Uros Bizjak, 2023-11-30; 1 file, -0/+5)

  Some C source files define 'asm' statements that use PER_CPU_VAR(),
  so make the PER_CPU_VAR() macro available also without __ASSEMBLY__.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Link: https://lore.kernel.org/r/20231105213731.1878100-2-ubizjak@gmail.com
* x86/percpu: Introduce const-qualified const_pcpu_hot to micro-optimize code generation (Uros Bizjak, 2023-10-23; 1 file, -3/+3)

  Some variables in pcpu_hot, currently current_task and top_of_stack,
  are actually per-thread variables implemented as per-CPU variables,
  and are thus stable for the duration of the respective task. There is
  already an attempt to eliminate redundant reads from these variables
  using the this_cpu_read_stable() asm macro, which hides the dependency
  on the read memory address. However, the compiler has limited ability
  to eliminate asm common subexpressions, so this approach results in
  limited success.

  The solution is to allow more aggressive elimination by aliasing
  pcpu_hot into a const-qualified const_pcpu_hot, and to read stable
  per-CPU variables from this constant copy.

  The current per-CPU infrastructure does not support reads from
  const-qualified variables. However, when the compiler supports segment
  qualifiers, it is possible to declare the const-aliased variable in
  the relevant named address space. The compiler considers access to a
  variable declared in this way as a read from a constant location, and
  will optimize reads from the variable accordingly.

  By implementing the const-qualified const_pcpu_hot, the compiler can
  eliminate redundant reads from the constant variables, reducing the
  number of loads from current_task from 3766 to 3217 on a test build,
  a -14.6% reduction.

  The reduction of loads translates to the following code savings:

          text       data    bss      dec      hex  filename
    25,477,353    4389456 808452 30675261  1d4113d  vmlinux-old.o
    25,476,074    4389440 808452 30673966  1d40c2e  vmlinux-new.o

  representing a code size reduction of -1279 bytes.

  [ mingo: Updated the changelog, EXPORT(const_pcpu_hot). ]

  Co-developed-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Link: https://lore.kernel.org/r/20231020162004.135244-1-ubizjak@gmail.com
* x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR() (Uros Bizjak, 2023-10-20; 1 file, -4/+8)

  Introduce x86_64 %rip-relative addressing to the PER_CPU_VAR() macro.
  Instructions using a %rip-relative address operand are one byte
  shorter than their absolute-address counterparts and are also
  compatible with position independent executable (-fpie) builds. The
  patch reduces the code size of a test kernel build by 150 bytes.

  The PER_CPU_VAR() macro is intended to be applied to a symbol and
  should not be used with register operands. Introduce the new __percpu
  macro and use it in cmpxchg{8,16}b_emu.S instead. Also add a missing
  function comment to this_cpu_cmpxchg8b_emu().

  No functional changes intended.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: linux-kernel@vger.kernel.org
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Sean Christopherson <seanjc@google.com>
* x86/percpu: Use the correct asm operand modifier in percpu_stable_op() (Uros Bizjak, 2023-10-18; 1 file, -2/+2)

  The "P" asm operand modifier is an x86 target-specific modifier. When
  used for a constant, it drops all syntax-specific prefixes and issues
  the bare constant. This modifier is not correct for address handling;
  in this case a generic "a" operand modifier should be used.

  The "a" asm operand modifier substitutes a memory reference, with the
  actual operand treated as an address. For x86_64, when a symbol is
  provided, the "a" modifier emits "sym(%rip)" instead of "sym",
  enabling shorter %rip-relative addressing. Clang allows only "i" and
  "r" operand constraints with an "a" modifier, so the patch normalizes
  the modifier/constraint pair to "a"/"i", which is consistent between
  both compilers.

  The patch reduces the code size of a test build by 4072 bytes:

        text     data    bss      dec      hex  filename
    25523268  4388300 808452 30720020  1d4c014  vmlinux-old.o
    25519196  4388300 808452 30715948  1d4b02c  vmlinux-new.o

  [ mingo: Changelog clarity. ]

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: Denys Vlasenko <dvlasenk@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Uros Bizjak <ubizjak@gmail.com>
  Cc: Sean Christopherson <seanjc@google.com>
  Link: https://lore.kernel.org/r/20231016200755.287403-1-ubizjak@gmail.com
* x86/percpu: Use C for arch_raw_cpu_ptr(), to improve code generation (Uros Bizjak, 2023-10-16; 1 file, -0/+17)

  Implement arch_raw_cpu_ptr() in C to allow the compiler to perform
  better optimizations, such as setting an appropriate base to compute
  the address. The compiler is free to choose either MOV or ADD from
  the this_cpu_off address to construct the optimal final address.

  There are some other issues when memory access to the percpu area is
  implemented with asm. Compilers can not eliminate asm common
  subexpressions over basic-block boundaries, but are extremely good at
  optimizing memory access. By implementing arch_raw_cpu_ptr() in C,
  the compiler can eliminate additional redundant loads from
  this_cpu_off, further reducing the number of percpu offset reads from
  1646 to 1631 on a test build, a -0.9% reduction.

  Co-developed-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: Denys Vlasenko <dvlasenk@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Uros Bizjak <ubizjak@gmail.com>
  Cc: Sean Christopherson <seanjc@google.com>
  Link: https://lore.kernel.org/r/20231015202523.189168-2-ubizjak@gmail.com
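  Roughly, the C implementation computes the pointer from this_cpu_off
  in plain C (a sketch for the segment-qualifier config; the exact
  __raw_cpu_read() signature of this era is an assumption):

      #define arch_raw_cpu_ptr(ptr)					\
      ({								\
          unsigned long tcp_ptr__;					\
          tcp_ptr__ = __raw_cpu_read(, this_cpu_off);			\
          tcp_ptr__ += (unsigned long)(ptr);				\
          (typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
      })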
* x86/percpu: Rewrite arch_raw_cpu_ptr() to be easier for compilers to optimize (Uros Bizjak, 2023-10-16; 1 file, -2/+4)

  Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then add
  the ptr value to the base. This way, the compiler can propagate the
  addend to the following instruction and simplify address calculation.

  E.g.: address calculation in amd_pmu_enable_virt() improves from:

    48 c7 c0 00 00 00 00    mov    $0x0,%rax
            87b7: R_X86_64_32S      cpu_hw_events
    65 48 03 05 00 00 00    add    %gs:0x0(%rip),%rax
    00
            87bf: R_X86_64_PC32     this_cpu_off-0x4
    48 c7 80 28 13 00 00    movq   $0x0,0x1328(%rax)
    00 00 00 00

  to:

    65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax
    00
            8798: R_X86_64_PC32     this_cpu_off-0x4
    48 c7 80 00 00 00 00    movq   $0x0,0x0(%rax)
    00 00 00 00
            87a6: R_X86_64_32S      cpu_hw_events+0x1328

  The compiler also eliminates additional redundant loads from
  this_cpu_off, reducing the number of percpu offset reads from 1668 to
  1646 on a test build, a -1.3% reduction.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: Denys Vlasenko <dvlasenk@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Uros Bizjak <ubizjak@gmail.com>
  Cc: Sean Christopherson <seanjc@google.com>
  Link: https://lore.kernel.org/r/20231015202523.189168-1-ubizjak@gmail.com
* x86/percpu: Use C for percpu read/write accessors (Uros Bizjak, 2023-10-05; 1 file, -11/+54)

  The percpu code mostly uses inline assembly. Using segment qualifiers
  allows the use of C code instead, which enables the compiler to
  perform various optimizations (e.g. propagation of memory arguments).
  Convert percpu read and write accessors to C code, so the memory
  argument can be propagated to the instruction that uses this
  argument.

  Some examples of propagations:

  a) into sign/zero extensions: the code improves from:

      65 8a 05 00 00 00 00    mov    %gs:0x0(%rip),%al
      0f b6 c0                movzbl %al,%eax

    to:

      65 0f b6 05 00 00 00    movzbl %gs:0x0(%rip),%eax
      00

    and in a similar way for:

      movzbl %gs:0x0(%rip),%edx
      movzwl %gs:0x0(%rip),%esi
      movzbl %gs:0x78(%rbx),%eax
      movslq %gs:0x0(%rip),%rdx
      movslq %gs:(%rdi),%rbx

  b) into compares: the code improves from:

      65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax
      a9 00 00 0f 00          test   $0xf0000,%eax

    to:

      65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
      00 00 0f 00

    and in a similar way for:

      testl $0xf0000,%gs:0x0(%rip)
      testb $0x1,%gs:0x0(%rip)
      testl $0xff00,%gs:0x0(%rip)
      cmpb  $0x0,%gs:0x0(%rip)
      cmp   %gs:0x0(%rip),%r14d
      cmpw  $0x8,%gs:0x0(%rip)
      cmpb  $0x0,%gs:(%rax)

  c) into other insns: the code improves from:

      1a355:  83 fa ff                cmp    $0xffffffff,%edx
      1a358:  75 07                   jne    1a361 <...>
      1a35a:  65 8b 15 00 00 00 00    mov    %gs:0x0(%rip),%edx
      1a361:

    to:

      1a35a:  83 fa ff                cmp    $0xffffffff,%edx
      1a35d:  65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
      1a364:  00

  The above propagations result in the following code size improvements
  for the current mainline kernel (with the default config), compiled
  with:

    # gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

        text     data    bss      dec  filename
    25508862  4386540 808388 30703790  vmlinux-vanilla.o
    25500922  4386532 808388 30695842  vmlinux-new.o

  Co-developed-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: Denys Vlasenko <dvlasenk@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Link: https://lore.kernel.org/r/20231004192404.31733-1-ubizjak@gmail.com
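  The converted accessors boil down to plain C dereferences through the
  named address space; a simplified sketch, assuming the
  segment-qualifier machinery from the commit below (the log is
  newest-first):

      #define __raw_cpu_read(qual, pcp)					\
      ({								\
          *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));		\
      })

      #define __raw_cpu_write(qual, pcp, val)				\
      do {								\
          *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val);	\
      } while (0)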
* x86/percpu: Use compiler segment prefix qualifier (Nadav Amit, 2023-10-05; 1 file, -22/+46)

  Using a segment prefix qualifier is cleaner than using a segment
  prefix in the inline assembly, and provides the compiler with more
  information, telling it that __seg_gs:[addr] is different than [addr]
  when it analyzes data dependencies. It also enables various
  optimizations that will be implemented in the next patches.

  Use segment prefix qualifiers when they are supported. Unfortunately,
  gcc does not provide a way to remove segment qualifiers, which is
  needed to use typeof() to create local instances of the per-CPU
  variable. For this reason, do not use the segment qualifier for
  per-CPU variables, and do the casting using the segment qualifier
  instead.

  Uros: Improve compiler support detection and update the patch to the
  current mainline.

  Signed-off-by: Nadav Amit <namit@vmware.com>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: Denys Vlasenko <dvlasenk@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Link: https://lore.kernel.org/r/20231004145137.86537-4-ubizjak@gmail.com
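  The gist, as a sketch (the exact Kconfig gate name used here is an
  assumption):

      #ifdef CONFIG_CC_HAS_NAMED_AS		/* assumed gate name */
      # ifdef CONFIG_X86_64
      #  define __percpu_seg_override	__seg_gs
      # else
      #  define __percpu_seg_override	__seg_fs
      # endif
      # define __percpu_prefix	""
      #else
      # define __percpu_seg_override
      # define __percpu_prefix	"%%"__stringify(__percpu_seg)":"
      #endif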
* x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op (Uros Bizjak, 2023-09-21; 1 file, -12/+16)

  The fallback alternative uses the %rsi register to manually load the
  pointer to the percpu variable before the call to the emulation
  function. This is suboptimal, because the load is hidden from the
  compiler.

  Move the load of %rsi outside the inline asm, so the compiler can
  reuse the value. The code in slub.o improves from:

    55ac:  49 8b 3c 24             mov    (%r12),%rdi
    55b0:  48 8d 4a 40             lea    0x40(%rdx),%rcx
    55b4:  49 8b 1c 07             mov    (%r15,%rax,1),%rbx
    55b8:  4c 89 f8                mov    %r15,%rax
    55bb:  48 8d 37                lea    (%rdi),%rsi
    55be:  e8 00 00 00 00          callq  55c3 <...>
           55bf: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
    55c3:  75 a3                   jne    5568 <...>
    55c5:  ...

    0000000000000000 <.altinstr_replacement>:
    5:     65 48 0f c7 0f          cmpxchg16b %gs:(%rdi)

  to:

    55ac:  49 8b 34 24             mov    (%r12),%rsi
    55b0:  48 8d 4a 40             lea    0x40(%rdx),%rcx
    55b4:  49 8b 1c 07             mov    (%r15,%rax,1),%rbx
    55b8:  4c 89 f8                mov    %r15,%rax
    55bb:  e8 00 00 00 00          callq  55c0 <...>
           55bc: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
    55c0:  75 a6                   jne    5568 <...>
    55c2:  ...

  where the alternative replacement instruction now uses %rsi:

    0000000000000000 <.altinstr_replacement>:
    5:     65 48 0f c7 0e          cmpxchg16b %gs:(%rsi)

  The instruction (effectively a reg-reg move) at 55bb: in the original
  assembly is removed. Also, both the CALL and the replacement
  CMPXCHG16B are 5 bytes long, removing the need for NOPs in the asm
  code.

  Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Link: https://lore.kernel.org/r/20230918151452.62344-1-ubizjak@gmail.com
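  After the change, the per-CPU address is an ordinary asm input pinned
  to %rsi via the "S" constraint; a hedged sketch of the 16-byte
  variant:

      asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",
                            "cmpxchg16b " __percpu_arg([var]),
                            X86_FEATURE_CX16)
                : [var] "+m" (_var), "+a" (old_low), "+d" (old_high)
                : "b" (new_low), "c" (new_high),
                  "S" (&(_var))	/* computed in C, visible to the compiler */
                : "memory");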
* x86/percpu: Define raw_cpu_try_cmpxchg() and this_cpu_try_cmpxchg() (Uros Bizjak, 2023-09-15; 1 file, -0/+27)

  Define the target-specific raw_cpu_try_cmpxchg_N() and
  this_cpu_try_cmpxchg_N() macros. These definitions override the
  generic fallback definitions and enable target-specific optimized
  implementations.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Link: https://lore.kernel.org/r/20230830151623.3900-1-ubizjak@gmail.com
* x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128} (Uros Bizjak, 2023-09-15; 1 file, -0/+67)

  Define the target-specific {raw,this}_cpu_try_cmpxchg64() and
  {raw,this}_cpu_try_cmpxchg128() macros. These definitions override
  the generic fallback definitions and enable target-specific optimized
  implementations.

  Several places in mm/slub.o improve from e.g.:

    53bc:  48 8d 4f 40             lea    0x40(%rdi),%rcx
    53c0:  48 89 fa                mov    %rdi,%rdx
    53c3:  49 8b 5c 05 00          mov    0x0(%r13,%rax,1),%rbx
    53c8:  4c 89 e8                mov    %r13,%rax
    53cb:  49 8d 30                lea    (%r8),%rsi
    53ce:  e8 00 00 00 00          call   53d3 <...>
           53cf: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
    53d3:  48 31 d7                xor    %rdx,%rdi
    53d6:  4c 31 e8                xor    %r13,%rax
    53d9:  48 09 c7                or     %rax,%rdi
    53dc:  75 ae                   jne    538c <...>

  to:

    53bc:  48 8d 4a 40             lea    0x40(%rdx),%rcx
    53c0:  49 8b 1c 07             mov    (%r15,%rax,1),%rbx
    53c4:  4c 89 f8                mov    %r15,%rax
    53c7:  48 8d 37                lea    (%rdi),%rsi
    53ca:  e8 00 00 00 00          call   53cf <...>
           53cb: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
    53cf:  75 bb                   jne    538c <...>

  reducing the size of mm/slub.o by 80 bytes:

     text  data  bss    dec   hex  filename
    39758  5337 4208  49303  c097  slub-new.o
    39838  5337 4208  49383  c0e7  slub-old.o

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Link: https://lore.kernel.org/r/20230906185941.53527-1-ubizjak@gmail.com
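  Typical usage shape, with hypothetical names (demo_state, demo_swap)
  purely for illustration:

      /* 16-byte per-CPU slot; CMPXCHG16B requires 16-byte alignment */
      static DEFINE_PER_CPU_ALIGNED(u128, demo_state);

      static bool demo_swap(u128 old, u128 new)
      {
          /* Installs 'new' iff the slot still holds 'old'; on failure,
             'old' is updated with the value actually found. */
          return this_cpu_try_cmpxchg128(demo_state, &old, new);
      }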
* arch: Remove cmpxchg_double (Peter Zijlstra, 2023-06-05; 1 file, -42/+0)

  No moar users, remove the monster.

  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Reviewed-by: Arnd Bergmann <arnd@arndb.de>
  Reviewed-by: Mark Rutland <mark.rutland@arm.com>
  Acked-by: Heiko Carstens <hca@linux.ibm.com>
  Tested-by: Mark Rutland <mark.rutland@arm.com>
  Link: https://lore.kernel.org/r/20230531132323.991907085@infradead.org
* percpu: Wire up cmpxchg128 (Peter Zijlstra, 2023-06-05; 1 file, -6/+68)

  In order to replace cmpxchg_double() with the newly minted
  cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().

  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Reviewed-by: Mark Rutland <mark.rutland@arm.com>
  Tested-by: Mark Rutland <mark.rutland@arm.com>
  Link: https://lore.kernel.org/r/20230531132323.654945124@infradead.org
* x86/percpu: Remove volatile from arch_raw_cpu_ptr() (Sebastian Andrzej Siewior, 2022-04-05; 1 file, -3/+3)

  The volatile attribute in the inline assembly of arch_raw_cpu_ptr()
  forces the compiler to always generate the code, even if the compiler
  can decide upfront that its result is not needed. For instance,
  invoking __intel_pmu_disable_all(false) (like
  intel_pmu_snapshot_arch_branch_stack() does) leads to loading the
  address of &cpu_hw_events into a register while the compiler knows
  that it has no need for it. This ends up with code like:

    | movq $cpu_hw_events, %rax			#, tcp_ptr__
    | add %gs:this_cpu_off(%rip), %rax		# this_cpu_off, tcp_ptr__
    | xorl %eax, %eax				# tmp93

  It also creates additional code within local_lock() with
  !RT && !LOCKDEP which is not desired.

  By removing the volatile attribute the compiler can place the
  function freely and avoid it if it is not needed in the end. By using
  the function twice the compiler properly caches only the variable
  offset and always loads the CPU offset.

  this_cpu_ptr() also remains properly placed within a
  preempt_disable() section because:

   - the arch_raw_cpu_ptr() assembly has a memory input
     ("m" (this_cpu_off))
   - preempt_{dis,en}able() fundamentally has a 'barrier()' in it

  Therefore this_cpu_ptr() is already properly serialized and does not
  rely on the 'volatile' attribute.

  Remove volatile from arch_raw_cpu_ptr().

  [ bigeasy: Added Linus' explanation why this_cpu_ptr() is not moved
    out of a preempt_disable() section without the 'volatile'
    attribute. ]

  Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Link: https://lore.kernel.org/r/20220328145810.86783-2-bigeasy@linutronix.de
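  The accessor after the change, roughly (a sketch; only the volatile
  keyword is dropped, the "m" (this_cpu_off) memory input remains):

      #define arch_raw_cpu_ptr(ptr)				\
      ({							\
          unsigned long tcp_ptr__;				\
          asm ("add " __percpu_arg(1) ", %0"			\
               : "=r" (tcp_ptr__)				\
               : "m" (this_cpu_off), "0" (ptr));		\
          (typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
      })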
* x86/percpu: Remove unused PER_CPU() macro (Brian Gerst, 2020-07-23; 1 file, -18/+0)

  Also remove the now-unused __percpu_mov_op.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-11-ndesaulniers@google.com
* x86/percpu: Clean up percpu_stable_op() (Brian Gerst, 2020-07-23; 1 file, -29/+12)

  Use __pcpu_size_call_return() to simplify this_cpu_read_stable().
  Also remove __bad_percpu_size(), which is now unused.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-10-ndesaulniers@google.com
* x86/percpu: Clean up percpu_cmpxchg_op() (Brian Gerst, 2020-07-23; 1 file, -40/+18)

  The core percpu macros already have a switch on the data size, so the
  switch in the x86 code is redundant and produces more dead code. Also
  use appropriate types for the width of the instructions. This avoids
  errors when compiling with Clang.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-9-ndesaulniers@google.com
* x86/percpu: Clean up percpu_xchg_op() (Brian Gerst, 2020-07-23; 1 file, -43/+18)

  The core percpu macros already have a switch on the data size, so the
  switch in the x86 code is redundant and produces more dead code. Also
  use appropriate types for the width of the instructions. This avoids
  errors when compiling with Clang.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-8-ndesaulniers@google.com
* x86/percpu: Clean up percpu_add_return_op() (Brian Gerst, 2020-07-23; 1 file, -35/+16)

  The core percpu macros already have a switch on the data size, so the
  switch in the x86 code is redundant and produces more dead code. Also
  use appropriate types for the width of the instructions. This avoids
  errors when compiling with Clang.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-7-ndesaulniers@google.com
* x86/percpu: Remove "e" constraint from XADDBrian Gerst2020-07-231-1/+1
| | | | | | | | | | | | | | | | | The "e" constraint represents a constant, but the XADD instruction doesn't accept immediate operands. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dennis Zhou <dennis@kernel.org> Link: https://lkml.kernel.org/r/20200720204925.3654302-6-ndesaulniers@google.com
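  Illustrative before/after of the 64-bit case (a sketch of the
  percpu_add_return_op() constraint):

      /* Before: "re" also permits a sign-extended constant,
         which XADD cannot encode */
      asm qual ("xaddq %0, " __percpu_arg(1)
                : "+re" (paro_ret__), "+m" (var) : : "memory");

      /* After: register only */
      asm qual ("xaddq %0, " __percpu_arg(1)
                : "+r" (paro_ret__), "+m" (var) : : "memory");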
* x86/percpu: Clean up percpu_add_op() (Brian Gerst, 2020-07-23; 1 file, -77/+22)

  The core percpu macros already have a switch on the data size, so the
  switch in the x86 code is redundant and produces more dead code. Also
  use appropriate types for the width of the instructions. This avoids
  errors when compiling with Clang.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-5-ndesaulniers@google.com
* x86/percpu: Clean up percpu_from_op() (Brian Gerst, 2020-07-23; 1 file, -35/+15)

  The core percpu macros already have a switch on the data size, so the
  switch in the x86 code is redundant and produces more dead code. Also
  use appropriate types for the width of the instructions. This avoids
  errors when compiling with Clang.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-4-ndesaulniers@google.com
* x86/percpu: Clean up percpu_to_op() (Brian Gerst, 2020-07-23; 1 file, -55/+35)

  The core percpu macros already have a switch on the data size, so the
  switch in the x86 code is redundant and produces more dead code. Also
  use appropriate types for the width of the instructions. This avoids
  errors when compiling with Clang.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-3-ndesaulniers@google.com
* x86/percpu: Introduce size abstraction macros (Brian Gerst, 2020-07-23; 1 file, -0/+30)

  In preparation for cleaning up the percpu operations, define macros
  for abstraction based on the width of the operation.

  Signed-off-by: Brian Gerst <brgerst@gmail.com>
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Tested-by: Nick Desaulniers <ndesaulniers@google.com>
  Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
  Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Acked-by: Dennis Zhou <dennis@kernel.org>
  Link: https://lkml.kernel.org/r/20200720204925.3654302-2-ndesaulniers@google.com
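  The abstractions map an operand width to a type, a cast, and an
  instruction/register spelling; a representative subset (sketch):

      #define __pcpu_type_1	u8
      #define __pcpu_type_2	u16
      #define __pcpu_type_4	u32
      #define __pcpu_type_8	u64

      #define __pcpu_cast_1(val)	((u8)(((unsigned long) val) & 0xff))
      #define __pcpu_cast_4(val)	((u32)(((unsigned long) val) & 0xffffffff))

      #define __pcpu_op2_1(op, src, dst)	op "b " src ", " dst
      #define __pcpu_op2_4(op, src, dst)	op "l " src ", " dst

      #define __pcpu_reg_1(mod, x)	mod "q" (x)
      #define __pcpu_reg_4(mod, x)	mod "r" (x)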
* x86/percpu: Optimize raw_cpu_xchg() (Peter Zijlstra, 2019-06-17; 1 file, -4/+16)

  Since raw_cpu_xchg() doesn't need to be IRQ-safe like
  this_cpu_xchg(), we can use a simple load-store instead of the
  cmpxchg loop.

  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
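  Since no IRQ atomicity is required, the raw variant can be an
  ordinary read-then-write; roughly (sketch):

      #define raw_percpu_xchg_op(var, nval)			\
      ({							\
          typeof(var) pxo_ret__ = raw_cpu_read(var);		\
          raw_cpu_write(var, (nval));				\
          pxo_ret__;						\
      })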
* x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() (Peter Zijlstra, 2019-06-17; 1 file, -112/+112)

  Nadav Amit reported that commit:

    b59167ac7baf ("x86/percpu: Fix this_cpu_read()")

  added a bunch of constraints to all sorts of code; and while some of
  that was correct and desired, some of that seems superfluous.

  The thing is, the this_cpu_*() operations are defined IRQ-safe; this
  means the values are subject to change from IRQs, and thus must be
  reloaded.

  Also, the generic form:

    local_irq_save()
    __this_cpu_read()
    local_irq_restore()

  would not allow the re-use of previous values; if by nothing else,
  then the barrier()s implied by local_irq_*().

  Which raises the point that percpu_from_op() and the others also need
  that volatile.

  OTOH the __this_cpu_*() operations are not IRQ-safe and assume
  external preempt/IRQ disabling, and could thus be allowed more room
  for optimization. This makes the this_cpu_*() vs __this_cpu_*()
  behaviour more consistent with other architectures.

  $ ./compare.sh defconfig-build defconfig-build1 vmlinux.o
  x86_pmu_cancel_txn                         80       71   -9,+0
  __text_poke                               919      964  +45,+0
  do_user_addr_fault                       1082     1058  -24,+0
  __do_page_fault                          1194     1178  -16,+0
  do_exit                                  2995     3027  -43,+75
  process_one_work                         1008      989  -67,+48
  finish_task_switch                        524      505  -19,+0
  __schedule_bug                            103       98  -59,+54
  __schedule_bug                            103       98  -59,+54
  __sched_setscheduler                     2015     2030  +15,+0
  freeze_processes                          203      230  +31,-4
  rcu_gp_kthread_wake                       106       99  -7,+0
  rcu_core                                 1841     1834  -7,+0
  call_timer_fn                             298      286  -12,+0
  can_stop_idle_tick                        146      139  -31,+24
  perf_pending_event                        253      239  -14,+0
  shmem_alloc_page                          209      213  +4,+0
  __alloc_pages_slowpath                   3284     3269  -15,+0
  umount_tree                               671      694  +23,+0
  advance_transaction                       803      798  -5,+0
  con_put_char                               71       51  -20,+0
  xhci_urb_enqueue                         1302     1295  -7,+0
  xhci_urb_enqueue                         1302     1295  -7,+0
  tcp_sacktag_write_queue                  2130     2075  -55,+0
  tcp_try_undo_loss                         229      208  -21,+0
  tcp_v4_inbound_md5_hash                   438      411  -31,+4
  tcp_v4_inbound_md5_hash                   438      411  -31,+4
  tcp_v6_inbound_md5_hash                   469      411  -33,-25
  tcp_v6_inbound_md5_hash                   469      411  -33,-25
  restricted_pointer                        434      420  -14,+0
  irq_exit                                  162      154  -8,+0
  get_perf_callchain                        638      624  -14,+0
  rt_mutex_trylock                          169      156  -13,+0
  avc_has_extended_perms                   1092     1089  -3,+0
  avc_has_perm_noaudit                      309      306  -3,+0
  __perf_sw_event                           138      122  -16,+0
  perf_swevent_get_recursion_context        116      102  -14,+0
  __local_bh_enable_ip                       93       72  -21,+0
  xfrm_input                               4175     4161  -14,+0
  avc_has_perm                              446      443  -3,+0
  vm_events_fold_cpu                         57       56  -1,+0
  vfree                                      68       61  -7,+0
  freeze_processes                          203      230  +31,-4
  _local_bh_enable                           44       30  -14,+0
  ip_do_fragment                           1982     1944  -38,+0
  do_exit                                  2995     3027  -43,+75
  __do_softirq                              742      724  -18,+0
  cpu_init                                 1510     1489  -21,+0
  account_system_time                        80       79  -1,+0
  total                                12985281 12984819  -742,+280

  Reported-by: Nadav Amit <nadav.amit@gmail.com>
  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Link: https://lkml.kernel.org/r/20181206112433.GB13675@hirez.programming.kicks-ass.net
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
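  The mechanism, as a sketch: the IRQ-safe forms pass a volatile
  qualifier into the shared asm template, the raw forms pass nothing
  (the size switch of the real macro is collapsed here for brevity):

      #define percpu_from_op(qual, op, var)			\
      ({							\
          typeof(var) pfo_ret__;				\
          /* 'qual' is empty for raw_*, 'volatile' for this_* */\
          asm qual (op " " __percpu_arg(1) ",%0"		\
                    : "=r" (pfo_ret__)				\
                    : "m" (var));				\
          pfo_ret__;						\
      })

      #define raw_cpu_read_4(pcp)	percpu_from_op(, "mov", pcp)
      #define this_cpu_read_4(pcp)	percpu_from_op(volatile, "mov", pcp)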
* x86/percpu: Fix this_cpu_read() (Peter Zijlstra, 2018-10-14; 1 file, -4/+4)

  Eric reported that a sequence count loop using this_cpu_read() got
  optimized out. This is wrong: this_cpu_read() must imply READ_ONCE(),
  because the interface is IRQ-safe, and therefore an interrupt can
  have changed the per-CPU value.

  Fixes: 7c3576d261ce ("[PATCH] i386: Convert PDA into the percpu section")
  Reported-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Acked-by: Eric Dumazet <edumazet@google.com>
  Cc: hpa@zytor.com
  Cc: eric.dumazet@gmail.com
  Cc: bp@alien8.de
  Cc: stable@vger.kernel.org
  Link: https://lkml.kernel.org/r/20181011104019.748208519@infradead.org
* x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize code generation (Uros Bizjak, 2018-06-21; 1 file, -3/+4)

  Use CC_SET(z)/CC_OUT(z) instead of an explicit SETZ instruction.
  Using these two defines, a compiler that supports generation of
  condition-code outputs from inline assembly flags generates e.g.:

    cmpxchg8b %fs:(%esi)
    jne    172255 <__kmalloc+0x65>

  instead of:

    cmpxchg8b %fs:(%esi)
    sete   %al
    test   %al,%al
    je     172255 <__kmalloc+0x65>

  Note that older compilers now generate:

    cmpxchg8b %fs:(%esi)
    sete   %cl
    test   %cl,%cl
    je     173a85 <__kmalloc+0x65>

  since we have to mark that the cmpxchg8b instruction outputs to the
  %eax register and this way clobbers the value in the register.

  Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
  Cc: Andy Lutomirski <luto@kernel.org>
  Cc: Borislav Petkov <bp@alien8.de>
  Cc: Brian Gerst <brgerst@gmail.com>
  Cc: Denys Vlasenko <dvlasenk@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Link: https://lore.kernel.org/lkml/20180605163910.13015-1-ubizjak@gmail.com
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
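  The pattern, as a simplified sketch of percpu_cmpxchg8b_double()
  (operand list abridged):

      bool __ret;

      asm volatile("cmpxchg8b " __percpu_arg(1)
                   CC_SET(z)		/* "; setz %0" on old compilers */
                   : CC_OUT(z) (__ret), "+m" (pcp1), "+m" (pcp2),
                     "+a" (__o1), "+d" (__o2)
                   : "b" (__n1), "c" (__n2));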
* x86/asm: Add instruction suffixes to bitops (Jan Beulich, 2018-02-28; 1 file, -1/+1)

  Omitting suffixes from instructions in AT&T mode is bad practice when
  the operand size cannot be determined by the assembler from register
  operands, and is likely going to be warned about by upstream gas in
  the future (mine does already). Add the missing suffixes here. Note
  that for 64-bit this means some operations change from being 32-bit
  to 64-bit.

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Link: https://lkml.kernel.org/r/5A93F98702000078001ABACC@prv-mh.provo.novell.com
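  For example, in x86_this_cpu_variable_test_bit() the operand size is
  not deducible from a register operand, so the suffix becomes explicit
  (sketch):

      /* Before: size-ambiguous mnemonic */
      asm volatile("bt "__percpu_arg(2)",%1" CC_SET(c)
                   : CC_OUT(c) (oldbit)
                   : "m" (*(unsigned long __percpu *)addr), "Ir" (nr));

      /* After: explicit long-sized form */
      asm volatile("btl "__percpu_arg(2)",%1" CC_SET(c)
                   : CC_OUT(c) (oldbit)
                   : "m" (*(unsigned long __percpu *)addr), "Ir" (nr));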