| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull more documentation updates from Jonathan Corbet:
"A handful of late-arriving documentation fixes, plus one Spanish
translation that has been ready for some time but got applied late"
* tag 'docs-6.4-2' of git://git.lwn.net/linux:
docs/sp_SP: Add translation of process/adding-syscalls
CREDITS: Update email address for Mat Martineau
Documentation: update kernel stack for x86_64
docs: Remove unnecessary unicode character
docs: fix "Reviewd" typo
Documentation: timers: hrtimers: Make hybrid union historical
docs/admin-guide/mm/ksm.rst fix intraface -> interface typo
doc:it_IT: fix some typos
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Translate Documentation/process/adding-syscalls.rst into Spanish.
Co-developed-by: Mauricio Fuentes <mauriciofb@gmail.com>
Signed-off-by: Mauricio Fuentes <mauriciofb@gmail.com>
Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
Link: https://lore.kernel.org/r/20230315143526.1213813-1-carlos.bilbao@amd.com
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Changing my email address in CREDITS to be consistent with what's in use
in MAINTAINERS and mailmap. Also removed extra date information from the
CREDITS entry since I'm a maintainer for MPTCP again.
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230502000545.2899055-1-martineau@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 6538b8ea886e ("x86_64: expand kernel stack to 16K")
expanded kernel stack for x86_64 but left the wrong documentation,
update it.
Signed-off-by: Yan Yan <yanyan.yan@antgroup.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Link: https://lore.kernel.org/r/20230328072537.20188-1-yanyan.yan@antgroup.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There is a non-printable unicode char '\u202a' or "0xe2 0x80 0xaa" in hex
in the translation doc. It is unnecessary and should be removed for better
text formatting when using editors like vi.
Signed-off-by: Tao Liu <ltao@redhat.com>
Reviewed-by: Yanteng Si <siyanteng@loongson.cn>
Link: https://lore.kernel.org/r/20230426021452.9745-1-ltao@redhat.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Spell "Reviewed" properly in Documentation/Translations/jp/SubmittingPatches
Signed-off-by: Deming Wang <wangdeming@inspur.com>
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/20230502010157.1655-1-wangdeming@inspur.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Non-scalar time was removed from the ktime hybrid union in v3.17, and
the union itself followed suit in v4.10.
Make it clear that ktime_t is always a 64bit scalar type, to avoid
confusing the casual reader.
While at it, fix a spelling mistake.
Fixes: 24e4a8c3e8868874 ("ktime: Kill non-scalar ktime_t implementation for 2038")
Fixes: 2456e855354415bf ("ktime: Get rid of the union")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/59250a3d1c2c827b5c1833169a6e652ca6a784e6.1683021785.git.geert+renesas@glider.be
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Fix typo from 'intraface' to 'interface' in admin guide for KSM.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20230424080305.2985-1-donald.hunter@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Fix of some typos spotted reading documentation in italian and latest
changes for 6.4
Signed-off-by: Fabio Fantoni <fantonifabio@tiscali.it>
Reviewed-by: Federico Vaga <federico.vaga@vaga.pv.it>
Link: https://lore.kernel.org/r/20230425110158.9755-1-fantonifabio@tiscali.it
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull more tracing updates from Steven Rostedt:
- Make buffer_percent read/write.
The buffer_percent file is how users can state how long to block on
the tracing buffer depending on how much is in the buffer. When it
hits the "buffer_percent" it will wake the task waiting on the
buffer. For some reason it was set to read-only.
This was not noticed because testing was done as root without
SELinux, but with SELinux it will prevent even root to write to it
without having CAP_DAC_OVERRIDE.
- The "touched_functions" was added this merge window, but one of the
reasons for adding it was not implemented.
That was to show what functions were not only touched, but had either
a direct trampoline attached to it, or a kprobe or live kernel
patching that can "hijack" the function to run a different function.
The point is to know if there's functions in the kernel that may not
be behaving as the kernel code shows. This can be used for debugging.
TODO: Add this information to kernel oops too.
* tag 'trace-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Add MODIFIED flag to show if IPMODIFY or direct was attached
tracing: Fix permissions for the buffer_percent file
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If a function had ever had IPMODIFY or DIRECT attached to it, where this
is how live kernel patching and BPF overrides work, mark them and display
an "M" in the enabled_functions and touched_functions files. This can be
used for debugging. If a function had been modified and later there's a bug
in the code related to that function, this can be used to know if the cause
is possibly from a live kernel patch or a BPF program that changed the
behavior of the code.
Also update the documentation on the enabled_functions and
touched_functions output, as it was missing direct callers and CALL_OPS.
And include this new modify attribute.
Link: https://lore.kernel.org/linux-trace-kernel/20230502213233.004e3ae4@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This file defines both read and write operations, yet it is being
created as read-only. This means that it can't be written to without the
CAP_DAC_OVERRIDE capability. Fix the permissions to allow root to write
to it without the need to override DAC perms.
Link: https://lore.kernel.org/linux-trace-kernel/20230503140114.3280002-1-omosnace@redhat.com
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 03329f993978 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce local{,64}_try_cmpxchg() - a slightly more optimal
primitive, which will be used in perf events ring-buffer code
- Simplify/modify rwsems on PREEMPT_RT, to address writer starvation
- Misc cleanups/fixes
* tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/atomic: Correct (cmp)xchg() instrumentation
locking/x86: Define arch_try_cmpxchg_local()
locking/arch: Wire up local_try_cmpxchg()
locking/generic: Wire up local{,64}_try_cmpxchg()
locking/atomic: Add generic try_cmpxchg{,64}_local() support
locking/rwbase: Mitigate indefinite writer starvation
locking/arch: Rename all internal __xchg() names to __arch_xchg()
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
All xchg() and cmpxchg() ops are atomic RMWs, but currently we
instrument these with instrument_atomic_write() rather than
instrument_atomic_read_write(), missing the read aspect.
Similarly, all try_cmpxchg() ops are non-atomic RMWs on *oldp, but we
instrument these accesses with instrument_atomic_write() rather than
instrument_read_write(), missing the read aspect and erroneously marking
these as atomic.
Fix the instrumentation for both points.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230413160644.490976-1-mark.rutland@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Define target specific arch_try_cmpxchg_local(). This
definition overrides the generic arch_try_cmpxchg_local()
fallback definition and enables target-specific
implementation of try_cmpxchg_local().
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230405141710.3551-5-ubizjak@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Implement target specific support for local_try_cmpxchg()
and local_cmpxchg() using typed C wrappers that call their
_local counterpart and provide additional checking of
their input arguments.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230405141710.3551-4-ubizjak@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Implement generic support for local{,64}_try_cmpxchg().
Redirect to the atomic_ family of functions when the target
does not provide its own local.h definitions.
For 64-bit targets, implement local64_try_cmpxchg and
local64_cmpxchg using typed C wrappers that call local_
family of functions and provide additional checking
of their input arguments.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230405141710.3551-3-ubizjak@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add generic support for try_cmpxchg{,64}_local() and their falbacks.
These provides the generic try_cmpxchg_local family of functions
from the arch_ prefixed version, also adding explicit instrumentation.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230405141710.3551-2-ubizjak@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
On PREEMPT_RT, rw_semaphore and rwlock_t locks are unfair to writers.
Readers can indefinitely acquire the lock unless the writer fully acquired
the lock, which might never happen if there is always a reader in the
critical section owning the lock.
Mel Gorman reported that since LTP-20220121 the dio_truncate test case
went from having 1 reader to having 16 readers and that number of readers
is sufficient to prevent the down_write ever succeeding while readers
exist. Eventually the test is killed after 30 minutes as a failure.
Mel proposed a timeout to limit how long a writer can be blocked until
the reader is forced into the slowpath.
Thomas argued that there is no added value by providing this timeout. From
a PREEMPT_RT point of view, there are no critical rw_semaphore or rwlock_t
locks left where the reader must be preferred.
Mitigate indefinite writer starvation by forcing the READER into the
slowpath once the WRITER attempts to acquire the lock.
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/877cwbq4cq.ffs@tglx
Link: https://lore.kernel.org/r/20230321161140.HMcQEhHb@linutronix.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Decrease the probability of this internal facility to be used by
driver code.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [riscv]
Link: https://lore.kernel.org/r/20230118154450.73842-1-andrzej.hajda@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Merge my x86 uaccess updates branch.
The LAM ("Linear Address Masking") updates in this release made me
unhappy about how "access_ok()" was done, and it actually turned out to
have a couple of small bugs in it too. This is my cleanup of the code:
- use the sign bit of the __user pointer rather than masking the
address and checking it against the TASK_SIZE range.
We already did this part for the get/put_user() side, but
'access_ok()' did the naïve "mask and range check" thing, which not
only generates nasty code, but also ended up meaning that __access_ok
itself didn't do a good job, and so copy_from_user_nmi() didn't get
the check right.
- move all the code that is 64-bit only into the 64-bit version of the
header file, so that we don't unnecessarily pollute the shared x86
code and make it look like LAM might work in 32-bit too.
- fix a bug in the address masking (that doesn't end up mattering: in
this case the fix was to just remove the buggy code entirely).
- a couple of trivial cleanups and added commentary about the
access_ok() rules.
* x86-uaccess-cleanup:
x86-64: mm: clarify the 'positive addresses' user address rules
x86: mm: remove 'sign' games from LAM untagged_addr*() macros
x86: uaccess: move 32-bit and 64-bit parts into proper <asm/uaccess_N.h> header
x86: mm: remove architecture-specific 'access_ok()' define
x86-64: make access_ok() independent of LAM
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Dave Hansen found the "(long) addr >= 0" code in the x86-64 access_ok
checks somewhat confusing, and suggested using a helper to clarify what
the code is doing.
So this does exactly that: clarifying what the sign bit check is all
about, by adding a helper macro that makes it clear what it is testing.
This also adds some explicit comments talking about how even with LAM
enabled, any addresses with the sign bit will still GP-fault in the
non-canonical region just above the sign bit.
This is all what allows us to do the user address checks with just the
sign bit, and furthermore be a bit cavalier about accesses that might be
done with an additional offset even past that point.
(And yes, this talks about 'positive' even though zero is also a valid
user address and so technically we should call them 'non-negative'. But
I don't think using 'non-negative' ends up being more understandable).
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The intent of the sign games was to not modify kernel addresses when
untagging them. However, that had two issues:
(a) it didn't actually work as intended, since the mask was calculated
as 'addr >> 63' on an _unsigned_ address. So instead of getting a
mask of all ones for kernel addresses, you just got '1'.
(b) untagging a kernel address isn't actually a valid operation anyway.
Now, (a) had originally been true for both 'untagged_addr()' and the
remote version of it, but had accidentally been fixed for the regular
version of untagged_addr() by commit e0bddc19ba95 ("x86/mm: Reduce
untagged_addr() overhead for systems without LAM"). That one rewrote
the shift to be part of the alternative asm code, and in the process
changed the unsigned shift into a signed 'sar' instruction.
And while it is true that we don't want to turn what looks like a kernel
address into a user address by masking off the high bit, that doesn't
need these sign masking games - all it needs is that the mm context
'untag_mask' value has the high bit set.
Which it always does.
So simplify the code by just removing the superfluous (and in the case
of untagged_addr_remote(), still buggy) sign bit games in the address
masking.
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The x86 <asm/uaccess.h> file has grown features that are specific to
x86-64 like LAM support and the related access_ok() changes. They
really should be in the <asm/uaccess_64.h> file and not pollute the
generic x86 header.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There's already a generic definition of 'access_ok()' in the
asm-generic/access_ok.h header file, and the only difference bwteen that
and the x86-specific one is the added check for WARN_ON_IN_IRQ().
And it turns out that the reason for that check is long gone: it used to
use a "user_addr_max()" inline function that depended on the current
thread, and caused problems in non-thread contexts.
For details, see commits 7c4788950ba5 ("x86/uaccess, sched/preempt:
Verify access_ok() context") and in particular commit ae31fe51a3cc
("perf/x86: Restore TASK_SIZE check on frame pointer") about how and why
this came to be.
But that "current task" issue was removed in the big set_fs() removal by
Christoph Hellwig in commit 47058bb54b57 ("x86: remove address space
overrides using set_fs()").
So the reason for the test and the architecture-specific access_ok()
define no longer exists, and is actually harmful these days. For
example, it led various 'copy_from_user_nmi()' games (eg using
__range_not_ok() instead, and then later converted to __access_ok() when
that became ok).
And that in turn meant that LAM was broken for the frame following
before this series, because __access_ok() used to not do the address
untagging.
Accessing user state still needs care in many contexts, but access_ok()
is not the place for this test.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The linear address masking (LAM) code made access_ok() more complicated,
in that it now needs to untag the address in order to verify the access
range. See commit 74c228d20a51 ("x86/uaccess: Provide untagged_addr()
and remove tags before address check").
We were able to avoid that overhead in the get_user/put_user code paths
by simply using the sign bit for the address check, and depending on the
GP fault if the address was non-canonical, which made it all independent
of LAM.
And we can do the same thing for access_ok(): simply check that the user
pointer range has the high bit clear. No need to bother with any
address bit masking.
In fact, we can go a bit further, and just check the starting address
for known small accesses ranges: any accesses that overflow will still
be in the non-canonical area and will still GP fault.
To still make syzkaller catch any potentially unchecked user addresses,
we'll continue to warn about GP faults that are caused by accesses in
the non-canonical range. But we'll limit that to purely "high bit set
and past the one-page 'slop' area".
We could probably just do that "check only starting address" for any
arbitrary range size: realistically all kernel accesses to user space
will be done starting at the low address. But let's leave that kind of
optimization for later. As it is, this already allows us to generate
simpler code and not worry about any tag bits in the address.
The one thing to look out for is the GUP address check: instead of
actually copying data in the virtual address range (and thus bad
addresses being caught by the GP fault), GUP will look up the page
tables manually. As a result, the page table limits need to be checked,
and that was previously implicitly done by the access_ok().
With the relaxed access_ok() check, we need to just do an explicit check
for TASK_SIZE_MAX in the GUP code instead. The GUP code already needs
to do the tag bit unmasking anyway, so there this is all very
straightforward, and there are no LAM issues.
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Palmer Dabbelt:
- Support for hibernation
- The .rela.dyn section has been moved to the init area
- A fix for the SBI probing to allow for implementation-defined
behavior
- Various other fixes and cleanups throughout the tree
* tag 'riscv-for-linus-6.4-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: include cpufeature.h in cpufeature.c
riscv: Move .rela.dyn to the init sections
dt-bindings: riscv: explicitly mention assumption of Zicsr & Zifencei support
riscv: compat_syscall_table: Fixup compile warning
RISC-V: fixup in-flight collision with ARCH_WANT_OPTIMIZE_VMEMMAP rename
RISC-V: fix sifive and thead section mismatches in errata
RISC-V: Align SBI probe implementation with spec
riscv: mm: remove redundant parameter of create_fdt_early_page_table
riscv: Adjust dependencies of HAVE_DYNAMIC_FTRACE selection
RISC-V: Add arch functions to support hibernation/suspend-to-disk
RISC-V: mm: Enable huge page support to kernel_page_present() function
RISC-V: Factor out common code of __cpu_resume_enter()
RISC-V: Change suspend_save_csrs and suspend_restore_csrs to public function
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Automation complains:
warning: symbol '__pcpu_scope_misaligned_access_speed' was not declared. Should it be static?
cpufeature.c doesn't actually include the header of the same name, as it
had not previously used anything from it.
The per-cpu variable is declared there, so include it to silence the
complaints.
Fixes: 62a31d6e38bd ("RISC-V: hwprobe: Support probing of misaligned access performance")
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Evan Green <evan@rivosinc.com>
Link: https://lore.kernel.org/r/20230420-wound-gizzard-2b2b589d9bea@spud
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The recent introduction of relocatable kernels prepared the move of
.rela.dyn to the init section, but actually forgot to do so, so do it
here.
Before this patch: "Freeing unused kernel image (initmem) memory: 2592K"
After this patch: "Freeing unused kernel image (initmem) memory: 6288K"
The difference corresponds to the size of the .rela.dyn section:
"[42] .rela.dyn RELA ffffffff8197e798 0127f798
000000000039c660 0000000000000018 A 47 0 8"
Fixes: 559d1e45a16d ("riscv: Use --emit-relocs in order to move .rela.dyn in init")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20230428120932.22735-1-alexghiti@rivosinc.com
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The dt-binding was defined before the extraction of csr access and
fence.i into their own extensions, and thus the presence of the I
base extension implies Zicsr and Zifencei.
There's no harm in adding them obviously, but for backwards
compatibility with DTs that existed prior to that extraction, software
is unable to differentiate between "i" and "i_zicsr_zifencei" without
any further information.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20230427-fence-blurred-c92fb69d4137@wendy
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
../arch/riscv/kernel/compat_syscall_table.c:12:41: warning: initialized
field overwritten [-Woverride-init]
12 | #define __SYSCALL(nr, call) [nr] = (call),
| ^
../include/uapi/asm-generic/unistd.h:567:1: note: in expansion of macro
'__SYSCALL'
567 | __SYSCALL(__NR_semget, sys_semget)
Fixes: 59c10c52f573 ("riscv: compat: syscall: Add compat_sys_call_table implementation")
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Drew Fustini <dfustini@baylibre.com>
Link: https://lore.kernel.org/r/20230501223353.2833899-1-dfustini@baylibre.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Lukas warned that ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP had been
renamed in the mm tree & that RISC-V would need a fixup as part of the
merge. The warning was missed however, and RISC-V is selecting the
orphaned Kconfig option.
Fixes: 89d77f71f493 ("Merge tag 'riscv-for-linus-6.4-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux")
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>.
Link: https://lore.kernel.org/linux-riscv/CAKXUXMyVeg2kQK_edKHtMD3eADrDK_PKhCSVkMrLDdYgTQQ5rg@mail.gmail.com/
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230429-trilogy-jolly-12bf5c53d62d@spud
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When CONFIG_MODULES is set, __init_or_module becomes <empty>, but when
CONFIG_MODULES is not set, __init_or_module becomes __init.
In the latter case, it causes section mismatch warnings:
WARNING: modpost: vmlinux.o: section mismatch in reference: riscv_fill_cpu_mfr_info (section: .text) -> sifive_errata_patch_func (section: .init.text)
WARNING: modpost: vmlinux.o: section mismatch in reference: riscv_fill_cpu_mfr_info (section: .text) -> thead_errata_patch_func (section: .init.text)
Fixes: bb3f89487fd9 ("RISC-V: hwprobe: Remove __init on probe_vendor_features()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Evan Green <evan@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230429155247.12131-1-rdunlap@infradead.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
sbi_probe_extension() is specified with "Returns 0 if the given SBI
extension ID (EID) is not available, or 1 if it is available unless
defined as any other non-zero value by the implementation."
Additionally, sbiret.value is a long. Fix the implementation to
ensure any nonzero long value is considered a success, rather
than only positive int values.
Fixes: b9dcd9e41587 ("RISC-V: Add basic support for SBI v0.2")
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230427163626.101042-1-ajones@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
create_fdt_early_page_table() explicitly uses early_pg_dir for
32-bit fdt mapping and the pgdir parameter is redundant here.
So remove it and its caller.
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Song Shuai <suagrfillet@gmail.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Fixes: ef69d2559fe9 ("riscv: Move early dtb mapping into the fixmap region")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230426100009.685435-1-suagrfillet@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| |\ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Sia Jee Heng <jeeheng.sia@starfivetech.com> says:
This series adds RISC-V Hibernation/suspend to disk support.
Low level Arch functions were created to support hibernation.
swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
cpu state onto the stack, then calling swsusp_save() to save the memory
image.
Arch specific hibernation header is implemented and is utilized by the
arch_hibernation_header_restore() and arch_hibernation_header_save()
functions. The arch specific hibernation header consists of satp, hartid,
and the cpu_resume address. The kernel built version is also need to be
saved into the hibernation image header to making sure only the same
kernel is restore when resume.
swsusp_arch_resume() creates a temporary page table that covering only
the linear map. It copies the restore code to a 'safe' page, then start to
restore the memory image. Once completed, it restores the original
kernel's page table. It then calls into __hibernate_cpu_resume()
to restore the CPU context. Finally, it follows the normal hibernation
path back to the hibernation core.
To enable hibernation/suspend to disk into RISCV, the below config
need to be enabled:
- CONFIG_HIBERNATION
- CONFIG_ARCH_HIBERNATION_HEADER
- CONFIG_ARCH_HIBERNATION_POSSIBLE
At high-level, this series includes the following changes:
1) Change suspend_save_csrs() and suspend_restore_csrs()
to public function as these functions are common to
suspend/hibernation. (patch 1)
2) Refactor the common code in the __cpu_resume_enter() function and
__hibernate_cpu_resume() function. The common code are used by
hibernation and suspend. (patch 2)
3) Enhance kernel_page_present() function to support huge page. (patch 3)
4) Add arch/riscv low level functions to support
hibernation/suspend to disk. (patch 4)
* b4-shazam-merge:
RISC-V: Add arch functions to support hibernation/suspend-to-disk
RISC-V: mm: Enable huge page support to kernel_page_present() function
RISC-V: Factor out common code of __cpu_resume_enter()
RISC-V: Change suspend_save_csrs and suspend_restore_csrs to public function
Link: https://lore.kernel.org/r/20230330064321.1008373-1-jeeheng.sia@starfivetech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Low level Arch functions were created to support hibernation.
swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
cpu state onto the stack, then calling swsusp_save() to save the memory
image.
Arch specific hibernation header is implemented and is utilized by the
arch_hibernation_header_restore() and arch_hibernation_header_save()
functions. The arch specific hibernation header consists of satp, hartid,
and the cpu_resume address. The kernel built version is also need to be
saved into the hibernation image header to making sure only the same
kernel is restore when resume.
swsusp_arch_resume() creates a temporary page table that covering only
the linear map. It copies the restore code to a 'safe' page, then start
to restore the memory image. Once completed, it restores the original
kernel's page table. It then calls into __hibernate_cpu_resume()
to restore the CPU context. Finally, it follows the normal hibernation
path back to the hibernation core.
To enable hibernation/suspend to disk into RISCV, the below config
need to be enabled:
- CONFIG_HIBERNATION
- CONFIG_ARCH_HIBERNATION_HEADER
- CONFIG_ARCH_HIBERNATION_POSSIBLE
Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20230330064321.1008373-5-jeeheng.sia@starfivetech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Currently kernel_page_present() function doesn't support huge page
detection causes the function to mistakenly return false to the
hibernation core.
Add huge page detection to the function to solve the problem.
Fixes: 9e953cda5cdf ("riscv: Introduce huge page support for 32/64bit kernel")
Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20230330064321.1008373-4-jeeheng.sia@starfivetech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The cpu_resume() function is very similar for the suspend to disk and
suspend to ram cases. Factor out the common code into suspend_restore_csrs
macro and suspend_restore_regs macro.
Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230330064321.1008373-3-jeeheng.sia@starfivetech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Currently suspend_save_csrs() and suspend_restore_csrs() functions are
statically defined in the suspend.c. Change the function's attribute
to public so that the functions can be used by hibernation as well.
Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20230330064321.1008373-2-jeeheng.sia@starfivetech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
When building allmodconfig with clang and its integrated assembler and
linking with a version of GNU ld prior to 2.36, the following link error
occurs:
riscv64-linux-gnu-ld: .init.data has both ordered [`__patchable_function_entries' in init/main.o] and unordered [`.init_array.0' in kernel/trace/trace_benchmark.o] sections
riscv64-linux-gnu-ld: final link failed: bad value
This is the same error addressed by commit 45bd8951806e ("arm64: Improve
HAVE_DYNAMIC_FTRACE_WITH_REGS selection for clang") for arm64. See that
changelog for a full description of why this error occurs with this
combination of tools.
In a similar manner as that change, restrict the
CONFIG_HAVE_DYNAMIC_FTRACE selection to combinations of tools known to
work so that there are no errors.
Link: https://github.com/ClangBuiltLinux/linux/issues/1817
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230404-riscv-dynamic-ftrace-checks-clang-v1-1-0ce296b7d423@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|\ \ \ \ \ \ \
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
Pull more kvm updates from Paolo Bonzini:
"This includes the 6.4 changes for RISC-V, and a few bugfix patches for
other architectures. For x86, this closes a longstanding performance
issue in the newer and (usually) more scalable page table management
code.
RISC-V:
- ONE_REG interface to enable/disable SBI extensions
- Zbb extension for Guest/VM
- AIA CSR virtualization
x86:
- Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU
can result in the root being freed even though the root is
completely valid and can be reused as-is (with a TLB flush).
s390:
- A couple of bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: fix race in gmap_make_secure()
KVM: s390: pv: fix asynchronous teardown for small VMs
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
RISC-V: KVM: Virtualize per-HART AIA CSRs
RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask
RISC-V: KVM: Add ONE_REG interface for AIA CSRs
RISC-V: KVM: Implement subtype for CSR ONE_REG interface
RISC-V: KVM: Initial skeletal support for AIA
RISC-V: KVM: Drop the _MASK suffix from hgatp.VMID mask defines
RISC-V: Detect AIA CSRs from ISA string
RISC-V: Add AIA related CSR defines
RISC-V: KVM: Allow Zbb extension for Guest/VM
RISC-V: KVM: Add ONE_REG interface to enable/disable SBI extensions
RISC-V: KVM: Alphabetize selects
KVM: RISC-V: Retry fault if vma_lookup() results become invalid
|
| |\ \ \ \ \ \ \
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
For 6.4
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
Fix a potential race in gmap_make_secure() and remove the last user of
follow_page() without FOLL_GET.
The old code is locking something it doesn't have a reference to, and
as explained by Jason and David in this discussion:
https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/
it can lead to all kind of bad things, including the page getting
unmapped (MADV_DONTNEED), freed, reallocated as a larger folio and the
unlock_page() would target the wrong bit.
There is also another race with the FOLL_WRITE, which could race
between the follow_page() and the get_locked_pte().
The main point is to remove the last use of follow_page() without
FOLL_GET or FOLL_PIN, removing the races can be considered a nice
bonus.
Link: https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-Id: <20230428092753.27913-2-imbrenda@linux.ibm.com>
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
On machines without the Destroy Secure Configuration Fast UVC, the
topmost level of page tables is set aside and freed asynchronously
as last step of the asynchronous teardown.
Each gmap has a host_to_guest radix tree mapping host (userspace)
addresses (with 1M granularity) to gmap segment table entries (pmds).
If a guest is smaller than 2GB, the topmost level of page tables is the
segment table (i.e. there are only 2 levels). Replacing it means that
the pointers in the host_to_guest mapping would become stale and cause
all kinds of nasty issues.
This patch fixes the issue by disallowing asynchronous teardown for
guests with only 2 levels of page tables. Userspace should (and already
does) try using the normal destroy if the asynchronous one fails.
Update s390_replace_asce so it refuses to replace segment type ASCEs.
This is still needed in case the normal destroy VM fails.
Fixes: fb491d5500a7 ("KVM: s390: pv: asynchronous destroy for reboot")
Reviewed-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-Id: <20230421085036.52511-2-imbrenda@linux.ibm.com>
|
| |\ \ \ \ \ \ \ \
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | | |
Fix a long-standing flaw in x86's TDP MMU where unloading roots on a vCPU can
result in the root being freed even though the root is completely valid and
can be reused as-is (with a TLB flush).
|
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | | |
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
| |\ \ \ \ \ \ \ \ \
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | | |
KVM/riscv changes for 6.4
- ONE_REG interface to enable/disable SBI extensions
- Zbb extension for Guest/VM
- AIA CSR virtualization
|
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | | |
The AIA specification introduce per-HART AIA CSRs which primarily
support:
* 64 local interrupts on both RV64 and RV32
* priority for each of the 64 local interrupts
* interrupt filtering for local interrupts
This patch virtualize above mentioned AIA CSRs and also extend
ONE_REG interface to allow user-space save/restore Guest/VM
view of these CSRs.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | | |
To support 64 VCPU local interrupts on RV32 host, we should use
bitmap for irqs_pending and irqs_pending_mask in struct kvm_vcpu_arch.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|