| Commit message (Collapse) | Author | Files | Lines |
|
Replace uses of sev_es_active() with the more generic cc_platform_has()
using CC_ATTR_GUEST_STATE_ENCRYPT. If future support is added for other
memory encyrption techonologies, the use of CC_ATTR_GUEST_STATE_ENCRYPT
can be updated, as required.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-8-bp@alien8.de
|
|
Replace uses of sev_active() with the more generic cc_platform_has()
using CC_ATTR_GUEST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_GUEST_MEM_ENCRYPT
can be updated, as required.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-7-bp@alien8.de
|
|
Replace uses of sme_active() with the more generic cc_platform_has()
using CC_ATTR_HOST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_HOST_MEM_ENCRYPT
can be updated, as required.
This also replaces two usages of sev_active() that are really geared
towards detecting if SME is active.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-6-bp@alien8.de
|
|
Introduce a powerpc version of the cc_platform_has() function. This will
be used to replace the powerpc mem_encrypt_active() implementation, so
the implementation will initially only support the CC_ATTR_MEM_ENCRYPT
attribute.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lkml.kernel.org/r/20210928191009.32551-5-bp@alien8.de
|
|
Introduce an x86 version of the cc_platform_has() function. This will be
used to replace vendor specific calls like sme_active(), sev_active(),
etc.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de
|
|
In preparation for other confidential computing technologies, introduce
a generic helper function, cc_platform_has(), that can be used to
check for specific active confidential computing attributes, like
memory encryption. This is intended to eliminate having to add multiple
technology-specific checks to the code (e.g. if (sev_active() ||
tdx_active() || ... ).
[ bp: s/_CC_PLATFORM_H/_LINUX_CC_PLATFORM_H/g ]
Co-developed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-3-bp@alien8.de
|
|
In preparation for other uses of the cc_platform_has() function
besides AMD's memory encryption support, selectively build the
AMD memory encryption architecture override functions only when
CONFIG_AMD_MEM_ENCRYPT=y. These functions are:
- early_memremap_pgprot_adjust()
- arch_memremap_can_ram_remap()
Additionally, routines that are only invoked by these architecture
override functions can also be conditionally built. These functions are:
- memremap_should_map_decrypted()
- memremap_is_efi_data()
- memremap_is_setup_data()
- early_memremap_is_setup_data()
And finally, phys_mem_access_encrypted() is conditionally built as well,
but requires a static inline version of it when CONFIG_AMD_MEM_ENCRYPT is
not set.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-2-bp@alien8.de
|
|
|
|
In commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.
Unfortunately, this will cause kernel to fail to start with:
1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
Failed to execute /init (error -17)
The reason is that the elf interpreter (ld.so) has overlapping segments.
readelf -l ld-2.31.so
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002c94c 0x000000000002c94c R E 0x10000
LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
0x00000000000021e8 0x0000000000002320 RW 0x10000
LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
0x00000000000011ac 0x0000000000001328 RW 0x10000
The reason for this problem is the same as described in commit
ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").
Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.
Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The objtool warning that the kvm instruction emulation code triggered
wasn't very useful:
arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception
in that it helpfully tells you which symbol name it had trouble figuring
out the relocation for, but it doesn't actually say what the unknown
symbol type was that triggered it all.
In this case it was because of missing type information (type 0, aka
STT_NOTYPE), but on the whole it really should just have printed that
out as part of the message.
Because if this warning triggers, that's very much the first thing you
want to know - why did reloc2sec_off() return failure for that symbol?
So rather than just saying you can't handle some type of symbol without
saying what the type _was_, just print out the type number too.
Fixes: 24ff65257375 ("objtool: Teach get_alt_entry() about more relocation types")
Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The recent change to make objtool aware of more symbol relocation types
(commit 24ff65257375: "objtool: Teach get_alt_entry() about more
relocation types") also added another check, and resulted in this
objtool warning when building kvm on x86:
arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception
The reason seems to be that kvm_fastop_exception() is marked as a global
symbol, which causes the relocation to ke kept around for objtool. And
at the same time, the kvm_fastop_exception definition (which is done as
an inline asm statement) doesn't actually set the type of the global,
which then makes objtool unhappy.
The minimal fix is to just not mark kvm_fastop_exception as being a
global symbol. It's only used in that one compilation unit anyway, so
it was always pointless. That's how all the other local exception table
labels are done.
I'm not entirely happy about the kinds of games that the kvm code plays
with doing its own exception handling, and the fact that it confused
objtool is most definitely a symptom of the code being a bit too subtle
and ad-hoc. But at least this trivial one-liner makes objtool no longer
upset about what is going on.
Fixes: 24ff65257375 ("objtool: Teach get_alt_entry() about more relocation types")
Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/
Cc: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In cachefiles_mark_object_buried, the dentry in question may not have an
owner, and thus our cachefiles_object pointer may be NULL when calling
the tracepoint, in which case we will also not have a valid debug_id to
print in the tracepoint.
Check for NULL object in the tracepoint and if so, just set debug_id to
MAX_UINT as was done in 2908f5e101e3 ("fscache: Add a cookie debug ID
and use that in traces").
This fixes the following oops:
FS-Cache: Cache "mycache" added (type cachefiles)
CacheFiles: File cache on vdc registered
...
Workqueue: fscache_object fscache_object_work_func [fscache]
RIP: 0010:trace_event_raw_event_cachefiles_mark_buried+0x4e/0xa0 [cachefiles]
....
Call Trace:
cachefiles_mark_object_buried+0xa5/0xb0 [cachefiles]
cachefiles_bury_object+0x270/0x430 [cachefiles]
cachefiles_walk_to_object+0x195/0x9c0 [cachefiles]
cachefiles_lookup_object+0x5a/0xc0 [cachefiles]
fscache_look_up_object+0xd7/0x160 [fscache]
fscache_object_work_func+0xb2/0x340 [fscache]
process_one_work+0x1f1/0x390
worker_thread+0x53/0x3e0
kthread+0x127/0x150
Fixes: 2908f5e101e3 ("fscache: Add a cookie debug ID and use that in traces")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
5.15-rc1 crashes with blank screen when booting up on two ThinkPads
using i915. Bisections converge convincingly, but arrive at different
and suprising "culprits", none of them the actual culprit.
netconsole (with init_netconsole() hacked to call i915_init() when
logging has started, instead of by module_init()) tells the story:
kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
function needs to be 4-byte aligned.
Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
structure field
If driver read tmp value sufficient for
(tmp & 0x08) && (!(tmp & 0x80)) && ((tmp & 0x7) == ((tmp >> 4) & 0x7))
from device then Null pointer dereference occurs.
(It is possible if tmp = 0b0xyz1xyz, where same literals mean same numbers)
Also lm75[] does not serve a purpose anymore after switching to
devm_i2c_new_dummy_device() in w83791d_detect_subclients().
The patch fixes possible NULL pointer dereference by removing lm75[].
Found by Linux Driver Verification project (linuxtesting.org).
Cc: stable@vger.kernel.org
Signed-off-by: Nadezda Lutovinova <lutovinova@ispras.ru>
Link: https://lore.kernel.org/r/20210921155153.28098-3-lutovinova@ispras.ru
[groeck: Dropped unnecessary continuation lines, fixed multi-line alignments]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
structure field
If driver read val value sufficient for
(val & 0x08) && (!(val & 0x80)) && ((val & 0x7) == ((val >> 4) & 0x7))
from device then Null pointer dereference occurs.
(It is possible if tmp = 0b0xyz1xyz, where same literals mean same numbers)
Also lm75[] does not serve a purpose anymore after switching to
devm_i2c_new_dummy_device() in w83791d_detect_subclients().
The patch fixes possible NULL pointer dereference by removing lm75[].
Found by Linux Driver Verification project (linuxtesting.org).
Cc: stable@vger.kernel.org
Signed-off-by: Nadezda Lutovinova <lutovinova@ispras.ru>
Link: https://lore.kernel.org/r/20210921155153.28098-2-lutovinova@ispras.ru
[groeck: Dropped unnecessary continuation lines, fixed multipline alignment]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
structure field
If driver read val value sufficient for
(val & 0x08) && (!(val & 0x80)) && ((val & 0x7) == ((val >> 4) & 0x7))
from device then Null pointer dereference occurs.
(It is possible if tmp = 0b0xyz1xyz, where same literals mean same numbers)
Also lm75[] does not serve a purpose anymore after switching to
devm_i2c_new_dummy_device() in w83791d_detect_subclients().
The patch fixes possible NULL pointer dereference by removing lm75[].
Found by Linux Driver Verification project (linuxtesting.org).
Cc: stable@vger.kernel.org
Signed-off-by: Nadezda Lutovinova <lutovinova@ispras.ru>
Link: https://lore.kernel.org/r/20210921155153.28098-1-lutovinova@ispras.ru
[groeck: Dropped unnecessary continuation lines, fixed multi-line alignment]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Add missed attribute for reading POUT from page 1.
It is supported by device, but has been missed in initial commit.
Fixes: 2c6fcbb21149 ("hwmon: (pmbus) Add support for MPS Multi-phase mp2975 controller")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Link: https://lore.kernel.org/r/20210927070740.2149290-1-vadimp@nvidia.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
The bytes for max_power_out from the ibm-cffps devices differ in byte
order for some power supplies.
The Witherspoon power supply returns the bytes in MSB/LSB order.
The Rainier power supply returns the bytes in LSB/MSB order.
The Witherspoon power supply uses version cffps1. The Rainier power
supply should use version cffps2. If version is cffps1, swap the bytes
before output to max_power_out.
Tested:
Witherspoon before: 3148. Witherspoon after: 3148.
Rainier before: 53255. Rainier after: 2000.
Signed-off-by: Brandon Wyman <bjwyman@gmail.com>
Reviewed-by: Eddie James <eajames@linux.ibm.com>
Link: https://lore.kernel.org/r/20210928205051.1222815-1-bjwyman@gmail.com
[groeck: Replaced yoda programming]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
The P10 (temp sensor version 0x10) doesn't do the same VRM status
reporting that was used on P9. It just reports the temperature, so
drop the check for VRM fru type in the sysfs show function, and don't
set the name to "alarm".
Fixes: db4919ec86 ("hwmon: (occ) Add new temperature sensor type")
Signed-off-by: Eddie James <eajames@linux.ibm.com>
Link: https://lore.kernel.org/r/20210929153604.14968-1-eajames@linux.ibm.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Because Rui is now going to focus on work that is not related to the
maintenance of the thermal subsystem in the kernel, Rafael will start
to help Daniel with handling the development process as a new member
of the thermal maintainers team. Rui will continue to review patches
in that area.
The thermal development process flow will change so that the material
from the thermal git tree will be merged into the thermal branch of
the linux-pm.git tree before going into the mainline.
Update the information in MAINTAINERS accordingly.
Signed-off-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We have never supported fasync properly, it would only fire when there
is something polling io_uring making it useless. The original support came
in through the initial io_uring merge for 5.1. Since it's broken and
nobody has reported it, get rid of the fasync bits.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
vmlinux.o: warning: objtool: check_preemption_disabled()+0x81: call to is_percpu_thread() leaves .noinstr.text section
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928084218.063371959@infradead.org
|
|
This patch null-terminates the temporary buffer in sched_scaling_write()
so kstrtouint() does not return failure and checks the value is valid.
Before:
$ cat /sys/kernel/debug/sched/tunable_scaling
1
$ echo 0 > /sys/kernel/debug/sched/tunable_scaling
-bash: echo: write error: Invalid argument
$ cat /sys/kernel/debug/sched/tunable_scaling
1
After:
$ cat /sys/kernel/debug/sched/tunable_scaling
1
$ echo 0 > /sys/kernel/debug/sched/tunable_scaling
$ cat /sys/kernel/debug/sched/tunable_scaling
0
$ echo 3 > /sys/kernel/debug/sched/tunable_scaling
-bash: echo: write error: Invalid argument
Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
|
|
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to
list on unthrottle") we add cfs_rqs with no runnable tasks but not fully
decayed into the load (leaf) list. We may ignore adding some ancestors
and therefore breaking tmp_alone_branch invariant. This broke LTP test
cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2
("sched/fair: Ensure that the CFS parent is added after unthrottling").
I noticed the named test still fails even with the fix (but with low
probability, 1 in ~1000 executions of the test). The reason is when
bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of
the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly.
Fix this by adding ancestors if we notice the unthrottled cfs_rq was
added to the load list.
Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Odin Ugedal <odin@uged.al>
Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
|
|
Users of rdpmc rely on the mmapped user page to calculate accurate
time_enabled. Currently, userpage->time_enabled is only updated when the
event is added to the pmu. As a result, inactive event (due to counter
multiplexing) does not have accurate userpage->time_enabled. This can
be reproduced with something like:
/* open 20 task perf_event "cycles", to create multiplexing */
fd = perf_event_open(); /* open task perf_event "cycles" */
userpage = mmap(fd); /* use mmap and rdmpc */
while (true) {
time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
time_enabled_read = read(fd).time_enabled;
if (time_enabled_mmap > time_enabled_read)
BUG();
}
Fix this by updating userpage for inactive events in merge_sched_in.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-and-tested-by: Lucian Grijincu <lucian@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210929194313.2398474-1-songliubraving@fb.com
|
|
According to the latest event list, the event encoding 0xEF is only
available on the first 4 counters. Add it into the event constraints
table.
Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1632842343-25862-1-git-send-email-kan.liang@linux.intel.com
|
|
perf_init_event tries multiple init callbacks and does not reset the
event state between tries. When x86_pmu_event_init runs, it
unconditionally sets the destroy callback to hw_perf_event_destroy. On
the next init attempt after x86_pmu_event_init, in perf_try_init_event,
if the pmu's capabilities includes PERF_PMU_CAP_NO_EXCLUDE, the destroy
callback will be run. However, if the next init didn't set the destroy
callback, hw_perf_event_destroy will be run (since the callback wasn't
reset).
Looking at other pmu init functions, the common pattern is to only set
the destroy callback on a successful init. Resetting the callback on
failure tries to replicate that pattern.
This was discovered after commit f11dd0d80555 ("perf/x86/amd/ibs: Extend
PERF_PMU_CAP_NO_EXCLUDE to IBS Op") when the second (and only second)
run of the perf tool after a reboot results in 0 samples being
generated. The extra run of hw_perf_event_destroy results in
active_events having an extra decrement on each perf run. The second run
has active_events == 0 and every subsequent run has active_events < 0.
When active_events == 0, the NMI handler will early-out and not record
any samples.
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210929170405.1.I078b98ee7727f9ae9d6df8262bad7e325e40faf0@changeid
|
|
Occasionally objtool encounters symbol (as opposed to section)
relocations in .altinstructions. Typically they are the alternatives
written by elf_add_alternative() as encountered on a noinstr
validation run on vmlinux after having already ran objtool on the
individual .o files.
Basically this is the counterpart of commit 44f6a7c0755d ("objtool:
Fix seg fault with Clang non-section symbols"), because when these new
assemblers (binutils now also does this) strip the section symbols,
elf_add_reloc_to_insn() is forced to emit symbol based relocations.
As such, teach get_alt_entry() about different relocation types.
Fixes: 9bc0bb50727c ("objtool/x86: Rewrite retpoline thunk calls")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/YVWUvknIEVNkPvnP@hirez.programming.kicks-ass.net
|
|
Commit 8e33fadf945a ("ext4: remove an unnecessary if statement in
__ext4_get_inode_loc()") forget to recheck buffer's uptodate bit again
under buffer lock, which may overwrite the buffer if someone else have
already brought it uptodate and changed it.
Fixes: 8e33fadf945a ("ext4: remove an unnecessary if statement in __ext4_get_inode_loc()")
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210910080316.70421-1-yi.zhang@huawei.com
|
|
When ext4_htree_fill_tree() fails, ext4_dx_readdir() can run into an
infinite loop since if info->last_pos != ctx->pos this will reset the
directory scan and reread the failing entry. For example:
1. a dx_dir which has 3 block, block 0 as dx_root block, block 1/2 as
leaf block which own the ext4_dir_entry_2
2. block 1 read ok and call_filldir which will fill the dirent and update
the ctx->pos
3. block 2 read fail, but we has already fill some dirent, so we will
return back to userspace will a positive return val(see ksys_getdents64)
4. the second ext4_dx_readdir will reset the world since info->last_pos
!= ctx->pos, and will also init the curr_hash which pos to block 1
5. So we will read block1 too, and once block2 still read fail, we can
only fill one dirent because the hash of the entry in block1(besides
the last one) won't greater than curr_hash
6. this time, we forget update last_pos too since the read for block2
will fail, and since we has got the one entry, ksys_getdents64 can
return success
7. Latter we will trapped in a loop with step 4~6
Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210914111415.3921954-1-yangerkun@huawei.com
|
|
The error path in ext4_fill_super forget to flush s_error_work before
journal destroy, and it may trigger the follow bug since
flush_stashed_error_work can run concurrently with journal destroy
without any protection for sbi->s_journal.
[32031.740193] EXT4-fs (loop66): get root inode failed
[32031.740484] EXT4-fs (loop66): mount failed
[32031.759805] ------------[ cut here ]------------
[32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
[32031.760075] invalid opcode: 0000 [#1] SMP PTI
[32031.760336] CPU: 5 PID: 1029268 Comm: kworker/5:1 Kdump: loaded
4.18.0
[32031.765112] Call Trace:
[32031.765375] ? __switch_to_asm+0x35/0x70
[32031.765635] ? __switch_to_asm+0x41/0x70
[32031.765893] ? __switch_to_asm+0x35/0x70
[32031.766148] ? __switch_to_asm+0x41/0x70
[32031.766405] ? _cond_resched+0x15/0x40
[32031.766665] jbd2__journal_start+0xf1/0x1f0 [jbd2]
[32031.766934] jbd2_journal_start+0x19/0x20 [jbd2]
[32031.767218] flush_stashed_error_work+0x30/0x90 [ext4]
[32031.767487] process_one_work+0x195/0x390
[32031.767747] worker_thread+0x30/0x390
[32031.768007] ? process_one_work+0x390/0x390
[32031.768265] kthread+0x10d/0x130
[32031.768521] ? kthread_flush_work_fn+0x10/0x10
[32031.768778] ret_from_fork+0x35/0x40
static int start_this_handle(...)
BUG_ON(journal->j_flags & JBD2_UNMOUNT); <---- Trigger this
Besides, after we enable fast commit, ext4_fc_replay can add work to
s_error_work but return success, so the latter journal destroy in
ext4_load_journal can trigger this problem too.
Fix this problem with two steps:
1. Call ext4_commit_super directly in ext4_handle_error for the case
that called from ext4_fc_replay
2. Since it's hard to pair the init and flush for s_error_work, we'd
better add a extras flush_work before journal destroy in
ext4_fill_super
Besides, this patch will call ext4_commit_super in ext4_handle_error for
any nojournal case too. But it seems safe since the reason we call
schedule_work was that we should save error info to sb through journal
if available. Conversely, for the nojournal case, it seems useless delay
commit superblock to s_error_work.
Fixes: c92dc856848f ("ext4: defer saving error info from atomic context")
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210924093917.1953239-1-yangerkun@huawei.com
|
|
We should use unsigned long long rather than loff_t to avoid
overflow in ext4_max_bitmap_size() for comparison before returning.
w/o this patch sbi->s_bitmap_maxbytes was becoming a negative
value due to overflow of upper_limit (with has_huge_files as true)
Below is a quick test to trigger it on a 64KB pagesize system.
sudo mkfs.ext4 -b 65536 -O ^has_extents,^64bit /dev/loop2
sudo mount /dev/loop2 /mnt
sudo echo "hello" > /mnt/hello -> This will error out with
"echo: write error: File too large"
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/594f409e2c543e90fd836b78188dfa5c575065ba.1622867594.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When ext4_insert_delayed block receives and recovers from an error from
ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
space it has reserved for that block insertion as it should. One effect
of this bug is that s_dirtyclusters_counter is not decremented and
remains incorrectly elevated until the file system has been unmounted.
This can result in premature ENOSPC returns and apparent loss of free
space.
Another effect of this bug is that
/sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even
after syncfs has been executed on the filesystem.
Besides, add check for s_dirtyclusters_counter when inode is going to be
evicted and freed. s_dirtyclusters_counter can still keep non-zero until
inode is written back in .evict_inode(), and thus the check is delayed
to .destroy_inode().
Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree")
Cc: stable@kernel.org
Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210823061358.84473-1-jefflexu@linux.alibaba.com
|
|
Now EXT4_FC_TAG_ADD_RANGE uses ext4_extent to track the
newly-added blocks, but the limit on the max value of
ee_len field is ignored, and it can lead to BUG_ON as
shown below when running command "fallocate -l 128M file"
on a fast_commit-enabled fs:
kernel BUG at fs/ext4/ext4_extents.h:199!
invalid opcode: 0000 [#1] SMP PTI
CPU: 3 PID: 624 Comm: fallocate Not tainted 5.14.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:ext4_fc_write_inode_data+0x1f3/0x200
Call Trace:
? ext4_fc_write_inode+0xf2/0x150
ext4_fc_commit+0x93b/0xa00
? ext4_fallocate+0x1ad/0x10d0
ext4_sync_file+0x157/0x340
? ext4_sync_file+0x157/0x340
vfs_fsync_range+0x49/0x80
do_fsync+0x3d/0x70
__x64_sys_fsync+0x14/0x20
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Simply fixing it by limiting the number of blocks
in one EXT4_FC_TAG_ADD_RANGE TLV.
Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: stable@kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210820044505.474318-1-houtao1@huawei.com
|
|
The kmalloc() does not have a NULL check. This code can be re-written
slightly cleaner to just use the kstrdup().
Fixes: 265fd1991c1d ("ksmbd: use LOOKUP_BENEATH to prevent the out of share access")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Validate that the transform and smb request headers are present
before checking OriginalMessageSize and SessionId fields.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add buffer validation for SMB2_CREATE_CONTEXT.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This patch add validation to check request buffer check in smb2
negotiate and fix null pointer deferencing oops in smb3_preauth_hash_rsp()
that found from manual test.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add buffer validation in smb2_set_info, and remove unused variable
in set_file_basic_info. and smb2_set_info infolevel functions take
structure pointer argument.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Use correct basic info level in set/get_file_basic_info().
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
In order to fix this issue, this patch adds a new spinlock that needs
to be used whenever these fields are read or written.
Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
reading sk->sk_peer_pid which makes no sense, as this field
is only possibly set by AF_UNIX sockets.
We will have to clean this in a separate patch.
This could be done by reverting b48596d1dc25 "Bluetooth: L2CAP: Add get_peer_pid callback"
or implementing what was truly expected.
Fixes: 109f6e39fa07 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When STMMAC is paired with Energy-Efficient Ethernet(EEE) capable PHY,
and the PHY is advertising EEE by default, we need to enable EEE on the
xPCS side too, instead of having user to manually trigger the enabling
config via ethtool.
Fixed this by adding xpcs_config_eee() call in stmmac_eee_init().
Fixes: 7617af3d1a5e ("net: pcs: Introducing support for DWC xpcs Energy Efficient Ethernet")
Cc: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com>
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
struct dev_addr_list is used for device addresses, unicast addresses
and multicast addresses. The first of those needs special handling
of the main address - netdev->dev_addr points directly the data
of the entry and drivers write to it freely, so we can't maintain
it in the rbtree (for now, at least, to be fixed in net-next).
Current work around sprinkles special handling of the first
address on the list throughout the code but it missed the case
where address is being added. First address will not be visible
during subsequent adds.
Syzbot found a warning where unicast addresses are modified
without holding the rtnl lock, tl;dr is that team generates
the same modification multiple times, not necessarily when
right locks are held.
In the repro we have:
macvlan -> team -> veth
macvlan adds a unicast address to the team. Team then pushes
that address down to its memebers (veths). Next something unrelated
makes team sync member addrs again, and because of the bug
the addr entries get duplicated in the veths. macvlan gets
removed, removes its addr from team which removes only one
of the duplicated addresses from veths. This removal is done
under rtnl. Next syzbot uses iptables to add a multicast addr
to team (which does not hold rtnl lock). Team syncs veth addrs,
but because veths' unicast list still has the duplicate it will
also get sync, even though this update is intended for mc addresses.
Again, uc address updates need rtnl lock, boom.
Reported-by: syzbot+7a2ab2cdc14d134de553@syzkaller.appspotmail.com
Fixes: 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Patch that refactored fl_walk() to use idr_for_each_entry_continue_ul()
also removed rcu protection of individual filters which causes following
use-after-free when filter is deleted concurrently. Fix fl_walk() to obtain
rcu read lock while iterating and taking the filter reference and temporary
release the lock while calling arg->fn() callback that can sleep.
KASAN trace:
[ 352.773640] ==================================================================
[ 352.775041] BUG: KASAN: use-after-free in fl_walk+0x159/0x240 [cls_flower]
[ 352.776304] Read of size 4 at addr ffff8881c8251480 by task tc/2987
[ 352.777862] CPU: 3 PID: 2987 Comm: tc Not tainted 5.15.0-rc2+ #2
[ 352.778980] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 352.781022] Call Trace:
[ 352.781573] dump_stack_lvl+0x46/0x5a
[ 352.782332] print_address_description.constprop.0+0x1f/0x140
[ 352.783400] ? fl_walk+0x159/0x240 [cls_flower]
[ 352.784292] ? fl_walk+0x159/0x240 [cls_flower]
[ 352.785138] kasan_report.cold+0x83/0xdf
[ 352.785851] ? fl_walk+0x159/0x240 [cls_flower]
[ 352.786587] kasan_check_range+0x145/0x1a0
[ 352.787337] fl_walk+0x159/0x240 [cls_flower]
[ 352.788163] ? fl_put+0x10/0x10 [cls_flower]
[ 352.789007] ? __mutex_unlock_slowpath.constprop.0+0x220/0x220
[ 352.790102] tcf_chain_dump+0x231/0x450
[ 352.790878] ? tcf_chain_tp_delete_empty+0x170/0x170
[ 352.791833] ? __might_sleep+0x2e/0xc0
[ 352.792594] ? tfilter_notify+0x170/0x170
[ 352.793400] ? __mutex_unlock_slowpath.constprop.0+0x220/0x220
[ 352.794477] tc_dump_tfilter+0x385/0x4b0
[ 352.795262] ? tc_new_tfilter+0x1180/0x1180
[ 352.796103] ? __mod_node_page_state+0x1f/0xc0
[ 352.796974] ? __build_skb_around+0x10e/0x130
[ 352.797826] netlink_dump+0x2c0/0x560
[ 352.798563] ? netlink_getsockopt+0x430/0x430
[ 352.799433] ? __mutex_unlock_slowpath.constprop.0+0x220/0x220
[ 352.800542] __netlink_dump_start+0x356/0x440
[ 352.801397] rtnetlink_rcv_msg+0x3ff/0x550
[ 352.802190] ? tc_new_tfilter+0x1180/0x1180
[ 352.802872] ? rtnl_calcit.isra.0+0x1f0/0x1f0
[ 352.803668] ? tc_new_tfilter+0x1180/0x1180
[ 352.804344] ? _copy_from_iter_nocache+0x800/0x800
[ 352.805202] ? kasan_set_track+0x1c/0x30
[ 352.805900] netlink_rcv_skb+0xc6/0x1f0
[ 352.806587] ? rht_deferred_worker+0x6b0/0x6b0
[ 352.807455] ? rtnl_calcit.isra.0+0x1f0/0x1f0
[ 352.808324] ? netlink_ack+0x4d0/0x4d0
[ 352.809086] ? netlink_deliver_tap+0x62/0x3d0
[ 352.809951] netlink_unicast+0x353/0x480
[ 352.810744] ? netlink_attachskb+0x430/0x430
[ 352.811586] ? __alloc_skb+0xd7/0x200
[ 352.812349] netlink_sendmsg+0x396/0x680
[ 352.813132] ? netlink_unicast+0x480/0x480
[ 352.813952] ? __import_iovec+0x192/0x210
[ 352.814759] ? netlink_unicast+0x480/0x480
[ 352.815580] sock_sendmsg+0x6c/0x80
[ 352.816299] ____sys_sendmsg+0x3a5/0x3c0
[ 352.817096] ? kernel_sendmsg+0x30/0x30
[ 352.817873] ? __ia32_sys_recvmmsg+0x150/0x150
[ 352.818753] ___sys_sendmsg+0xd8/0x140
[ 352.819518] ? sendmsg_copy_msghdr+0x110/0x110
[ 352.820402] ? ___sys_recvmsg+0xf4/0x1a0
[ 352.821110] ? __copy_msghdr_from_user+0x260/0x260
[ 352.821934] ? _raw_spin_lock+0x81/0xd0
[ 352.822680] ? __handle_mm_fault+0xef3/0x1b20
[ 352.823549] ? rb_insert_color+0x2a/0x270
[ 352.824373] ? copy_page_range+0x16b0/0x16b0
[ 352.825209] ? perf_event_update_userpage+0x2d0/0x2d0
[ 352.826190] ? __fget_light+0xd9/0xf0
[ 352.826941] __sys_sendmsg+0xb3/0x130
[ 352.827613] ? __sys_sendmsg_sock+0x20/0x20
[ 352.828377] ? do_user_addr_fault+0x2c5/0x8a0
[ 352.829184] ? fpregs_assert_state_consistent+0x52/0x60
[ 352.830001] ? exit_to_user_mode_prepare+0x32/0x160
[ 352.830845] do_syscall_64+0x35/0x80
[ 352.831445] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 352.832331] RIP: 0033:0x7f7bee973c17
[ 352.833078] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 352.836202] RSP: 002b:00007ffcbb368e28 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 352.837524] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7bee973c17
[ 352.838715] RDX: 0000000000000000 RSI: 00007ffcbb368e50 RDI: 0000000000000003
[ 352.839838] RBP: 00007ffcbb36d090 R08: 00000000cea96d79 R09: 00007f7beea34a40
[ 352.841021] R10: 00000000004059bb R11: 0000000000000246 R12: 000000000046563f
[ 352.842208] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcbb36d088
[ 352.843784] Allocated by task 2960:
[ 352.844451] kasan_save_stack+0x1b/0x40
[ 352.845173] __kasan_kmalloc+0x7c/0x90
[ 352.845873] fl_change+0x282/0x22db [cls_flower]
[ 352.846696] tc_new_tfilter+0x6cf/0x1180
[ 352.847493] rtnetlink_rcv_msg+0x471/0x550
[ 352.848323] netlink_rcv_skb+0xc6/0x1f0
[ 352.849097] netlink_unicast+0x353/0x480
[ 352.849886] netlink_sendmsg+0x396/0x680
[ 352.850678] sock_sendmsg+0x6c/0x80
[ 352.851398] ____sys_sendmsg+0x3a5/0x3c0
[ 352.852202] ___sys_sendmsg+0xd8/0x140
[ 352.852967] __sys_sendmsg+0xb3/0x130
[ 352.853718] do_syscall_64+0x35/0x80
[ 352.854457] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 352.855830] Freed by task 7:
[ 352.856421] kasan_save_stack+0x1b/0x40
[ 352.857139] kasan_set_track+0x1c/0x30
[ 352.857854] kasan_set_free_info+0x20/0x30
[ 352.858609] __kasan_slab_free+0xed/0x130
[ 352.859348] kfree+0xa7/0x3c0
[ 352.859951] process_one_work+0x44d/0x780
[ 352.860685] worker_thread+0x2e2/0x7e0
[ 352.861390] kthread+0x1f4/0x220
[ 352.862022] ret_from_fork+0x1f/0x30
[ 352.862955] Last potentially related work creation:
[ 352.863758] kasan_save_stack+0x1b/0x40
[ 352.864378] kasan_record_aux_stack+0xab/0xc0
[ 352.865028] insert_work+0x30/0x160
[ 352.865617] __queue_work+0x351/0x670
[ 352.866261] rcu_work_rcufn+0x30/0x40
[ 352.866917] rcu_core+0x3b2/0xdb0
[ 352.867561] __do_softirq+0xf6/0x386
[ 352.868708] Second to last potentially related work creation:
[ 352.869779] kasan_save_stack+0x1b/0x40
[ 352.870560] kasan_record_aux_stack+0xab/0xc0
[ 352.871426] call_rcu+0x5f/0x5c0
[ 352.872108] queue_rcu_work+0x44/0x50
[ 352.872855] __fl_put+0x17c/0x240 [cls_flower]
[ 352.873733] fl_delete+0xc7/0x100 [cls_flower]
[ 352.874607] tc_del_tfilter+0x510/0xb30
[ 352.886085] rtnetlink_rcv_msg+0x471/0x550
[ 352.886875] netlink_rcv_skb+0xc6/0x1f0
[ 352.887636] netlink_unicast+0x353/0x480
[ 352.888285] netlink_sendmsg+0x396/0x680
[ 352.888942] sock_sendmsg+0x6c/0x80
[ 352.889583] ____sys_sendmsg+0x3a5/0x3c0
[ 352.890311] ___sys_sendmsg+0xd8/0x140
[ 352.891019] __sys_sendmsg+0xb3/0x130
[ 352.891716] do_syscall_64+0x35/0x80
[ 352.892395] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 352.893666] The buggy address belongs to the object at ffff8881c8251000
which belongs to the cache kmalloc-2k of size 2048
[ 352.895696] The buggy address is located 1152 bytes inside of
2048-byte region [ffff8881c8251000, ffff8881c8251800)
[ 352.897640] The buggy address belongs to the page:
[ 352.898492] page:00000000213bac35 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1c8250
[ 352.900110] head:00000000213bac35 order:3 compound_mapcount:0 compound_pincount:0
[ 352.901541] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
[ 352.902908] raw: 002ffff800010200 0000000000000000 dead000000000122 ffff888100042f00
[ 352.904391] raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
[ 352.905861] page dumped because: kasan: bad access detected
[ 352.907323] Memory state around the buggy address:
[ 352.908218] ffff8881c8251380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 352.909471] ffff8881c8251400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 352.910735] >ffff8881c8251480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 352.912012] ^
[ 352.912642] ffff8881c8251500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 352.913919] ffff8881c8251580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 352.915185] ==================================================================
Fixes: d39d714969cd ("idr: introduce idr_for_each_entry_continue_ul()")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Syzkaller reported a false positive deadlock involving
the nl socket lock and the subflow socket lock:
MPTCP: kernel_bind error, err=-98
============================================
WARNING: possible recursive locking detected
5.15.0-rc1-syzkaller #0 Not tainted
--------------------------------------------
syz-executor998/6520 is trying to acquire lock:
ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
but task is already holding lock:
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(k-sk_lock-AF_INET);
lock(k-sk_lock-AF_INET);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by syz-executor998/6520:
#0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
#1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
#1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
#2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
#2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
stack backtrace:
CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
check_deadlock kernel/locking/lockdep.c:2987 [inline]
validate_chain kernel/locking/lockdep.c:3776 [inline]
__lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
lock_sock_fast+0x36/0x100 net/core/sock.c:3229
mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release net/socket.c:649 [inline]
sock_release+0x87/0x1b0 net/socket.c:677
mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
kernel_sendpage net/socket.c:3501 [inline]
sock_sendpage+0xe5/0x140 net/socket.c:1003
pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
splice_from_pipe_feed fs/splice.c:418 [inline]
__splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
splice_from_pipe fs/splice.c:597 [inline]
generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
do_splice_from fs/splice.c:767 [inline]
direct_splice_actor+0x110/0x180 fs/splice.c:936
splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
do_splice_direct+0x1b3/0x280 fs/splice.c:979
do_sendfile+0xae9/0x1240 fs/read_write.c:1249
__do_sys_sendfile64 fs/read_write.c:1314 [inline]
__se_sys_sendfile64 fs/read_write.c:1300 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f215cb69969
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
the problem originates from uncorrect lock annotation in the mptcp
code and is only visible since commit 2dcb96bacce3 ("net: core: Correct
the sock::sk_lock.owned lockdep annotations"), but is present since
the port-based endpoint support initial implementation.
This patch addresses the issue introducing a nested variant of
lock_sock_fast() and using it in the relevant code path.
Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port")
Fixes: 2dcb96bacce3 ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rework the CPU selection in the migration worker to ensure the specified
number of migrations are performed when the test iteslf is affined to a
subset of CPUs. The existing logic skips iterations if the target CPU is
not in the original set of possible CPUs, which causes the test to fail
if too many iterations are skipped.
==== Test Assertion Failure ====
rseq_test.c:228: i > (NR_TASK_MIGRATIONS / 2)
pid=10127 tid=10127 errno=4 - Interrupted system call
1 0x00000000004018e5: main at rseq_test.c:227
2 0x00007fcc8fc66bf6: ?? ??:0
3 0x0000000000401959: _start at ??:?
Only performed 4 KVM_RUNs, task stalled too much?
Calculate the min/max possible CPUs as a cheap "best effort" to avoid
high runtimes when the test is affined to a small percentage of CPUs.
Alternatively, a list or xarray of the possible CPUs could be used, but
even in a horrendously inefficient setup, such optimizations are not
needed because the runtime is completely dominated by the cost of
migrating the task, and the absolute runtime is well under a minute in
even truly absurd setups, e.g. running on a subset of vCPUs in a VM that
is heavily overcommited (16 vCPUs per pCPU).
Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs")
Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210929234112.1862848-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Check whether a CPUID entry's index is significant before checking for a
matching index to hack-a-fix an undefined behavior bug due to consuming
uninitialized data. RESET/INIT emulation uses kvm_cpuid() to retrieve
CPUID.0x1, which does _not_ have a significant index, and fails to
initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
ECX, a.k.a. index, input.
Practically speaking, it's _extremely_ unlikely any compiler will yield
code that causes problems, as the compiler would need to inline the
kvm_cpuid() call to detect the uninitialized data, and intentionally hose
the kernel, e.g. insert ud2, instead of simply ignoring the result of
the index comparison.
Although the sketchy "dummy" pattern was introduced in SVM by commit
66f7b72e1171 ("KVM: x86: Make register state after reset conform to
specification"), it wasn't actually broken until commit 7ff6c0350315
("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
order of operations such that "index" was checked before the significant
flag.
Avoid consuming uninitialized data by reverting to checking the flag
before the index purely so that the fix can be easily backported; the
offending RESET/INIT code has been refactored, moved, and consolidated
from vendor code to common x86 since the bug was introduced. A future
patch will directly address the bad RESET/INIT behavior.
The undefined behavior was detected by syzbot + KernelMemorySanitizer.
BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68
BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103
BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline]
kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline]
kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885
kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534
vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788
kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020
Local variable ----dummy@kvm_vcpu_reset created at:
kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 2a24be79b6b7 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping")
Fixes: 7ff6c0350315 ("KVM: x86: Remove stateful CPUID handling")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
hv_clock is preallocated to have only HVC_BOOT_ARRAY_SIZE (64) elements;
if the PTP_SYS_OFFSET_PRECISE ioctl is executed on vCPUs whose index is
64 of higher, retrieving the struct pvclock_vcpu_time_info pointer with
"src = &hv_clock[cpu].pvti" will result in an out-of-bounds access and
a wild pointer. Change it to "this_cpu_pvti()" which is guaranteed to
be valid.
Fixes: 95a3d4454bb1 ("Switch kvmclock data to a PER_CPU variable")
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-3-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
There're other modules might use hv_clock_per_cpu variable like ptp_kvm,
so move it into kvmclock.h and export the symbol to make it visiable to
other modules.
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
commit fad7cd3310db ("nbd: add the check to prevent overflow in
__nbd_ioctl()") raised an issue from the fallback helpers added in
commit f0907827a8a9 ("compiler.h: enable builtin overflow checkers and
add fallback code")
ERROR: modpost: "__divdi3" [drivers/block/nbd.ko] undefined!
As Stephen Rothwell notes:
The added check_mul_overflow() call is being passed 64 bit values.
COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW is not set for this build (see
include/linux/overflow.h).
Specifically, the helpers for checking whether the results of a
multiplication overflowed (__unsigned_mul_overflow,
__signed_add_overflow) use the division operator when
!COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW. This is problematic for 64b
operands on 32b hosts.
This was fixed upstream by
commit 76ae847497bc ("Documentation: raise minimum supported version of
GCC to 5.1")
which is not suitable to be backported to stable.
Further, __builtin_mul_overflow() would emit a libcall to a
compiler-rt-only symbol when compiling with clang < 14 for 32b targets.
ld.lld: error: undefined symbol: __mulodi4
In order to keep stable buildable with GCC 4.9 and clang < 14, modify
struct nbd_config to instead track the number of bits of the block size;
reconstructing the block size using runtime checked shifts that are not
problematic for those compilers and in a ways that can be backported to
stable.
In nbd_set_size, we do validate that the value of blksize must be a
power of two (POT) and is in the range of [512, PAGE_SIZE] (both
inclusive).
This does modify the debugfs interface.
Cc: stable@vger.kernel.org
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Link: https://lore.kernel.org/all/20210909182525.372ee687@canb.auug.org.au/
Link: https://lore.kernel.org/stable/CAHk-=whiQBofgis_rkniz8GBP9wZtSZdcDEffgSLO62BUGV3gg@mail.gmail.com/
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: Kees Cook <keescook@chromium.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210920232533.4092046-1-ndesaulniers@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|