| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
Hidden gendisks never hash the block device inode, so this can't happen.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
The GENHD_FL_NO_PART_SCAN controls more than just partitions canning,
so rename it to GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
Unify the functionality that implements a partition rescan for a
gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
disk_max_parts never returns 0 given that ->minors for devices not using
the extended dev_t must be non-zero, and disk_max_parts always returns
DISK_MAX_PARTS for the latter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GENHD_FL_CD marks a gendisk as a vaguely CD-ROM like device.
Besides being used internally inside of sunvdc.c an xen-blkfront it
is used by xen-blkback as a hint to claim a device exported to a
guest is a CD-ROM like device. Just check for disk->cdi instead
which is the right indicator for "real" CD-ROM or DVD drivers. This
will miss the paravirtualized guest drivers, but those make little
sense to report anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE is all about the event reporting
mechanism, so move it to the event_flags field.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
The flag to indicate an unlocked native capacity is dynamic state,
not a driver capability flag, so move it to disk->state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
All request based code is in the blk-mq files now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
This function is only used by the request completion path. Factor out
a blk_status_to_str to keep blk_errors private in blk-core.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
blk_dump_rq_flags deals with a request, so move it to blk-mq.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
These are only used for request based I/O, so move them where they are
used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
Keep all the request based code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
blk_rq_init deals with a request structure, so move it to blk-mq.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
Keep all the request based code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
Move blk_mq_flush_plug_list and blk_mq_plug_issue_direct down in blk-mq.c
to prepare for marking blk_mq_request_issue_directly static without the
need of a forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
All this code is tightly coupled to the blk-mq core, so move it
there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-4-hch@lst.de
[axboe: remove doc generation for blk-exec.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
| |
This function is trivial, and flush_dcache_page is always defined, so
just open code it in the 2.5 callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
| |
blk_rq_err_bytes is only used by the scsi midlayer, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull vhost,virtio,vdpa bugfixes from Michael Tsirkin:
"Misc fixes all over the place.
Revert of virtio used length validation series: the approach taken
does not seem to work, breaking too many guests in the process. We'll
need to do length validation using some other approach"
[ This merge also ends up reverting commit f7a36b03a732 ("vsock/virtio:
suppress used length validation"), which came in through the
networking tree in the meantime, and was part of that whole used
length validation series - Linus ]
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa_sim: avoid putting an uninitialized iova_domain
vhost-vdpa: clean irqs before reseting vdpa device
virtio-blk: modify the value type of num in virtio_queue_rq()
vhost/vsock: cleanup removing `len` variable
vhost/vsock: fix incorrect used length reported to the guest
Revert "virtio_ring: validate used buffer length"
Revert "virtio-net: don't let virtio core to validate used length"
Revert "virtio-blk: don't let virtio core to validate used length"
Revert "virtio-scsi: don't let virtio core to validate used buffer length"
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The system will crash if we put an uninitialized iova_domain, this
could happen when an error occurs before initializing the iova_domain
in vdpasim_create().
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:__cpuhp_state_remove_instance+0x96/0x1c0
...
Call Trace:
<TASK>
put_iova_domain+0x29/0x220
vdpasim_free+0xd1/0x120 [vdpa_sim]
vdpa_release_dev+0x21/0x40 [vdpa]
device_release+0x33/0x90
kobject_release+0x63/0x160
vdpasim_create+0x127/0x2a0 [vdpa_sim]
vdpasim_net_dev_add+0x7d/0xfe [vdpa_sim_net]
vdpa_nl_cmd_dev_add_set_doit+0xe1/0x1a0 [vdpa]
genl_family_rcv_msg_doit+0x112/0x140
genl_rcv_msg+0xdf/0x1d0
...
So we must make sure the iova_domain is already initialized before
put it.
In addition, we may get the following warning in this case:
WARNING: ... drivers/iommu/iova.c:344 iova_cache_put+0x58/0x70
So we must make sure the iova_cache_put() is invoked only if the
iova_cache_get() is already invoked. Let's fix it together.
Cc: stable@vger.kernel.org
Fixes: 4080fc106750 ("vdpa_sim: use iova module to allocate IOVA addresses")
Signed-off-by: Longpeng <longpeng2@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20211124015215.119-1-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Vdpa devices should be reset after unseting irqs of virtqueues, or we
will get errors when killing qemu process:
>> pi_update_irte: failed to update PI IRTE
>> irq bypass consumer (token 0000000065102a43) unregistration fails: -22
Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
Link: https://lore.kernel.org/r/a2cb60cf73be9da5c4e6399242117d8818f975ae.1636946171.git.wuzongyong@linux.alibaba.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This was found by coccicheck:
./drivers/block/virtio_blk.c, 334, 14-17, WARNING Unsigned expression
compared with zero num < 0
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Ye Guojin <ye.guojin@zte.com.cn>
Link: https://lore.kernel.org/r/20211117063955.160777-1-ye.guojin@zte.com.cn
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 02746e26c39e ("virtio-blk: avoid preallocating big SGL for data")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We can increment `total_len` directly and remove `len` since it
is no longer used for vhost_add_used().
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20211122163525.294024-3-sgarzare@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The "used length" reported by calling vhost_add_used() must be the
number of bytes written by the device (using "in" buffers).
In vhost_vsock_handle_tx_kick() the device only reads the guest
buffers (they are all "out" buffers), without writing anything,
so we must pass 0 as "used length" to comply virtio spec.
Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
Cc: stable@vger.kernel.org
Reported-by: Halil Pasic <pasic@linux.ibm.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20211122163525.294024-2-sgarzare@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit 939779f5152d161b34f612af29e7dc1ac4472fcf.
Attempts to validate length in the core did not work out: there turn out
to exist multiple broken devices, and in particular legacy devices are
known to be broken in this respect.
We have ideas for handling this better in the next version but for now
let's revert to a known good state to make sure drivers work for people.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit 816625c13652cef5b2c49082d652875da6f2ad7a.
Attempts to validate length in the core did not work out.
We'll drop them, so revert the dependent changes in drivers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit a40392edf1b2c7822bc0ce68413106661a9d4232.
Attempts to validate length in the core did not work out.
We'll drop them, so revert the dependent changes in drivers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit c57911ebfbfe745cb95da2bcf547c5bae000590f.
Attempts to validate length in the core did not work out. We'll drop
them for now, so revert the dependent changes in drivers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build fix from Thomas Gleixner:
"A single fix for a missing __init annotation of prepare_command_line()"
* tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Mark prepare_command_line() __init
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Fix:
WARNING: modpost: vmlinux.o(.text.unlikely+0x64d0): Section mismatch in reference \
from the function prepare_command_line() to the variable .init.data:command_line
The function prepare_command_line() references
the variable __initdata command_line.
This is often because prepare_command_line lacks a __initdata
annotation or the annotation of command_line is wrong.
Apparently some toolchains do different inlining decisions.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YZySgpmBcNNM2qca@zn.tnic
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix to ensure that there is no stale KASAN shadow
state left on the idle task's stack when a CPU is brought up after it
was brought down before"
* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/scs: Reset task stack state in bringup_cpu()
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
To hot unplug a CPU, the idle task on that CPU calls a few layers of C
code before finally leaving the kernel. When KASAN is in use, poisoned
shadow is left around for each of the active stack frames, and when
shadow call stacks are in use. When shadow call stacks (SCS) are in use
the task's saved SCS SP is left pointing at an arbitrary point within
the task's shadow call stack.
When a CPU is offlined than onlined back into the kernel, this stale
state can adversely affect execution. Stale KASAN shadow can alias new
stackframes and result in bogus KASAN warnings. A stale SCS SP is
effectively a memory leak, and prevents a portion of the shadow call
stack being used. Across a number of hotplug cycles the idle task's
entire shadow call stack can become unusable.
We previously fixed the KASAN issue in commit:
e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug")
... by removing any stale KASAN stack poison immediately prior to
onlining a CPU.
Subsequently in commit:
f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")
... the refactoring left the KASAN and SCS cleanup in one-time idle
thread initialization code rather than something invoked prior to each
CPU being onlined, breaking both as above.
We fixed SCS (but not KASAN) in commit:
63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit")
... but as this runs in the context of the idle task being offlined it's
potentially fragile.
To fix these consistently and more robustly, reset the SCS SP and KASAN
shadow of a CPU's idle task immediately before we online that CPU in
bringup_cpu(). This ensures the idle task always has a consistent state
when it is running, and removes the need to so so when exiting an idle
task.
Whenever any thread is created, dup_task_struct() will give the task a
stack which is free of KASAN shadow, and initialize the task's SCS SP,
so there's no need to specially initialize either for idle thread within
init_idle(), as this was only necessary to handle hotplug cycles.
I've tested this on arm64 with:
* gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
* clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
... offlining and onlining CPUS with:
| while true; do
| for C in /sys/devices/system/cpu/cpu*/online; do
| echo 0 > $C;
| echo 1 > $C;
| done
| done
Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"A single fix for perf to prevent it from sending SIGTRAP to another
task from a trace point event as it's not possible to deliver a
synchronous signal to a different task from there"
* tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Ignore sigtrap for tracepoints destined for other tasks
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
syzbot reported that the warning in perf_sigtrap() fires, saying that
the event's task does not match current:
| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| Modules linked in:
| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| ...
| Call Trace:
| <IRQ>
| irq_work_single+0x106/0x220 kernel/irq_work.c:211
| irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
| irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
| __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
| sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
| </IRQ>
| <TASK>
| asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
| ...
| coredump_task_exit kernel/exit.c:371 [inline]
| do_exit+0x1865/0x25c0 kernel/exit.c:771
| do_group_exit+0xe7/0x290 kernel/exit.c:929
| get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
| arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
| handle_signal_work kernel/entry/common.c:148 [inline]
| exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
| exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
| __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
| syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
| do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
| entry_SYSCALL_64_after_hwframe+0x44/0xae
On x86 this shouldn't happen, which has arch_irq_work_raise().
The test program sets up a perf event with sigtrap set to fire on the
'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().
This happened because the 'sched_wakeup' tracepoint also takes a task
argument passed on to perf_tp_event(), which is used to deliver the
event to that other task.
Since we cannot deliver synchronous signals to other tasks, skip an event if
perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
set, which will avoid ever entering perf_sigtrap() for such events.
Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two regression fixes for reader writer semaphores:
- Plug a race in the lock handoff which is caused by inconsistency of
the reader and writer path and can lead to corruption of the
underlying counter.
- down_read_trylock() is suboptimal when the lock is contended and
multiple readers trylock concurrently. That's due to the initial
value being read non-atomically which results in at least two
compare exchange loops. Making the initial readout atomic reduces
this significantly. Whith 40 readers by 11% in a benchmark which
enforces contention on mmap_sem"
* tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Optimize down_read_trylock() under highly contended case
locking/rwsem: Make handoff bit handling more consistent
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.
```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>
volatile int mutex;
void trigger(int cpu, char* ptr, std::size_t sz)
{
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);
while (mutex);
for (std::size_t i = 0; i < sz; i += 4096) {
*ptr = '\0';
ptr += 4096;
}
}
int main(int argc, char* argv[])
{
std::size_t sz = 100;
if (argc > 1)
sz = atoi(argv[1]);
auto nproc = std::thread::hardware_concurrency();
std::vector<std::thread> thr;
sz <<= 30;
auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_PRIVATE, -1, 0);
assert(ptr != MAP_FAILED);
char* cptr = static_cast<char*>(ptr);
auto run = sz / nproc;
run = (run >> 12) << 12;
mutex = 1;
for (auto i = 0U; i < nproc; ++i) {
thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
cptr += run;
}
rusage usage_start;
getrusage(RUSAGE_SELF, &usage_start);
auto start = std::chrono::system_clock::now();
mutex = 0;
for (auto& t : thr)
t.join();
rusage usage_end;
getrusage(RUSAGE_SELF, &usage_end);
auto end = std::chrono::system_clock::now();
timeval utime;
timeval stime;
timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
printf("real: %lu\n",
std::chrono::duration_cast<std::chrono::milliseconds>(end -
start).count());
return 0;
}
```
The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.
25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
8.61% [kernel] [k] clear_page_erms
3.89% [kernel] [k] __do_page_fault
The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.
91.89 │ lock cmpxchg %rdx,(%rdi)
Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3aed ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.
By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:
Before Patch After Patch
# of Threads real real reduced by
------------ ------ ------ ----------
1 65,373 65,206 ~0.0%
4 15,467 15,378 ~0.5%
40 6,214 5,528 ~11.0%
For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There are some inconsistency in the way that the handoff bit is being
handled in readers and writers that lead to a race condition.
Firstly, when a queue head writer set the handoff bit, it will clear
it when the writer is being killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.
Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially make the handoff bit set with empty wait queue.
Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.
To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.
The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty to avoid possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.
While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.
Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull another tracing fix from Steven Rostedt:
"Fix the fix of pid filtering
The setting of the pid filtering flag tested the "trace only this pid"
case twice, and ignored the "trace everything but this pid" case.
The 5.15 kernel does things a little differently due to the new sparse
pid mask introduced in 5.16, and as the bug was discovered running the
5.15 kernel, and the first fix was initially done for that kernel,
that fix handled both cases (only pid and all but pid), but the
forward port to 5.16 created this bug"
* tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Test the 'Do not trace this pid' case in create event
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When creating a new event (via a module, kprobe, eprobe, etc), the
descriptors that are created must add flags for pid filtering if an
instance has pid filtering enabled, as the flags are used at the time the
event is executed to know if pid filtering should be done or not.
The "Only trace this pid" case was added, but a cut and paste error made
that case checked twice, instead of checking the "Trace all but this pid"
case.
Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/
Fixes: 6cb206508b62 ("tracing: Check pid filtering when creating events")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Intel VT-d fixes:
- Remove unused PASID_DISABLED
- Fix RCU locking
- Fix for the unmap_pages call-back
- Rockchip RK3568 address mask fix
- AMD IOMMUv2 log message clarification
* tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix unmap_pages support
iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()
iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568
iommu/amd: Clarify AMD IOMMUv2 initialization messages
iommu/vt-d: Remove unused PASID_DISABLED
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When supporting only the .map and .unmap callbacks of iommu_ops,
the IOMMU driver can make assumptions about the size and alignment
used for mappings based on the driver provided pgsize_bitmap. VT-d
previously used essentially PAGE_MASK for this bitmap as any power
of two mapping was acceptably filled by native page sizes.
However, with the .map_pages and .unmap_pages interface we're now
getting page-size and count arguments. If we simply combine these
as (page-size * count) and make use of the previous map/unmap
functions internally, any size and alignment assumptions are very
different.
As an example, a given vfio device assignment VM will often create
a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff]. On a system that
does not support IOMMU super pages, the unmap_pages interface will
ask to unmap 1024 4KB pages at the base IOVA. dma_pte_clear_level()
will recurse down to level 2 of the page table where the first half
of the pfn range exactly matches the entire pte level. We clear the
pte, increment the pfn by the level size, but (oops) the next pte is
on a new page, so we exit the loop an pop back up a level. When we
then update the pfn based on that higher level, we seem to assume
that the previous pfn value was at the start of the level. In this
case the level size is 256K pfns, which we add to the base pfn and
get a results of 0x7fe00, which is clearly greater than 0x401ff,
so we're done. Meanwhile we never cleared the ptes for the remainder
of the range. When the VM remaps this range, we're overwriting valid
ptes and the VT-d driver complains loudly, as reported by the user
report linked below.
The fix for this seems relatively simple, if each iteration of the
loop in dma_pte_clear_level() is assumed to clear to the end of the
level pte page, then our next pfn should be calculated from level_pfn
rather than our working pfn.
Fixes: 3f34f1259776 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20211002124012.18186-1-ajaygargnsit@gmail.com/
Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If we return -EOPNOTSUPP, the rcu lock remains lock. This is spurious.
Go through the end of the function instead. This way, the missing
'rcu_read_unlock()' is called.
Fixes: 7afd7f6aa21a ("iommu/vt-d: Check FL and SL capability sanity in scalable mode")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/40cc077ca5f543614eab2a10e84d29dd190273f6.1636217517.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
With the submission of iommu driver for RK3568 a subtle bug was
introduced: PAGE_DESC_HI_MASK1 and PAGE_DESC_HI_MASK2 have to be
the other way arround - that leads to random errors, especially when
addresses beyond 32 bit are used.
Fix it.
Fixes: c55356c534aa ("iommu: rockchip: Add support for iommu v2")
Signed-off-by: Alex Bee <knaerzche@gmail.com>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Dan Johansen <strit@manjaro.org>
Reviewed-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Link: https://lore.kernel.org/r/20211124021325.858139-1-knaerzche@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The messages printed on the initialization of the AMD IOMMUv2 driver
have caused some confusion in the past. Clarify the messages to lower
the confusion in the future.
Cc: stable@vger.kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20211123105507.7654-3-joro@8bytes.org
|
| | |/
| |/|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The macro is unused after commit 00ecd5401349a so it can be removed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 00ecd5401349a ("iommu/vt-d: Clean up unused PASID updating functions")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull ksmbd fixes from Steve French:
"Five ksmbd server fixes, four of them for stable:
- memleak fix
- fix for default data stream on filesystems that don't support xattr
- error logging fix
- session setup fix
- minor doc cleanup"
* tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix memleak in get_file_stream_info()
ksmbd: contain default data stream even if xattr is empty
ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()
docs: filesystem: cifs: ksmbd: Fix small layout issues
ksmbd: Fix an error handling path in 'smb2_sess_setup()'
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Fix memleak in get_file_stream_info()
Fixes: 34061d6b76a4 ("ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests")
Cc: stable@vger.kernel.org # v5.15
Reported-by: Coverity Scan <scan-admin@coverity.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If xattr is not supported like exfat or fat, ksmbd server doesn't
contain default data stream in FILE_STREAM_INFORMATION response. It will
cause ppt or doc file update issue if local filesystem is such as ones.
This patch move goto statement to contain it.
Fixes: 9f6323311c70 ("ksmbd: add default data stream name in FILE_STREAM_INFORMATION")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
While file transfer through windows client, This error flood message
happen. This flood message will cause performance degradation and
misunderstand server has problem.
Fixes: e294f78d3478 ("ksmbd: allow PROTECTED_DACL_SECINFO and UNPROTECTED_DACL_SECINFO addition information in smb2 set info security")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|