| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| | |
Cross-merge networking fixes after downstream PR.
No conflicts (or adjacent changes of note).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, netfilter, BPF and WiFi.
I didn't collect precise data but feels like we've got a lot of 6.5
fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size, fix
struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use of
the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"
* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
MAINTAINERS: update Matthieu's email address
mptcp: userspace pm allow creating id 0 subflow
mptcp: fix delegated action races
net: stmmac: remove unneeded stmmac_poll_controller
net: lan743x: also select PHYLIB
net: ethernet: mediatek: disable irq before schedule napi
net: mana: Fix oversized sge0 for GSO packets
net: mana: Fix the tso_bytes calculation
net: mana: Fix TX CQE error handling
netlink: annotate data-races around sk->sk_err
sctp: update hb timer immediately after users change hb_interval
sctp: update transport state when processing a dupcook packet
tcp: fix delayed ACKs for MSS boundary condition
tcp: fix quick-ack counting to count actual ACKs of new data
page_pool: fix documentation typos
tipc: fix a potential deadlock on &tx->lock
net: stmmac: dwmac-stm32: fix resume on STM32 MCU
ipv4: Set offload_failed flag in fibmatch results
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
...
|
| | |\
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-10-02
We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).
The main changes are:
1) Fix BPF verifier to reset backtrack_state masks on global function
exit as otherwise subsequent precision tracking would reuse them,
from Andrii Nakryiko.
2) Several sockmap fixes for available bytes accounting,
from John Fastabend.
3) Reject sk_msg egress redirects to non-TCP sockets given this
is only supported for TCP sockets today, from Jakub Sitnicki.
4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
limits with BPF_F_BEFORE directive, from Daniel Borkmann
and Nikolay Aleksandrov.
5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
size_index for selecting a bpf_mem_cache, from Hou Tao.
6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
from Song Liu.
7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
from Leon Hwang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use kmalloc_size_roundup() to adjust size_index
selftest/bpf: Add various selftests for program limits
bpf, mprog: Fix maximum program check on mprog attachment
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
bpf, sockmap: Add tests for MSG_F_PEEK
bpf, sockmap: Do not inc copied_seq when PEEK flag set
bpf: tcp_read_skb needs to pop skb regardless of seq
bpf: unconditionally reset backtrack_state masks on global func exit
bpf: Fix tr dereferencing
selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================
Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of
KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as
reported by Nathan, the adjustment is not enough, because
__kmalloc_minalign() also decides the minimal alignment of slab object
as shown in new_kmalloc_cache() and its value may be greater than
KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM).
Instead of invoking __kmalloc_minalign() in bpf subsystem to find the
maximal alignment, just using kmalloc_size_roundup() directly to get the
corresponding slab object size for each allocation size. If these two
sizes are unmatched, adjust size_index to select a bpf_mem_cache with
unit_size equal to the object_size of the underlying slab cache for the
allocation size.
Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
After Paul's recent improvement to syzkaller to improve coverage for
bpf_mprog and tcx, it hit a splat that the program limit was surpassed.
What happened is that the maximum number of progs got added, followed
by another prog add request which adds with BPF_F_BEFORE flag relative
to the last program in the array. The idx >= bpf_mprog_max() check in
bpf_mprog_attach() still passes because the index is below the maximum
but the maximum will be surpassed. We need to add a check upfront for
insertions to catch this situation.
Fixes: 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs")
Reported-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com
Reported-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com
Reported-by: syzbot+2558ca3567a77b7af4e3@syzkaller.appspotmail.com
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com
Tested-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com
Link: https://github.com/google/syzkaller/pull/4207
Link: https://lore.kernel.org/bpf/20230929204121.20305-1-daniel@iogearbox.net
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In mark_chain_precision() logic, when we reach the entry to a global
func, it is expected that R1-R5 might be still requested to be marked
precise. This would correspond to some integer input arguments being
tracked as precise. This is all expected and handled as a special case.
What's not expected is that we'll leave backtrack_state structure with
some register bits set. This is because for subsequent precision
propagations backtrack_state is reused without clearing masks, as all
code paths are carefully written in a way to leave empty backtrack_state
with zeroed out masks, for speed.
The fix is trivial, we always clear register bit in the register mask, and
then, optionally, set reg->precise if register is SCALAR_VALUE type.
Reported-by: Chris Mason <clm@meta.com>
Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| |\ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Fourteen hotfixes, eleven of which are cc:stable. The remainder
pertain to issues which were introduced after 6.5"
* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Crash: add lock to serialize crash hotplug handling
selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
mm, memcg: reconsider kmem.limit_in_bytes deprecation
mm: zswap: fix potential memory corruption on duplicate store
arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
mm: hugetlb: add huge page size param to set_huge_pte_at()
maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
maple_tree: add mas_is_active() to detect in-tree walks
nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
mm: abstract moving to the next PFN
mm: report success more often from filemap_map_folio_range()
fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Eric reported that handling corresponding crash hotplug event can be
failed easily when many memory hotplug event are notified in a short
period. They failed because failing to take __kexec_lock.
=======
[ 78.714569] Fallback order for Node 0: 0
[ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886
[ 78.717133] Policy zone: Normal
[ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 80.056643] PEFILE: Unsigned PE binary
=======
The memory hotplug events are notified very quickly and very many, while
the handling of crash hotplug is much slower relatively. So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.
Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically. This doesn't impact the usage of __kexec_lock.
Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 247262756121 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
| |\ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a RT tasks related lockup/live-lock during CPU offlining"
* tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix live lock between select_fallback_rq() and RT push
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
During RCU-boost testing with the TREE03 rcutorture config, I found that
after a few hours, the machine locks up.
On tracing, I found that there is a live lock happening between 2 CPUs.
One CPU has an RT task running, while another CPU is being offlined
which also has an RT task running. During this offlining, all threads
are migrated. The migration thread is repeatedly scheduled to migrate
actively running tasks on the CPU being offlined. This results in a live
lock because select_fallback_rq() keeps picking the CPU that an RT task
is already running on only to get pushed back to the CPU being offlined.
It is anyway pointless to pick CPUs for pushing tasks to if they are
being offlined only to get migrated away to somewhere else. This could
also add unwanted latency to this task.
Fix these issues by not selecting CPUs in RT if they are not 'active'
for scheduling, using the cpu_active_mask. Other parts in core.c already
use cpu_active_mask to prevent tasks from being put on CPUs going
offline.
With this fix I ran the tests for days and could not reproduce the
hang. Without the patch, I hit it in a few hours.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
|
| |\ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Make sure 32-bit applications using user events have aligned access
when running on a 64-bit kernel.
- Add cond_resched in the loop that handles converting enums in
print_fmt string is trace events.
- Fix premature wake ups of polling processes in the tracing ring
buffer. When a task polls waiting for a percentage of the ring buffer
to be filled, the writer still will wake it up at every event. Add
the polling's percentage to the "shortest_full" list to tell the
writer when to wake it up.
- For eventfs dir lookups on dynamic events, an event system's only
event could be removed, leaving its dentry with no children. This is
totally legitimate. But in eventfs_release() it must not access the
children array, as it is only allocated when the dentry has children.
* tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Test for dentries array allocated in eventfs_release()
tracing/user_events: Align set_bit() address for all archs
tracing: relax trace_event_eval_update() execution with cond_resched()
ring-buffer: Update "shortest_full" in polling
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.
Add a compat flag to user_event_enabler that indicates when a 32-bit
value is being used on a 64-bit kernel. Long align addresses and correct
the bit to be used by set_bit() to account for this alignment. Ensure
compat flags are copied during forks and used during deletion clears.
Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
Cc: stable@vger.kernel.org
Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
Reported-by: Clément Léger <cleger@rivosinc.com>
Suggested-by: Clément Léger <cleger@rivosinc.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
When kernel is compiled without preemption, the eval_map_work_func()
(which calls trace_event_eval_update()) will not be preempted up to its
complete execution. This can actually cause a problem since if another
CPU call stop_machine(), the call will have to wait for the
eval_map_work_func() function to finish executing in the workqueue
before being able to be scheduled. This problem was observe on a SMP
system at boot time, when the CPU calling the initcalls executed
clocksource_done_booting() which in the end calls stop_machine(). We
observed a 1 second delay because one CPU was executing
eval_map_work_func() and was not preempted by the stop_machine() task.
Adding a call to cond_resched() in trace_event_eval_update() allows
other tasks to be executed and thus continue working asynchronously
like before without blocking any pending task at boot time.
Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| | |/ / / /
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84ff by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.
The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.
Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.
To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.
Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.
Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| |\ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix the narea calculation in swiotlb initialization (Ross Lagerwall)
- fix the check whether a device has used swiotlb (Petr Tesarik)
* tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix the check whether a device has used software IO TLB
swiotlb: use the calculated number of areas
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
When CONFIG_SWIOTLB_DYNAMIC=y, devices which do not use the software IO TLB
can avoid swiotlb lookup. A flag is added by commit 1395706a1490 ("swiotlb:
search the software IO TLB only if the device makes use of it"), the flag
is correctly set, but it is then never checked. Add the actual check here.
Note that this code is an alternative to the default pool check, not an
additional check, because:
1. swiotlb_find_pool() also searches the default pool;
2. if dma_uses_io_tlb is false, the default swiotlb pool is not used.
Tested in a KVM guest against a QEMU RAM-backed SATA disk over virtio and
*not* using software IO TLB, this patch increases IOPS by approx 2% for
4-way parallel I/O.
The write memory barrier in swiotlb_dyn_alloc() is not needed, because a
newly allocated pool must always be observed by swiotlb_find_slots() before
an address from that pool is passed to is_swiotlb_buffer().
Correctness was verified using the following litmus test:
C swiotlb-new-pool
(*
* Result: Never
*
* Check that a newly allocated pool is always visible when the
* corresponding swiotlb buffer is visible.
*)
{
mem_pools = default;
}
P0(int **mem_pools, int *pool)
{
/* add_mem_pool() */
WRITE_ONCE(*pool, 999);
rcu_assign_pointer(*mem_pools, pool);
}
P1(int **mem_pools, int *flag, int *buf)
{
/* swiotlb_find_slots() */
int *r0;
int r1;
rcu_read_lock();
r0 = READ_ONCE(*mem_pools);
r1 = READ_ONCE(*r0);
rcu_read_unlock();
if (r1) {
WRITE_ONCE(*flag, 1);
smp_mb();
}
/* device driver (presumed) */
WRITE_ONCE(*buf, r1);
}
P2(int **mem_pools, int *flag, int *buf)
{
/* device driver (presumed) */
int r0 = READ_ONCE(*buf);
/* is_swiotlb_buffer() */
int r1;
int *r2;
int r3;
smp_rmb();
r1 = READ_ONCE(*flag);
if (r1) {
/* swiotlb_find_pool() */
rcu_read_lock();
r2 = READ_ONCE(*mem_pools);
r3 = READ_ONCE(*r2);
rcu_read_unlock();
}
}
exists (2:r0<>0 /\ 2:r3=0) (* Not found. *)
Fixes: 1395706a1490 ("swiotlb: search the software IO TLB only if the device makes use of it")
Reported-by: Jonathan Corbet <corbet@lwn.net>
Closes: https://lore.kernel.org/linux-iommu/87a5uz3ob8.fsf@meer.lwn.net/
Signed-off-by: Petr Tesarik <petr@tesarici.cz>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Commit 8ac04063354a ("swiotlb: reduce the number of areas to match
actual memory pool size") calculated the reduced number of areas in
swiotlb_init_remap() but didn't actually use the value. Replace usage of
default_nareas accordingly.
Fixes: 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |\ \ \ \ \ \
| | |_|/ / / /
| |/| | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Remove double allocation of wq_update_pod_attrs_buf
- Fix missing allocation of pwq_release_worker when
wq_cpu_intensive_thresh_us is set to a custom value
* tag 'wq-for-6.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init()
workqueue: Removed double allocation of wq_update_pod_attrs_buf
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
wq_cpu_intensive_thresh_init()
Currently, if the wq_cpu_intensive_thresh_us is set to specific
value, will cause the wq_cpu_intensive_thresh_init() early exit
and missed creation of pwq_release_worker. this commit therefore
create the pwq_release_worker in advance before checking the
wq_cpu_intensive_thresh_us.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 967b494e2fd1 ("workqueue: Use a kthread_worker to release pool_workqueues")
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
First commit 2930155b2e272 ("workqueue: Initialize unbound CPU pods later in
the boot") added the initialization of wq_update_pod_attrs_buf to
workqueue_init_early(), and then latter on, commit 84193c07105c6
("workqueue: Generalize unbound CPU pods") added it as well. This appeared
in a kmemleak run where the second allocation made the first allocation
leak.
Fixes: 84193c07105c6 ("workqueue: Generalize unbound CPU pods")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| |\ \ \ \ \ \
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the "bytes" output of the per_cpu stat file
The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
accounting was not accurate. It is suppose to show how many used
bytes are still in the ring buffer, but even when the ring buffer was
empty it would still show there were bytes used.
- Fix a bug in eventfs where reading a dynamic event directory (open)
and then creating a dynamic event that goes into that diretory screws
up the accounting.
On close, the newly created event dentry will get a "dput" without
ever having a "dget" done for it. The fix is to allocate an array on
dir open to save what dentries were actually "dget" on, and what ones
to "dput" on close.
* tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Remember what dentries were created on dir open
ring-buffer: Fix bytes info in per_cpu buffer stats
|
| | |/ / / / /
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
| |\ \ \ \ \ \
| | | |_|/ / /
| | |/| | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes, 10 of which pertain to post-6.5 issues. The other three
are cc:stable"
* tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
proc: nommu: fix empty /proc/<pid>/maps
filemap: add filemap_map_order0_folio() to handle order0 folio
proc: nommu: /proc/<pid>/maps: release mmap read lock
mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
pidfd: prevent a kernel-doc warning
argv_split: fix kernel-doc warnings
scatterlist: add missing function params to kernel-doc
selftests/proc: fixup proc-empty-vm test after KSM changes
revert "scripts/gdb/symbols: add specific ko module load command"
selftests: link libasan statically for tests with -fsanitize=address
task_work: add kerneldoc annotation for 'data' argument
mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Change the comment to match the function name that the SYSCALL_DEFINE()
macros generate to prevent a kernel-doc warning.
kernel/pid.c:628: warning: expecting prototype for pidfd_open(). Prototype was for sys_pidfd_open() instead
Link: https://lkml.kernel.org/r/20230912060822.2500-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
| | |/ / / /
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
A previous commit changed the arguments to task_work_cancel_match(), but
didn't document all of them.
Link: https://lkml.kernel.org/r/93938bff-baa3-4091-85f5-784aae297a07@kernel.dk
Fixes: c7aab1a7c52b ("task_work: add helper for more targeted task_work canceling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202309120307.zis3yQGe-lkp@intel.com/
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
| |\ \ \ \ \
| | |_|_|/ /
| |/| | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a PF_IDLE initialization bug that generated warnings on tiny-RCU"
* tag 'sched-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/sched: Modify initial boot task idle setup
|
| | |/ / /
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52ee.
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
|
|\| | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Cross-merge networking fixes after downstream PR.
No conflicts.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |\ \ \ \
| | |/ / /
| |/| | /
| | | |/
| | |/|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
Current release - regressions:
- bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
- netfilter: fix entries val in rule reset audit log
- eth: stmmac: fix incorrect rxq|txq_stats reference
Previous releases - regressions:
- ipv4: fix null-deref in ipv4_link_failure
- netfilter:
- fix several GC related issues
- fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
- eth: team: fix null-ptr-deref when team device type is changed
- eth: i40e: fix VF VLAN offloading when port VLAN is configured
- eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
Previous releases - always broken:
- core: fix ETH_P_1588 flow dissector
- mptcp: fix several connection hang-up conditions
- bpf:
- avoid deadlock when using queue and stack maps from NMI
- add override check to kprobe multi link attach
- hsr: properly parse HSRv1 supervisor frames.
- eth: igc: fix infinite initialization loop with early XDP redirect
- eth: octeon_ep: fix tx dma unmap len values in SG
- eth: hns3: fix GRE checksum offload issue"
* tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
sfc: handle error pointers returned by rhashtable_lookup_get_insert_fast()
igc: Expose tx-usecs coalesce setting to user
octeontx2-pf: Do xdp_do_flush() after redirects.
bnxt_en: Flush XDP for bnxt_poll_nitroa0()'s NAPI
net: ena: Flush XDP packets on error.
net/handshake: Fix memory leak in __sock_create() and sock_alloc_file()
net: hinic: Fix warning-hinic_set_vlan_fliter() warn: variable dereferenced before check 'hwdev'
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
netfilter: nf_tables: fix memleak when more than 255 elements expired
netfilter: nf_tables: disable toggling dormant table state more than once
vxlan: Add missing entries to vxlan_get_size()
net: rds: Fix possible NULL-pointer dereference
team: fix null-ptr-deref when team device type is changed
net: bridge: use DEV_STATS_INC()
net: hns3: add 5ms delay before clear firmware reset irq source
net: hns3: fix fail to delete tc flower rules during reset issue
net: hns3: only enable unicast promisc when mac table full
net: hns3: fix GRE checksum offload issue
net: hns3: add cmdq check for vf periodic service task
net: stmmac: fix incorrect rxq|txq_stats reference
...
|
| | |\ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Alexei Starovoitov says:
====================
The following pull-request contains BPF updates for your *net* tree.
We've added 21 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 450 insertions(+), 36 deletions(-).
The main changes are:
1) Adjust bpf_mem_alloc buckets to match ksize(), from Hou Tao.
2) Check whether override is allowed in kprobe mult, from Jiri Olsa.
3) Fix btf_id symbol generation with ld.lld, from Jiri and Nick.
4) Fix potential deadlock when using queue and stack maps from NMI, from Toke Høiland-Jørgensen.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Alan Maguire, Biju Das, Björn Töpel, Dan Carpenter, Daniel Borkmann,
Eduard Zingerman, Hsin-Wei Hung, Marcus Seyfarth, Nathan Chancellor,
Satya Durga Srinivasu Prabhala, Song Liu, Stephen Rothwell
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Dan reported Smatch static checker warning due to missing error
value set in uprobe multi link's get_pid_task error path.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/c5ffa7c0-6b06-40d5-aca2-63833b5cd9af@moroto.mountain/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230915101420.1193800-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
For global per-cpu allocator, the size of free object in free list
doesn't match with unit_size and now there is no way to get the size of
per-cpu pointer saved in free object, so just skip the checking.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/bpf/20230913133436.0eeec4cb@canb.auug.org.au/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Link: https://lore.kernel.org/r/20230913135943.3137292-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fix missing or extra function parameter kernel-doc warnings
in cgroup.c:
kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace'
kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
snprintf() does not return negative error code on error, it returns the
number of characters which *would* be generated for the given input.
Fix the error handling check.
Fixes: 57539b1c0ac2 ("bpf: Enable annotating trusted nested pointers")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/393bdebc87b22563c08ace094defa7160eb7a6c0.1694190795.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fix for a bug observable under the following sequence of events:
1. Create a network device that does not support XDP offload.
2. Load a device bound XDP program with BPF_F_XDP_DEV_BOUND_ONLY flag
(such programs are not offloaded).
3. Load a device bound XDP program with zero flags
(such programs are offloaded).
At step (2) __bpf_prog_dev_bound_init() associates with device (1)
a dummy bpf_offload_netdev struct with .offdev field set to NULL.
At step (3) __bpf_prog_dev_bound_init() would reuse dummy struct
allocated at step (2).
However, downstream usage of the bpf_offload_netdev assumes that
.offdev field can't be NULL, e.g. in bpf_prog_offload_verifier_prep().
Adjust __bpf_prog_dev_bound_init() to require bpf_offload_netdev
with non-NULL .offdev for offloaded BPF programs.
Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs")
Reported-by: syzbot+291100dcb32190ec02a8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000d97f3c060479c4f8@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230912005539.2248244-2-eddyz87@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Sysbot discovered that the queue and stack maps can deadlock if they are
being used from a BPF program that can be called from NMI context (such as
one that is attached to a perf HW counter event). To fix this, add an
in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if
grabbing the lock fails.
Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
Co-developed-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Add extra check in bpf_mem_alloc_init() to ensure the unit_size of
bpf_mem_cache is matched with the object_size of underlying slab cache.
If these two sizes are unmatched, print a warning once and return
-EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and
the potential issue can be prevented.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When the unit_size of a bpf_mem_cache is unmatched with the object_size
of the underlying slab cache, the bpf_mem_cache will not be used, and
the allocation will be redirected to a bpf_mem_cache with a bigger
unit_size instead, so there is no need to prefill for these
unused bpf_mem_caches.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The following warning was reported when running "./test_progs -a
link_api -a linked_list" on a RISC-V QEMU VM:
------------[ cut here ]------------
WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill
Modules linked in: bpf_testmod(OE)
CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2
Hardware name: riscv-virtio,qemu (DT)
epc : bpf_mem_refill+0x1fc/0x206
ra : irq_work_single+0x68/0x70
epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
t5 : ff6000008fd28278 t6 : 0000000000040000
[<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
[<ffffffff8015fe84>] irq_work_single+0x68/0x70
[<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
[<ffffffff8015fefa>] irq_work_run+0x38/0x66
[<ffffffff8000828a>] handle_IPI+0x3a/0xb4
[<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
[<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
[<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
[<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
[<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
[<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
[<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
[<ffffffff812b6904>] do_irq+0x66/0x98
---[ end trace 0000000000000000 ]---
The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in
free_bulk(). The direct reason is that a object is allocated and
freed by bpf_mem_caches with different unit_size.
The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes
slab cache in the specific VM. When linked_list test allocates a
72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it
from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is
backed by 128-bytes slab cache. When the object is freed, bpf_mem_free()
uses ksize() to choose the corresponding bpf_mem_cache. Because the
object is allocated from 128-bytes slab cache, ksize() returns 128,
bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and
triggers the warning.
A similar warning will also be reported when using CONFIG_SLAB instead
of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines
KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32.
An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to
choose a bpf_mem_cache which has the same unit_size with the backing
slab cache, but it may introduce performance degradation, so fix the
warning by adjusting the indexes in size_index according to the value of
KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does.
Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.")
Reported-by: Björn Töpel <bjorn@kernel.org>
Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Currently the multi_kprobe link attach does not check error
injection list for programs with bpf_override_return helper
and allows them to attach anywhere. Adding the missing check.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20230907200652.926951-1-jolsa@kernel.org
|
| |\ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a performance regression on large SMT systems, an Intel SMT4
balancing bug, and a topology setup bug on (Intel) hybrid processors"
* tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain
sched/fair: Fix SMT4 group_smt_balance handling
sched/fair: Optimize should_we_balance() for large SMT systems
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.
Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.
Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
should_we_balance() is called in load_balance() to find out if the CPU that
is trying to do the load balance is the right one or not.
With commit:
b1bfeab9b002("sched/fair: Consider the idle state of the whole core for load balance")
the code tries to find an idle core to do the load balancing
and falls back on an idle sibling CPU if there is no idle core.
However, on larger SMT systems, it could be needlessly iterating to find a
idle by scanning all the CPUs in an non-idle core. If the core is not idle,
and first SMT sibling which is idle has been found, then its not needed to
check other SMT siblings for idleness
Lets say in SMT4, Core0 has 0,2,4,6 and CPU0 is BUSY and rest are IDLE.
balancing domain is MC/DIE. CPU2 will be set as the first idle_smt and
same process would be repeated for CPU4 and CPU6 but this is unnecessary.
Since calling is_core_idle loops through all CPU's in the SMT mask, effect
is multiplied by weight of smt_mask. For example,when say 1 CPU is busy,
we would skip loop for 2 CPU's and skip iterating over 8CPU's. That
effect would be more in DIE/NUMA domain where there are more cores.
Testing and performance evaluation
==================================
The test has been done on this system which has 12 cores, i.e 24 small
cores with SMT=4:
lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Model name: POWER10 (architected), altivec supported
Thread(s) per core: 8
Used funclatency bcc tool to evaluate the time taken by should_we_balance(). For
base tip/sched/core the time taken is collected by making the
should_we_balance() noinline. time is in nanoseconds. The values are
collected by running the funclatency tracer for 60 seconds. values are
average of 3 such runs. This represents the expected reduced time with
patch.
tip/sched/core was at commit:
2f88c8e802c8 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
Results:
------------------------------------------------------------------------------
workload tip/sched/core with_patch(%gain)
------------------------------------------------------------------------------
idle system 809.3 695.0(16.45)
stress ng – 12 threads -l 100 1013.5 893.1(13.49)
stress ng – 24 threads -l 100 1073.5 980.0(9.54)
stress ng – 48 threads -l 100 683.0 641.0(6.55)
stress ng – 96 threads -l 100 2421.0 2300(5.26)
stress ng – 96 threads -l 15 375.5 377.5(-0.53)
stress ng – 96 threads -l 25 635.5 637.5(-0.31)
stress ng – 96 threads -l 35 934.0 891.0(4.83)
Ran schbench(old), hackbench and stress_ng to evaluate the workload
performance between tip/sched/core and with patch.
No modification to tip/sched/core
TL;DR:
Good improvement is seen with schbench. when hackbench and stress_ng
runs for longer good improvement is seen.
------------------------------------------------------------------------------
schbench(old) tip +patch(%gain)
10 iterations sched/core
------------------------------------------------------------------------------
1 Threads
50.0th: 8.00 9.00(-12.50)
75.0th: 9.60 9.00(6.25)
90.0th: 11.80 10.20(13.56)
95.0th: 12.60 10.40(17.46)
99.0th: 13.60 11.90(12.50)
99.5th: 14.10 12.60(10.64)
99.9th: 15.90 14.60(8.18)
2 Threads
50.0th: 9.90 9.20(7.07)
75.0th: 12.60 10.10(19.84)
90.0th: 15.50 12.00(22.58)
95.0th: 17.70 14.00(20.90)
99.0th: 21.20 16.90(20.28)
99.5th: 22.60 17.50(22.57)
99.9th: 30.40 19.40(36.18)
4 Threads
50.0th: 12.50 10.60(15.20)
75.0th: 15.30 12.00(21.57)
90.0th: 18.60 14.10(24.19)
95.0th: 21.30 16.20(23.94)
99.0th: 26.00 20.70(20.38)
99.5th: 27.60 22.50(18.48)
99.9th: 33.90 31.40(7.37)
8 Threads
50.0th: 16.30 14.30(12.27)
75.0th: 20.20 17.40(13.86)
90.0th: 24.50 21.90(10.61)
95.0th: 27.30 24.70(9.52)
99.0th: 35.00 31.20(10.86)
99.5th: 46.40 33.30(28.23)
99.9th: 89.30 57.50(35.61)
16 Threads
50.0th: 22.70 20.70(8.81)
75.0th: 30.10 27.40(8.97)
90.0th: 36.00 32.80(8.89)
95.0th: 39.60 36.40(8.08)
99.0th: 49.20 44.10(10.37)
99.5th: 64.90 50.50(22.19)
99.9th: 143.50 100.60(29.90)
32 Threads
50.0th: 34.60 35.50(-2.60)
75.0th: 48.20 50.50(-4.77)
90.0th: 59.20 62.40(-5.41)
95.0th: 65.20 69.00(-5.83)
99.0th: 80.40 83.80(-4.23)
99.5th: 102.10 98.90(3.13)
99.9th: 727.10 506.80(30.30)
schbench does improve in general. There is some run to run variation with
schbench. Did a validation run to confirm that trend is similar.
------------------------------------------------------------------------------
hackbench tip +patch(%gain)
20 iterations, 50000 loops sched/core
------------------------------------------------------------------------------
Process 10 groups : 11.74 11.70(0.34)
Process 20 groups : 22.73 22.69(0.18)
Process 30 groups : 33.39 33.40(-0.03)
Process 40 groups : 43.73 43.61(0.27)
Process 50 groups : 53.82 54.35(-0.98)
Process 60 groups : 64.16 65.29(-1.76)
thread 10 Time : 12.81 12.79(0.16)
thread 20 Time : 24.63 24.47(0.65)
Process(Pipe) 10 Time : 6.40 6.34(0.94)
Process(Pipe) 20 Time : 10.62 10.63(-0.09)
Process(Pipe) 30 Time : 15.09 14.84(1.66)
Process(Pipe) 40 Time : 19.42 19.01(2.11)
Process(Pipe) 50 Time : 24.04 23.34(2.91)
Process(Pipe) 60 Time : 28.94 27.51(4.94)
thread(Pipe) 10 Time : 6.96 6.87(1.29)
thread(Pipe) 20 Time : 11.74 11.73(0.09)
hackbench shows slight improvement with pipe. Slight degradation in process.
------------------------------------------------------------------------------
stress_ng tip +patch(%gain)
10 iterations 100000 cpu_ops sched/core
------------------------------------------------------------------------------
--cpu=96 -util=100 Time taken : 5.30, 5.01(5.47)
--cpu=48 -util=100 Time taken : 7.94, 6.73(15.24)
--cpu=24 -util=100 Time taken : 11.67, 8.75(25.02)
--cpu=12 -util=100 Time taken : 15.71, 15.02(4.39)
--cpu=96 -util=10 Time taken : 22.71, 22.19(2.29)
--cpu=96 -util=20 Time taken : 12.14, 12.37(-1.89)
--cpu=96 -util=30 Time taken : 8.76, 8.86(-1.14)
--cpu=96 -util=40 Time taken : 7.13, 7.14(-0.14)
--cpu=96 -util=50 Time taken : 6.10, 6.13(-0.49)
--cpu=96 -util=60 Time taken : 5.42, 5.41(0.18)
--cpu=96 -util=70 Time taken : 4.94, 4.94(0.00)
--cpu=96 -util=80 Time taken : 4.56, 4.53(0.66)
--cpu=96 -util=90 Time taken : 4.27, 4.26(0.23)
Good improvement seen with 24 CPUs. In this case only one CPU is busy,
and no core is idle. Decent improvement with 100% utilization case. no
difference in other utilization.
Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com
|
| |\ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull WARN fix from Ingo Molnar:
"Fix a missing preempt-enable in the WARN() slowpath"
* tag 'core-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
panic: Reenable preemption in WARN slowpath
|
| | | |/ / /
| | |/| | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Commit:
5a5d7e9badd2 ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG")
amended warn_slowpath_fmt() to disable preemption until the WARN splat
has been emitted.
However the commit neglected to reenable preemption in the !fmt codepath,
i.e. when a WARN splat is emitted without additional format string.
One consequence is that users may see more splats than intended. E.g. a
WARN splat emitted in a work item results in at least two extra splats:
BUG: workqueue leaked lock or atomic
(emitted by process_one_work())
BUG: scheduling while atomic
(emitted by worker_thread() -> schedule())
Ironically the point of the commit was to *avoid* extra splats. ;)
Fix it.
Fixes: 5a5d7e9badd2 ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/3ec48fde01e4ee6505f77908ba351bad200ae3d1.1694763684.git.lukas@wunner.de
|
| |\ \ \ \ \
| | |/ / / /
| |/| | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix the handling of block devices in the test_resume mode of
hibernation (Chen Yu)"
* tag 'pm-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix the exclusive get block device in test_resume mode
PM: hibernate: Rename function parameter from snapshot_test to exclusive
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Commit 5904de0d735b ("PM: hibernate: Do not get block device exclusively
in test_resume mode") fixes a hibernation issue under test_resume mode.
That commit is supposed to open the block device in non-exclusive mode
when in test_resume. However the code does the opposite, which is against
its description.
In summary, the swap device is only opened exclusively by swsusp_check()
with its corresponding *close(), and must be in non test_resume mode.
This is to avoid the race condition that different processes scribble the
device at the same time. All the other cases should use non-exclusive mode.
Fix it by really disabling exclusive mode under test_resume.
Fixes: 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode")
Closes: https://lore.kernel.org/lkml/000000000000761f5f0603324129@google.com/
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Chenzhou Feng <chenzhoux.feng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
| | | |_|/
| | |/| |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Several functions reply on snapshot_test to decide whether to
open the resume device exclusively. However there is no strict
connection between the snapshot_test and the open mode. Rename
the 'snapshot_test' input parameter to 'exclusive' to better reflect
the use case.
No functional change is expected.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Remove unused prev_offset, min_size, krec_size variables.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202309190634.fL17FWoT-lkp@intel.com/
Fixes: aaa619ebccb2 ("bpf: Refactor check_btf_func and split into two phases")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
On 32-bit architectures, the pointer width is 32-bit, while we try to
cast from a u64 down to it, the compiler complains on mismatch in
integer size. Fix this by first casting to long which should match
the pointer width on targets supported by Linux.
Fixes: ec5290a178b7 ("bpf: Prevent KASAN false positive with bpf_throw")
Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230918155233.297024-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|