summaryrefslogtreecommitdiffstats
path: root/kernel (follow)
Commit message (Collapse)AuthorAgeFilesLines
* sched/core: Fix incorrect initialization of the 'burst' parameter in ↵Cheng Yu2024-05-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpu_max_write() In the cgroup v2 CPU subsystem, assuming we have a cgroup named 'test', and we set cpu.max and cpu.max.burst: # echo 1000000 > /sys/fs/cgroup/test/cpu.max # echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst then we check cpu.max and cpu.max.burst: # cat /sys/fs/cgroup/test/cpu.max 1000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000000 Next we set cpu.max again and check cpu.max and cpu.max.burst: # echo 2000000 > /sys/fs/cgroup/test/cpu.max # cat /sys/fs/cgroup/test/cpu.max 2000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000 ... we find that the cpu.max.burst value changed unexpectedly. In cpu_max_write(), the unit of the burst value returned by tg_get_cfs_burst() is microseconds, while in cpu_max_write(), the burst unit used for calculation should be nanoseconds, which leads to the bug. To fix it, get the burst value directly from tg->cfs_bandwidth.burst. Fixes: f4183717b370 ("sched/fair: Introduce the burstable CFS controller") Reported-by: Qixin Liao <liaoqixin@huawei.com> Signed-off-by: Cheng Yu <serein.chengyu@huawei.com> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com
* sched/fair: Remove stale FREQUENCY_UTIL commentChristian Loehle2024-05-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | On 05/03/2024 15:05, Vincent Guittot wrote: I'm fine with either and that was my first thought here, too, but it did seem like the comment was mostly placed there to justify the 'unexpected' high utilization when explicitly passing FREQUENCY_UTIL and the need to clamp it then. So removing did feel slightly more natural to me anyway. So alternatively: From: Christian Loehle <christian.loehle@arm.com> Date: Tue, 5 Mar 2024 09:34:41 +0000 Subject: [PATCH] sched/fair: Remove stale FREQUENCY_UTIL mention effective_cpu_util() flags were removed, so remove mentioning of the flag. commit 9c0b4bb7f6303 ("sched/cpufreq: Rework schedutil governor performance estimation") reworked effective_cpu_util() removing enum cpu_util_type. Modify the comment accordingly. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/0e2833ee-0939-44e0-82a2-520a585a0153@arm.com
* sched/fair: Fix initial util_avg calculationDawei Li2024-05-171-2/+3
| | | | | | | | | | | | | | | | | | | | | Change se->load.weight to se_weight(se) in the calculation for the initial util_avg to avoid unnecessarily inflating the util_avg by 1024 times. The reason is that se->load.weight has the unit/scale as the scaled-up load, while cfs_rg->avg.load_avg has the unit/scale as the true task weight (as mapped directly from the task's nice/priority value). With CONFIG_32BIT, the scaled-up load is equal to the true task weight. With CONFIG_64BIT, the scaled-up load is 1024 times the true task weight. Thus, the current code may inflate the util_avg by 1024 times. The follow-up capping will not allow the util_avg value to go wild. But the calculation should have the correct logic. Signed-off-by: Dawei Li <daweilics@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
* sched/debug: Dump domains' levelVitalii Bursov2024-05-171-0/+1
| | | | | | | | | | | | | | | | | | | | | Knowing domain's level exactly can be useful when setting relax_domain_level or cpuset.sched_relax_domain_level Usage: cat /debug/sched/domains/cpu0/domain1/level to dump cpu0 domain1's level. SDM macro is not used because sd->level is 'int' and it would hide the type mismatch between 'int' and 'u32'. Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
* sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_levelVitalii Bursov2024-05-172-2/+2
| | | | | | | | | | | | | | | | | | | | | | Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing. This matches the behavior described in the documentation: -1 no request. use system default or follow request of others. 0 no search. 1 search siblings (hyperthreads in a core). "2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels. Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core") Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
* Merge tag 'irq-core-2024-05-12' of ↵Linus Torvalds2024-05-149-53/+311
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: "Core code: - Interrupt storm detection for the lockup watchdog: Lockups which are caused by interrupt storms are not easy to debug because there is no information about the events which make the lockup detector trigger. To make this more user friendly, provide an extenstion to interrupt statistics which allows to take snapshots and an interface to retrieve the delta to the snapshot. Use this new mechanism in the watchdog code to do a two stage lockup analysis by taking the snapshot and printing the deltas for the topmost active interrupts on the second trigger. Note: This contains both the interrupt and the watchdog changes as the latter depend on the former obviously. - Avoid summation loops in the /proc/interrupts output and use the global counter when possible - Skip suspended interrupts on CPU hotplug operations to ensure that they are not delivered before the system resumes the device drivers when coming out of suspend. - On CPU hot-unplug interrupts which are affine to the outgoing CPU are migrated to a different CPU in the affinity mask. This can fail when the CPUs have no vectors left. Instead of giving up try to migrate it to any online CPU and thereby breaking the affinity setting in order to prevent a stale device interrupt which targets an offline CPU - The usual small cleanups Driver code: - Support for the RISCV AIA MSI controller - Make the interrupt allocation for the Loongson PCH controller more flexible to prevent vector exhaustion - The usual set of cleanups and fixes all over the place" * tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc cpuidle: Avoid explicit cpumask allocation on stack irqchip/sifive-plic: Avoid explicit cpumask allocation on stack irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack cpumask: Introduce cpumask_first_and_and() irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown genirq: Reuse irq_is_nmi() genirq/cpuhotplug: Retry with cpu_online_mask when migration fails genirq/cpuhotplug: Skip suspended interrupts when restoring affinity arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251 arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251 ARM: dts: stm32: List exti parent interrupts on stm32mp131 ARM: dts: stm32: List exti parent interrupts on stm32mp151 arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32 irqchip/stm32-exti: Mark events reserved with RIF configuration check irqchip/stm32-exti: Skip secure events irqchip/stm32-exti: Convert driver to standard PM ...
| * genirq: Reuse irq_is_nmi()Jinjie Ruan2024-04-244-14/+14
| | | | | | | | | | | | | | | | | | Move irq_is_nmi() to the internal header file and reuse it all over the place. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423024037.3331215-1-ruanjinjie@huawei.com
| * genirq/cpuhotplug: Retry with cpu_online_mask when migration failsDongli Zhang2024-04-241-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a CPU goes offline, the interrupts affine to that CPU are re-configured. Managed interrupts undergo either migration to other CPUs or shutdown if all CPUs listed in the affinity are offline. The migration of managed interrupts is guaranteed on x86 because there are interrupt vectors reserved. Regular interrupts are migrated to a still online CPU in the affinity mask or if there is no online CPU to any online CPU. This works as long as the still online CPUs in the affinity mask have interrupt vectors available, but in case that none of those CPUs has a vector available the migration fails and the device interrupt becomes stale. This is not any different from the case where the affinity mask does not contain any online CPU, but there is no fallback operation for this. Instead of giving up, retry the migration attempt with the online CPU mask if the interrupt is not managed, as managed interrupts cannot be affected by this problem. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423073413.79625-1-dongli.zhang@oracle.com
| * genirq/cpuhotplug: Skip suspended interrupts when restoring affinityDavid Stevens2024-04-242-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | irq_restore_affinity_of_irq() restarts managed interrupts unconditionally when the first CPU in the affinity mask comes online. That's correct during normal hotplug operations, but not when resuming from S3 because the drivers are not resumed yet and interrupt delivery is not expected by them. Skip the startup of suspended interrupts and let resume_device_irqs() deal with restoring them. This ensures that irqs are not delivered to drivers during the noirq phase of resuming from S3, after non-boot CPUs are brought back online. Signed-off-by: David Stevens <stevensd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240424090341.72236-1-stevensd@chromium.org
| * genirq: Simplify the checks for irq_set_percpu_devid_partition()Jinjie Ruan2024-04-231-8/+2
| | | | | | | | | | | | | | | | | | | | | | Since whether desc is NULL or desc->percpu_enabled is true, it returns -EINVAL, check them together, and assign desc->percpu_affinity using a ternary to simplify the code. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240417085356.3785381-1-ruanjinjie@huawei.com
| * watchdog/softlockup: Report the most frequent interruptsBitao Hu2024-04-121-4/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the watchdog determines that the current soft lockup is due to an interrupt storm based on CPU utilization, reporting the most frequent interrupts could be good enough for further troubleshooting. Below is an example of interrupt storm. The call tree does not provide useful information, but analyzing which interrupt caused the soft lockup by comparing the counts of interrupts during the lockup period allows to identify the culprit. [ 638.870231] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [swapper/9:0] [ 638.870825] CPU#9 Utilization every 4s during lockup: [ 638.871194] #1: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.871652] #2: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.872107] #3: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.872563] #4: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.873018] #5: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.873494] CPU#9 Detect HardIRQ Time exceeds 50%. Most frequent HardIRQs: [ 638.873994] #1: 330945 irq#7 [ 638.874236] #2: 31 irq#82 [ 638.874493] #3: 10 irq#10 [ 638.874744] #4: 2 irq#89 [ 638.874992] #5: 1 irq#102 ... [ 638.875313] Call trace: [ 638.875315] __do_softirq+0xa8/0x364 Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240411074134.30922-6-yaoma@linux.alibaba.com
| * watchdog/softlockup: Low-overhead detection of interrupt stormBitao Hu2024-04-121-1/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following softlockup is caused by interrupt storm, but it cannot be identified from the call tree. Because the call tree is just a snapshot and doesn't fully capture the behavior of the CPU during the soft lockup. watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] ... Call trace: __do_softirq+0xa0/0x37c __irq_exit_rcu+0x108/0x140 irq_exit+0x14/0x20 __handle_domain_irq+0x84/0xe0 gic_handle_irq+0x80/0x108 el0_irq_naked+0x50/0x58 Therefore, it is necessary to report CPU utilization during the softlockup_threshold period (report once every sample_period, for a total of 5 reportings), like this: watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] CPU#28 Utilization every 4s during lockup: #1: 0% system, 0% softirq, 100% hardirq, 0% idle #2: 0% system, 0% softirq, 100% hardirq, 0% idle #3: 0% system, 0% softirq, 100% hardirq, 0% idle #4: 0% system, 0% softirq, 100% hardirq, 0% idle #5: 0% system, 0% softirq, 100% hardirq, 0% idle ... This is helpful in determining whether an interrupt storm has occurred or in identifying the cause of the softlockup. The criteria for determination are as follows: a. If the hardirq utilization is high, then interrupt storm should be considered and the root cause cannot be determined from the call tree. b. If the softirq utilization is high, then the call might not necessarily point at the root cause. c. If the system utilization is high, then analyzing the root cause from the call tree is possible in most cases. The mechanism requires a considerable amount of global storage space when configured for the maximum number of CPUs. Therefore, adding a SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes" if the max number of CPUs is <= 128. Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com
| * genirq: Avoid summation loops for /proc/interruptsBitao Hu2024-04-123-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | show_interrupts() unconditionally accumulates the per CPU interrupt statistics to determine whether an interrupt was ever raised. This can be avoided for all interrupts which are not strictly per CPU and not of type NMI because those interrupts provide already an accumulated counter. The required logic is already implemented in kstat_irqs(). Split the inner access logic out of kstat_irqs() and use it for kstat_irqs() and show_interrupts() to avoid the accumulation loop when possible. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240411074134.30922-4-yaoma@linux.alibaba.com
| * genirq: Provide a snapshot mechanism for interrupt statisticsBitao Hu2024-04-122-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The soft lockup detector lacks a mechanism to identify interrupt storms as root cause of a lockup. To enable this the detector needs a mechanism to snapshot the interrupt count statistics on a CPU when the detector observes a potential lockup scenario and compare that against the interrupt count when it warns about the lockup later on. The number of interrupts in that period give a hint whether the lockup might have been caused by an interrupt storm. Instead of having extra storage in the lockup detector and accessing the internals of the interrupt descriptor directly, add a snapshot member to the per CPU irq_desc::kstat_irq structure and provide interfaces to take a snapshot of all interrupts on the current CPU and to retrieve the delta of a specific interrupt later on. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240411074134.30922-3-yaoma@linux.alibaba.com
| * genirq: Convert kstat_irqs to a structBitao Hu2024-04-123-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The irq_desc::kstat_irqs member is a per-CPU variable of type int, which is only capable of counting. A snapshot mechanism for interrupt statistics will be added soon, which requires an additional variable to store the snapshot. To facilitate expansion, convert kstat_irqs here to a struct containing only the count. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240411074134.30922-2-yaoma@linux.alibaba.com
| * irqdomain: Check virq for 0 before use in irq_dispose_mapping()Andy Shevchenko2024-04-081-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's a bit hard to read the logic since the virq is used before checking it for 0. Rearrange the code to make it better to understand. This, in particular, should clearly answer the question whether the caller needs to perform this check or not, and there are plenty of places for both variants, confirming a confusion. Fun fact that the new code is shorter: Function old new delta irq_dispose_mapping 278 271 -7 Total: Before=11625, After=11618, chg -0.06% when compiled by GCC on Debian for x86_64. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240405190105.3932034-1-andriy.shevchenko@linux.intel.com
* | Merge tag 'timers-core-2024-05-12' of ↵Linus Torvalds2024-05-146-104/+87
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: "Core code: - Make timekeeping and VDSO time readouts resilent against math overflow: In guest context the kernel is prone to math overflow when the host defers the timer interrupt due to overload, malfunction or malice. This can be mitigated by checking the clocksource delta for the maximum deferrement which is readily available. If that value is exceeded then the code uses a slowpath function which can handle the multiplication overflow. This functionality is enabled unconditionally in the kernel, but made conditional in the VDSO code. The latter is conditional because it allows architectures to optimize the check so it is not causing performance regressions. On X86 this is achieved by reworking the existing check for negative TSC deltas as a negative delta obviously exceeds the maximum deferrement when it is evaluated as an unsigned value. That avoids two conditionals in the hotpath and allows to hide both the negative delta and the large delta handling in the same slow path. - Add an initial minimal ktime_t abstraction for Rust - The usual boring cleanups and enhancements Drivers: - Boring updates to device trees and trivial enhancements in various drivers" * tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) clocksource/drivers/arm_arch_timer: Mark hisi_161010101_oem_info const clocksource/drivers/timer-ti-dm: Remove an unused field in struct dmtimer clocksource/drivers/renesas-ostm: Avoid reprobe after successful early probe clocksource/drivers/renesas-ostm: Allow OSTM driver to reprobe for RZ/V2H(P) SoC dt-bindings: timer: renesas: ostm: Document Renesas RZ/V2H(P) SoC rust: time: doc: Add missing C header links clocksource: Make the int help prompt unit readable in ncurses hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active() timerqueue: Remove never used function timerqueue_node_expires() rust: time: Add Ktime vdso: Fix powerpc build U64_MAX undeclared error clockevents: Convert s[n]printf() to sysfs_emit() clocksource: Convert s[n]printf() to sysfs_emit() clocksource: Make watchdog and suspend-timing multiplication overflow safe timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflow timekeeping: Make delta calculation overflow safe timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safety timekeeping: Fold in timekeeping_delta_to_ns() timekeeping: Consolidate timekeeping helpers timekeeping: Refactor timekeeping helpers ...
| * | clocksource: Make the int help prompt unit readable in ncursesBorislav Petkov (AMD)2024-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing make menuconfig and searching for the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US config item, the help says: │ Symbol: CLOCKSOURCE_WATCHDOG_MAX_SKEW_US [=125] │ Type : integer │ Range : [50 1000] │ Defined at kernel/time/Kconfig:204 │ Prompt: Clocksource watchdog maximum allowable skew (in s) ^^^ │ Depends on: GENERIC_CLOCKEVENTS [=y] && CLOCKSOURCE_WATCHDOG [=y] because on some terminals, it cannot display the 'μ' char, unicode number 0x3bc. So simply write it out so that there's no trouble. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240428102143.26764-1-bp@kernel.org
| * | hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active()Jiapeng Chong2024-04-221-13/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function hrtimer_hres_active() are defined in the hrtimer.c file, but not called elsewhere, so rename __hrtimer_hres_active() to hrtimer_hres_active() and remove the old hrtimer_hres_active() function. kernel/time/hrtimer.c:653:19: warning: unused function 'hrtimer_hres_active'. Fixes: 82ccdf062a64 ("hrtimer: Remove unused function") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20240418023000.130324-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8778
| * | clockevents: Convert s[n]printf() to sysfs_emit()Li Zhijian2024-04-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. coccinelle complains that there are still a couple of functions that use snprintf(). Convert them to sysfs_emit(). Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240314100402.1326582-2-lizhijian@fujitsu.com
| * | clocksource: Convert s[n]printf() to sysfs_emit()Li Zhijian2024-04-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. coccinelle complains that there are still a couple of functions that use snprintf(). Convert them to sysfs_emit(). Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240314100402.1326582-1-lizhijian@fujitsu.com
| * | clocksource: Make watchdog and suspend-timing multiplication overflow safeAdrian Hunter2024-04-081-22/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel timekeeping is designed to keep the change in cycles (since the last timer interrupt) below max_cycles, which prevents multiplication overflow when converting cycles to nanoseconds. However, if timer interrupts stop, the clocksource_cyc2ns() calculation will eventually overflow. Add protection against that. Simplify by folding together clocksource_delta() and clocksource_cyc2ns() into cycles_to_nsec_safe(). Check against max_cycles, falling back to a slower higher precision calculation. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-20-adrian.hunter@intel.com
| * | timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflowAdrian Hunter2024-04-081-18/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the case !CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE, forego overflow protection in the range (mask << 1) < delta <= mask, and interpret it always as an inconsistency between CPU clock values. That allows slightly neater code, and it is on a slow path so has no effect on performance. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-19-adrian.hunter@intel.com
| * | timekeeping: Make delta calculation overflow safeAdrian Hunter2024-04-081-11/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel timekeeping is designed to keep the change in cycles (since the last timer interrupt) below max_cycles, which prevents multiplication overflow when converting cycles to nanoseconds. However, if timer interrupts stop, the calculation will eventually overflow. Add protection against that. In timekeeping_cycles_to_ns() calculation, check against max_cycles, falling back to a slower higher precision calculation. In timekeeping_forward_now(), process delta in chunks of at most max_cycles. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-18-adrian.hunter@intel.com
| * | timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safetyAdrian Hunter2024-04-081-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Open code clocksource_delta() in timekeeping_cycles_to_ns() so that overflow safety can be added efficiently. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-17-adrian.hunter@intel.com
| * | timekeeping: Fold in timekeeping_delta_to_ns()Adrian Hunter2024-04-081-12/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | timekeeping_delta_to_ns() is now called only from timekeeping_cycles_to_ns(), and it is not useful otherwise. Simplify the code by folding it into timekeeping_cycles_to_ns(). No functional change. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-16-adrian.hunter@intel.com
| * | timekeeping: Consolidate timekeeping helpersAdrian Hunter2024-04-081-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consolidate timekeeping helpers, making use of timekeeping_cycles_to_ns() in preference to directly using timekeeping_delta_to_ns(). No functional change. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-15-adrian.hunter@intel.com
| * | timekeeping: Refactor timekeeping helpersAdrian Hunter2024-04-081-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the usage of timekeeping sanity checking, in preparation for consolidating timekeeping helpers. This works towards eliminating timekeeping_delta_to_ns() in favour of timekeeping_cycles_to_ns(). No functional change. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-14-adrian.hunter@intel.com
| * | timekeeping: Reuse timekeeping_cycles_to_ns()Adrian Hunter2024-04-081-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify __timekeeping_get_ns() by reusing timekeeping_cycles_to_ns(). No functional change. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-13-adrian.hunter@intel.com
| * | timekeeping: Tidy timekeeping_cycles_to_ns() slightlyAdrian Hunter2024-04-081-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Put together declaration and initialization of the local variable 'delta'. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-12-adrian.hunter@intel.com
| * | timekeeping: Rename fast_tk_get_delta_ns() to __timekeeping_get_ns()Adrian Hunter2024-04-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename fast_tk_get_delta_ns() to __timekeeping_get_ns() to prepare for its reuse as a general timekeeping helper function. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-11-adrian.hunter@intel.com
| * | timekeeping: Move timekeeping helper functionsAdrian Hunter2024-04-081-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move timekeeping helper functions to prepare for their reuse. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-10-adrian.hunter@intel.com
| * | vdso: Add vdso_data:: Max_cyclesAdrian Hunter2024-04-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add vdso_data::max_cycles in preparation to use it to detect potential multiplication overflow. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-7-adrian.hunter@intel.com
| * | hrtimer: Remove unused functionJiapeng Chong2024-04-081-19/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function is defined, but not called anywhere: kernel/time/hrtimer.c:1880:20: warning: unused function '__hrtimer_peek_ahead_timers'. Remove it. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322070441.29646-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8611
* | | Merge tag 'sched-core-2024-05-13' of ↵Linus Torvalds2024-05-1411-325/+381
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Add cpufreq pressure feedback for the scheduler - Rework misfit load-balancing wrt affinity restrictions - Clean up and simplify the code around ::overutilized and ::overload access. - Simplify sched_balance_newidle() - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES handling that changed the output. - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch() - Reorganize, clean up and unify most of the higher level scheduler balancing function names around the sched_balance_*() prefix - Simplify the balancing flag code (sched_balance_running) - Miscellaneous cleanups & fixes * tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/pelt: Remove shift of thermal clock sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() thermal/cpufreq: Remove arch_update_thermal_pressure() sched/cpufreq: Take cpufreq feedback into account cpufreq: Add a cpufreq pressure feedback for the scheduler sched/fair: Fix update of rd->sg_overutilized sched/vtime: Do not include <asm/vtime.h> header s390/irq,nmi: Include <asm/vtime.h> header directly s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover sched/vtime: Get rid of generic vtime_task_switch() implementation sched/vtime: Remove confusing arch_vtime_task_switch() declaration sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() sched/fair: Rename root_domain::overload to ::overloaded sched/fair: Use helper functions to access root_domain::overload sched/fair: Check root_domain::overload value before update sched/fair: Combine EAS check with root_domain::overutilized access sched/fair: Simplify the continue_balancing logic in sched_balance_newidle() ...
| * | | sched/pelt: Remove shift of thermal clockVincent Guittot2024-04-243-27/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The optional shift of the clock used by thermal/hw load avg has been introduced to handle case where the signal was not always a high frequency hw signal. Now that cpufreq provides a signal for firmware and SW pressure, we can remove this exception and always keep this PELT signal aligned with other signals. Mark sysctl_sched_migration_cost boot parameter as deprecated Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org
| * | | sched/cpufreq: Rename arch_update_thermal_pressure() => ↵Vincent Guittot2024-04-245-34/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arch_update_hw_pressure() Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such high frequency signal needs filtering to be smoothed and provide an value that reflects the average available capacity into the scheduler time scale. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org
| * | | sched/cpufreq: Take cpufreq feedback into accountVincent Guittot2024-04-241-20/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Aggregate the different pressures applied on the capacity of CPUs and create a new function that returns the actual capacity of the CPU: get_actual_cpu_capacity(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240326091616.3696851-3-vincent.guittot@linaro.org
| * | | sched/fair: Fix update of rd->sg_overutilizedVincent Guittot2024-04-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sg_overloaded is used instead of sg_overutilized to update rd->sg_overutilized. Fixes: 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240404155738.2866102-1-vincent.guittot@linaro.org
| * | | sched/vtime: Get rid of generic vtime_task_switch() implementationAlexander Gordeev2024-04-171-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic vtime_task_switch() implementation gets built only if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an architecture to implement arch_vtime_task_switch() callback at the same time, which is confusing. Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC architecture only and vtime_task_switch() generic variant is rather superfluous. Simplify the whole vtime_task_switch() wiring by moving the existing generic implementation to PowerPC. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2cb6e3caada93623f6d4f78ad938ac6cd0e2fda8.1712760275.git.agordeev@linux.ibm.com
| * | | sched/balancing: Simplify the sg_status bitmask and use separate ↵Ingo Molnar2024-03-292-29/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->overloaded and ->overutilized flags SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an unnecessary complication that only make the code harder to read and slower. We only ever set them separately: thule:~/tip> git grep SG_OVER kernel/sched/ kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED); kernel/sched/fair.c: *sg_status |= SG_OVERLOADED; kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED; kernel/sched/fair.c: *sg_status |= SG_OVERLOADED; kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED); kernel/sched/fair.c: sg_status & SG_OVERUTILIZED); kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) { kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED); kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */ kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */ kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED); And use them separately, which results in suboptimal code: /* update overload indicator if we are at root domain */ set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED); /* Update over-utilization (tipping point, U >= 0) indicator */ set_rd_overutilized_status(env->dst_rq->rd, Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats() and its lower level functions, and change all of them to 'bool'. Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Qais Yousef <qyousef@layalina.io> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com
| * | | sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()Ingo Molnar2024-03-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The _status() postfix has no real meaning, simplify the naming and harmonize it with set_rd_overloaded(). Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
| * | | sched/fair: Rename SG_OVERLOAD to SG_OVERLOADEDIngo Molnar2024-03-282-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Follow the rename of the root_domain::overloaded flag. Note that this also matches the SG_OVERUTILIZED flag better. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
| * | | sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()Ingo Molnar2024-03-282-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Follow the rename of the root_domain::overloaded flag. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
| * | | sched/fair: Rename root_domain::overload to ::overloadedIngo Molnar2024-03-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is silly to use an ambiguous noun instead of a clear adjective when naming such a flag ... Note how root_domain::overutilized already used a proper adjective. rd->overloaded is now set to 1 when the root domain is overloaded. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
| * | | sched/fair: Use helper functions to access root_domain::overloadShrikanth Hegde2024-03-282-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce two helper functions to access & set the root_domain::overload flag: get_rd_overload() set_rd_overload() To make sure code is always following READ_ONCE()/WRITE_ONCE() access methods. No change in functionality intended. [ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ] Suggested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com
| * | | sched/fair: Check root_domain::overload value before updateShrikanth Hegde2024-03-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The root_domain::overload flag is 1 when there's any rq in the root domain that has 2 or more running tasks. (Ie. it's overloaded.) The root_domain structure itself is a global structure per cpuset island. The ::overload flag is maintained the following way: - Set when adding a second task to the runqueue. - It is cleared in update_sd_lb_stats() during load balance, if none of the rqs have 2 or more running tasks. This flag is used during newidle balance to see if its worth doing a full load balance pass, which can be an expensive operation. If it is set, then newidle balance will try to aggressively pull a task. Since commit: 630246a06ae2 ("sched/fair: Clean-up update_sg_lb_stats parameters") ::overload is being written unconditionally, even if it has the same value. The change in value of this depends on the workload, but on typical workloads, it doesn't change all that often: a system is either dominantly overloaded for substantial amounts of time, or not. Extra writes to this semi-global structure cause unnecessary overhead, extra bus traffic, etc. - so avoid it as much as possible. Perf probe stats show that it's worth making this change (numbers are with patch applied): 1M probe:sched_balance_newidle_L38 139 probe:update_sd_lb_stats_L53 <====== 1->0 writes 129K probe:add_nr_running_L12 74 probe:add_nr_running_L13 <====== 0->1 writes 54K probe:update_sd_lb_stats_L50 <====== reads These numbers prove that actual change in the ::overload value is (much) less frequent: L50 is much larger at ~54,000 accesses vs L53+L13 of 139+74. [ mingo: Rewrote the changelog. ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240325054505.201995-2-sshegde@linux.ibm.com
| * | | sched/fair: Combine EAS check with root_domain::overutilized accessShrikanth Hegde2024-03-281-13/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Access to root_domainoverutilized is always used with sched_energy_enabled in the pattern: if (sched_energy_enabled && !overutilized) do something So modify the helper function to utilize this pattern. This is more readable code as it would say, do something when root domain is not overutilized. This function always return true when EAS is disabled. No change in functionality intended. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240326152616.380999-1-sshegde@linux.ibm.com
| * | | sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()Shrikanth Hegde2024-03-261-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | newidle(CPU_NEWLY_IDLE) balancing doesn't stop the load-balancing if the continue_balancing flag is reset, but the other two balancing (IDLE, BUSY) cases do that. newidle balance stops the load balancing if rq has a task or there is wakeup pending. The same checks are present in should_we_balance for newidle. Hence use the return value and simplify continue_balancing mechanism for newidle. Update the comment surrounding it as well. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20240325153926.274284-1-sshegde@linux.ibm.com
| * | | sched/fair: Introduce is_rd_overutilized() helper function to access ↵Shrikanth Hegde2024-03-261-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | root_domain::overutilized The root_domain::overutilized field is READ_ONCE() accessed in multiple places, which could be simplified with a helper function. This might also make it more apparent that it needs to be used only in case of EAS. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-3-sshegde@linux.ibm.com