diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2016-01-13 05:25:09 +0100 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2016-01-13 05:25:09 +0100 |
commit | 67990608c8b95d2b8ccc29932376ae73d5818727 (patch) | |
tree | 760f66ff4a41d38d52acdbc65a526c45d1ef4f48 /Documentation/cpu-freq/intel-pstate.txt | |
parent | Merge tag 'trace-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/roste... (diff) | |
parent | Merge branch 'powercap' (diff) | |
download | linux-67990608c8b95d2b8ccc29932376ae73d5818727.tar.xz linux-67990608c8b95d2b8ccc29932376ae73d5818727.zip |
Merge tag 'pm+acpi-4.5-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull oower management and ACPI updates from Rafael Wysocki:
"As far as the number of commits goes, ACPICA takes the lead this time,
followed by cpufreq and the device properties framework changes.
The most significant new feature is the debugfs-based interface to the
ACPICA's AML debugger added in the previous cycle and a new user space
tool for accessing it.
On the cpufreq front, the core is updated to handle governors more
efficiently, particularly on systems where a single cpufreq policy
object is shared between multiple CPUs, and there are quite a few
changes in drivers (intel_pstate, cpufreq-dt etc).
The device properties framework is updated to handle built-in (ie
included in the kernel itself) device properties better, among other
things by adding a fallback mechanism that will allow drivers to
provide default properties to be used in case the plaform firmware
doesn't provide the properties expected by them.
The Operating Performance Points (OPP) framework gets new DT bindings
and debugfs support.
A new cpufreq driver for ST platforms is added and the ACPI driver for
AMD SoCs will now support the APM X-Gene ACPI I2C device.
The rest is mostly fixes and cleanups all over.
Specifics:
- Add a debugfs-based interface for interacting with the ACPICA's AML
debugger introduced in the previous cycle and a new user space tool
for that, fix some bugs related to the AML debugger and clean up
the code in question (Lv Zheng, Dan Carpenter, Colin Ian King,
Markus Elfring).
- Update ACPICA to upstream revision 20151218 including a number of
fixes and cleanups in the ACPICA core (Bob Moore, Lv Zheng, Labbe
Corentin, Prarit Bhargava, Colin Ian King, David E Box, Rafael
Wysocki).
In particular, the previously added erroneous support for the _SUB
object is dropped, the concatenate operator will support all ACPI
objects now, the Debug Object handling is improved, the SuperName
handling of parameters being control methods is fixed, the
ObjectType operator handling is updated to follow ACPI 5.0A and the
handling of CondRefOf and RefOf is updated accordingly, module-
level code will be executed after loading each ACPI table now
(instead of being run once after all tables containing AML have
been loaded), the Operation Region handlers management is updated
to fix some reported problems and a the ACPICA code in the kernel
is more in line with the upstream now.
- Update the ACPI backlight driver to provide information on whether
or not it will generate key-presses for brightness change hotkeys
and update some platform drivers (dell-wmi, thinkpad_acpi) to use
that information to avoid sending double key-events to users pace
for these, add new ACPI backlight quirks (Hans de Goede, Aaron Lu,
Adrien Schildknecht).
- Improve the ACPI handling of interrupt GPIOs (Christophe Ricard).
- Fix the handling of the list of device IDs of device objects found
in the ACPI namespace and add a helper for checking if there is a
device object for a given device ID (Lukas Wunner).
- Change the logic in the ACPI namespace scanning code to create
struct acpi_device objects for all ACPI device objects found in the
namespace even if _STA fails for them which helps to avoid device
enumeration problems on Microsoft Surface 3 (Aaron Lu).
- Add support for the APM X-Gene ACPI I2C device to the ACPI driver
for AMD SoCs (Loc Ho).
- Fix the long-standing issue with the DMA controller on Intel SoCs
where ACPI tables have no power management support for the DMA
controller itself, but it can be powered off automatically when the
last (other) device on the SoC is powered off via ACPI and clean up
the ACPI driver for Intel SoCs (acpi-lpss) after previous attempts
to fix that problem (Andy Shevchenko).
- Assorted ACPI fixes and cleanups (Andy Lutomirski, Colin Ian King,
Javier Martinez Canillas, Ken Xue, Mathias Krause, Rafael Wysocki,
Sinan Kaya).
- Update the device properties framework for better handling of
built-in properties, add support for built-in properties to the
platform bus type, update the MFD subsystem's handling of device
properties and add support for passing default configuration data
as device properties to the intel-lpss MFD drivers, convert the
designware I2C driver to use the unified device properties API and
add a fallback mechanism for using default built-in properties if
the platform firmware fails to provide the properties as expected
by drivers (Andy Shevchenko, Mika Westerberg, Heikki Krogerus,
Andrew Morton).
- Add new Device Tree bindings to the Operating Performance Points
(OPP) framework and update the exynos4412 DT binding accordingly,
introduce debugfs support for the OPP framework (Viresh Kumar,
Bartlomiej Zolnierkiewicz).
- Migrate the mt8173 cpufreq driver to the new OPP bindings (Pi-Cheng
Chen).
- Update the cpufreq core to make the handling of governors more
efficient, especially on systems where policy objects are shared
between multiple CPUs (Viresh Kumar, Rafael Wysocki).
- Fix cpufreq governor handling on configurations with
CONFIG_HZ_PERIODIC set (Chen Yu).
- Clean up the cpufreq core code related to the boost sysfs knob
support and update the ACPI cpufreq driver accordingly (Rafael
Wysocki).
- Add a new cpufreq driver for ST platforms and corresponding Device
Tree bindings (Lee Jones).
- Update the intel_pstate driver to allow the P-state selection
algorithm used by it to depend on the CPU ID of the processor it is
running on, make it use a special P-state selection algorithm (with
an IO wait time compensation tweak) on Atom CPUs based on the
Airmont and Silvermont cores so as to reduce their energy
consumption and improve intel_pstate documentation (Philippe
Longepe, Srinivas Pandruvada).
- Update the cpufreq-dt driver to support registering cooling devices
that use the (P * V^2 * f) dynamic power draw formula where V is
the voltage, f is the frequency and P is a constant coefficient
provided by Device Tree and update the arm_big_little cpufreq
driver to use that support (Punit Agrawal).
- Assorted cpufreq driver (cpufreq-dt, qoriq, pcc-cpufreq,
blackfin-cpufreq) updates (Andrzej Hajda, Hongtao Jia, Jacob
Tanenbaum, Markus Elfring).
- cpuidle core tweaks related to polling and measured_us calculation
(Rik van Riel).
- Removal of modularity from a few cpuidle drivers (clps711x, ux500,
exynos) that cannot be built as modules in practice (Paul
Gortmaker).
- PM core update to prevent devices from being probed during system
suspend/resume which is generally problematic and may lead to
inconsistent behavior (Grygorii Strashko).
- Assorted updates of the PM core and related code (Julia Lawall,
Manuel Pégourié-Gonnard, Maruthi Bayyavarapu, Rafael Wysocki, Ulf
Hansson).
- PNP bus type updates (Christophe Le Roy, Heiner Kallweit).
- PCI PM code cleanups (Jarkko Nikula, Julia Lawall).
- cpupower tool updates (Jacob Tanenbaum, Thomas Renninger)"
* tag 'pm+acpi-4.5-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (177 commits)
PM / clk: don't leave clocks enabled when driver not bound
i2c: dw: Add APM X-Gene ACPI I2C device support
ACPI / APD: Add APM X-Gene ACPI I2C device support
ACPI / LPSS: change 'does not have' to 'has' in comment
Revert "dmaengine: dw: platform: provide platform data for Intel"
dmaengine: dw: return immediately from IRQ when DMA isn't in use
dmaengine: dw: platform: power on device on shutdown
ACPI / LPSS: override power state for LPSS DMA device
PM / OPP: Use snprintf() instead of sprintf()
Documentation: cpufreq: intel_pstate: enhance documentation
ACPI, PCI, irq: remove redundant check for null string pointer
ACPI / video: driver must be registered before checking for keypresses
cpufreq-dt: fix handling regulator_get_voltage() result
cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC
PM / sleep: Add support for read-only sysfs attributes
ACPI: Fix white space in a structure definition
ACPI / SBS: fix inconsistent indenting inside if statement
PNP: respect PNP_DRIVER_RES_DO_NOT_CHANGE when detaching
ACPI / PNP: constify device IDs
ACPI / PCI: Simplify acpi_penalize_isa_irq()
...
Diffstat (limited to 'Documentation/cpu-freq/intel-pstate.txt')
-rw-r--r-- | Documentation/cpu-freq/intel-pstate.txt | 241 |
1 files changed, 199 insertions, 42 deletions
diff --git a/Documentation/cpu-freq/intel-pstate.txt b/Documentation/cpu-freq/intel-pstate.txt index be8d4006bf76..f7b12c071d53 100644 --- a/Documentation/cpu-freq/intel-pstate.txt +++ b/Documentation/cpu-freq/intel-pstate.txt @@ -1,61 +1,131 @@ -Intel P-state driver +Intel P-State driver -------------------- -This driver provides an interface to control the P state selection for -SandyBridge+ Intel processors. The driver can operate two different -modes based on the processor model, legacy mode and Hardware P state (HWP) -mode. - -In legacy mode, the Intel P-state implements two internal governors, -performance and powersave, that differ from the general cpufreq governors of -the same name (the general cpufreq governors implement target(), whereas the -internal Intel P-state governors implement setpolicy()). The internal -performance governor sets the max_perf_pct and min_perf_pct to 100; that is, -the governor selects the highest available P state to maximize the performance -of the core. The internal powersave governor selects the appropriate P state -based on the current load on the CPU. - -In HWP mode P state selection is implemented in the processor -itself. The driver provides the interfaces between the cpufreq core and -the processor to control P state selection based on user preferences -and reporting frequency to the cpufreq core. In this mode the -internal Intel P-state governor code is disabled. - -In addition to the interfaces provided by the cpufreq core for -controlling frequency the driver provides sysfs files for -controlling P state selection. These files have been added to -/sys/devices/system/cpu/intel_pstate/ - - max_perf_pct: limits the maximum P state that will be requested by - the driver stated as a percentage of the available performance. The - available (P states) performance may be reduced by the no_turbo +This driver provides an interface to control the P-State selection for the +SandyBridge+ Intel processors. + +The following document explains P-States: +http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf +As stated in the document, P-State doesn’t exactly mean a frequency. However, for +the sake of the relationship with cpufreq, P-State and frequency are used +interchangeably. + +Understanding the cpufreq core governors and policies are important before +discussing more details about the Intel P-State driver. Based on what callbacks +a cpufreq driver provides to the cpufreq core, it can support two types of +drivers: +- with target_index() callback: In this mode, the drivers using cpufreq core +simply provide the minimum and maximum frequency limits and an additional +interface target_index() to set the current frequency. The cpufreq subsystem +has a number of scaling governors ("performance", "powersave", "ondemand", +etc.). Depending on which governor is in use, cpufreq core will call for +transitions to a specific frequency using target_index() callback. +- setpolicy() callback: In this mode, drivers do not provide target_index() +callback, so cpufreq core can't request a transition to a specific frequency. +The driver provides minimum and maximum frequency limits and callbacks to set a +policy. The policy in cpufreq sysfs is referred to as the "scaling governor". +The cpufreq core can request the driver to operate in any of the two policies: +"performance: and "powersave". The driver decides which frequency to use based +on the above policy selection considering minimum and maximum frequency limits. + +The Intel P-State driver falls under the latter category, which implements the +setpolicy() callback. This driver decides what P-State to use based on the +requested policy from the cpufreq core. If the processor is capable of +selecting its next P-State internally, then the driver will offload this +responsibility to the processor (aka HWP: Hardware P-States). If not, the +driver implements algorithms to select the next P-State. + +Since these policies are implemented in the driver, they are not same as the +cpufreq scaling governors implementation, even if they have the same name in +the cpufreq sysfs (scaling_governors). For example the "performance" policy is +similar to cpufreq’s "performance" governor, but "powersave" is completely +different than the cpufreq "powersave" governor. The strategy here is similar +to cpufreq "ondemand", where the requested P-State is related to the system load. + +Sysfs Interface + +In addition to the frequency-controlling interfaces provided by the cpufreq +core, the driver provides its own sysfs files to control the P-State selection. +These files have been added to /sys/devices/system/cpu/intel_pstate/. +Any changes made to these files are applicable to all CPUs (even in a +multi-package system). + + max_perf_pct: Limits the maximum P-State that will be requested by + the driver. It states it as a percentage of the available performance. The + available (P-State) performance may be reduced by the no_turbo setting described below. - min_perf_pct: limits the minimum P state that will be requested by - the driver stated as a percentage of the max (non-turbo) + min_perf_pct: Limits the minimum P-State that will be requested by + the driver. It states it as a percentage of the max (non-turbo) performance level. - no_turbo: limits the driver to selecting P states below the turbo + no_turbo: Limits the driver to selecting P-State below the turbo frequency range. - turbo_pct: displays the percentage of the total performance that - is supported by hardware that is in the turbo range. This number + turbo_pct: Displays the percentage of the total performance that + is supported by hardware that is in the turbo range. This number is independent of whether turbo has been disabled or not. - num_pstates: displays the number of pstates that are supported - by hardware. This number is independent of whether turbo has + num_pstates: Displays the number of P-States that are supported + by hardware. This number is independent of whether turbo has been disabled or not. +For example, if a system has these parameters: + Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State) + Max non turbo ratio: 0x17 + Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio) + +Sysfs will show : + max_perf_pct:100, which corresponds to 1 core ratio + min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio + no_turbo:0, turbo is not disabled + num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1) + turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates + +Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual +Volume 3: System Programming Guide" to understand ratios. + +cpufreq sysfs for Intel P-State + +Since this driver registers with cpufreq, cpufreq sysfs is also presented. +There are some important differences, which need to be considered. + +scaling_cur_freq: This displays the real frequency which was used during +the last sample period instead of what is requested. Some other cpufreq driver, +like acpi-cpufreq, displays what is requested (Some changes are on the +way to fix this for acpi-cpufreq driver). The same is true for frequencies +displayed at /proc/cpuinfo. + +scaling_governor: This displays current active policy. Since each CPU has a +cpufreq sysfs, it is possible to set a scaling governor to each CPU. But this +is not possible with Intel P-States, as there is one common policy for all +CPUs. Here, the last requested policy will be applicable to all CPUs. It is +suggested that one use the cpupower utility to change policy to all CPUs at the +same time. + +scaling_setspeed: This attribute can never be used with Intel P-State. + +scaling_max_freq/scaling_min_freq: This interface can be used similarly to +the max_perf_pct/min_perf_pct of Intel P-State sysfs. However since frequencies +are converted to nearest possible P-State, this is prone to rounding errors. +This method is not preferred to limit performance. + +affected_cpus: Not used +related_cpus: Not used + For contemporary Intel processors, the frequency is controlled by the -processor itself and the P-states exposed to software are related to +processor itself and the P-State exposed to software is related to performance levels. The idea that frequency can be set to a single -frequency is fiction for Intel Core processors. Even if the scaling -driver selects a single P state the actual frequency the processor +frequency is fictional for Intel Core processors. Even if the scaling +driver selects a single P-State, the actual frequency the processor will run at is selected by the processor itself. -For legacy mode debugfs files have also been added to allow tuning of -the internal governor algorythm. These files are located at -/sys/kernel/debug/pstate_snb/ These files are NOT present in HWP mode. +Tuning Intel P-State driver + +When HWP mode is not used, debugfs files have also been added to allow the +tuning of the internal governor algorithm. These files are located at +/sys/kernel/debug/pstate_snb/. The algorithm uses a PID (Proportional +Integral Derivative) controller. The PID tunable parameters are: deadband d_gain_pct @@ -63,3 +133,90 @@ the internal governor algorythm. These files are located at p_gain_pct sample_rate_ms setpoint + +To adjust these parameters, some understanding of driver implementation is +necessary. There are some tweeks described here, but be very careful. Adjusting +them requires expert level understanding of power and performance relationship. +These limits are only useful when the "powersave" policy is active. + +-To make the system more responsive to load changes, sample_rate_ms can +be adjusted (current default is 10ms). +-To make the system use higher performance, even if the load is lower, setpoint +can be adjusted to a lower number. This will also lead to faster ramp up time +to reach the maximum P-State. +If there are no derivative and integral coefficients, The next P-State will be +equal to: + current P-State - ((setpoint - current cpu load) * p_gain_pct) + +For example, if the current PID parameters are (Which are defaults for the core +processors like SandyBridge): + deadband = 0 + d_gain_pct = 0 + i_gain_pct = 0 + p_gain_pct = 20 + sample_rate_ms = 10 + setpoint = 97 + +If the current P-State = 0x08 and current load = 100, this will result in the +next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State +goes up by only 1. If during next sample interval the current load doesn't +change and still 100, then P-State goes up by one again. This process will +continue as long as the load is more than the setpoint until the maximum P-State +is reached. + +For the same load at setpoint = 60, this will result in the next P-State += 0x08 - ((60 - 100) * 0.2) = 16 +So by changing the setpoint from 97 to 60, there is an increase of the +next P-State from 9 to 16. So this will make processor execute at higher +P-State for the same CPU load. If the load continues to be more than the +setpoint during next sample intervals, then P-State will go up again till the +maximum P-State is reached. But the ramp up time to reach the maximum P-State +will be much faster when the setpoint is 60 compared to 97. + +Debugging Intel P-State driver + +Event tracing +To debug P-State transition, the Linux event tracing interface can be used. +There are two specific events, which can be enabled (Provided the kernel +configs related to event tracing are enabled). + +# cd /sys/kernel/debug/tracing/ +# echo 1 > events/power/pstate_sample/enable +# echo 1 > events/power/cpu_frequency/enable +# cat trace +gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 + scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 + freq=2474476 +cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 + + +Using ftrace + +If function level tracing is required, the Linux ftrace interface can be used. +For example if we want to check how often a function to set a P-State is +called, we can set ftrace filter to intel_pstate_set_pstate. + +# cd /sys/kernel/debug/tracing/ +# cat available_filter_functions | grep -i pstate +intel_pstate_set_pstate +intel_pstate_cpu_init +... + +# echo intel_pstate_set_pstate > set_ftrace_filter +# echo function > current_tracer +# cat trace | head -15 +# tracer: function +# +# entries-in-buffer/entries-written: 80/80 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func + gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func + gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func + <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func |