diff options
Diffstat (limited to 'Documentation')
97 files changed, 4235 insertions, 949 deletions
diff --git a/Documentation/ABI/testing/sysfs-bus-mei b/Documentation/ABI/testing/sysfs-bus-mei new file mode 100644 index 000000000000..2066f0bbd453 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-mei @@ -0,0 +1,7 @@ +What: /sys/bus/mei/devices/.../modalias +Date: March 2013 +KernelVersion: 3.10 +Contact: Samuel Ortiz <sameo@linux.intel.com> + linux-mei@linux.intel.com +Description: Stores the same MODALIAS value emitted by uevent + Format: mei:<mei device name> diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb index c8baaf53594a..f093e59cbe5f 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb +++ b/Documentation/ABI/testing/sysfs-bus-usb @@ -32,7 +32,7 @@ Date: January 2008 KernelVersion: 2.6.25 Contact: Sarah Sharp <sarah.a.sharp@intel.com> Description: - If CONFIG_PM and CONFIG_USB_SUSPEND are enabled, then this file + If CONFIG_PM_RUNTIME is enabled then this file is present. When read, it returns the total time (in msec) that the USB device has been connected to the machine. This file is read-only. @@ -45,7 +45,7 @@ Date: January 2008 KernelVersion: 2.6.25 Contact: Sarah Sharp <sarah.a.sharp@intel.com> Description: - If CONFIG_PM and CONFIG_USB_SUSPEND are enabled, then this file + If CONFIG_PM_RUNTIME is enabled then this file is present. When read, it returns the total time (in msec) that the USB device has been active, i.e. not in a suspended state. This file is read-only. @@ -187,7 +187,7 @@ What: /sys/bus/usb/devices/.../power/usb2_hardware_lpm Date: September 2011 Contact: Andiry Xu <andiry.xu@amd.com> Description: - If CONFIG_USB_SUSPEND is set and a USB 2.0 lpm-capable device + If CONFIG_PM_RUNTIME is set and a USB 2.0 lpm-capable device is plugged in to a xHCI host which support link PM, it will perform a LPM test; if the test is passed and host supports USB2 hardware LPM (xHCI 1.0 feature), USB2 hardware LPM will diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 9c978dcae07d..2447698aed41 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -173,3 +173,15 @@ Description: Processor frequency boosting control Boosting allows the CPU and the firmware to run at a frequency beyound it's nominal limit. More details can be found in Documentation/cpu-freq/boost.txt + + +What: /sys/devices/system/cpu/cpu#/crash_notes + /sys/devices/system/cpu/cpu#/crash_notes_size +Date: April 2013 +Contact: kexec@lists.infradead.org +Description: address and size of the percpu note. + + crash_notes: the physical address of the memory that holds the + note of cpu#. + + crash_notes_size: size of the note of cpu#. diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl index 7514dbf0a679..c36892c072da 100644 --- a/Documentation/DocBook/device-drivers.tmpl +++ b/Documentation/DocBook/device-drivers.tmpl @@ -227,7 +227,7 @@ X!Isound/sound_firmware.c <chapter id="uart16x50"> <title>16x50 UART Driver</title> !Edrivers/tty/serial/serial_core.c -!Edrivers/tty/serial/8250/8250.c +!Edrivers/tty/serial/8250/8250_core.c </chapter> <chapter id="fbdev"> diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 31ef8fe07f82..79e789b8b8ea 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -217,9 +217,14 @@ over a rather long period of time, but improvements are always welcome! whether the increased speed is worth it. 8. Although synchronize_rcu() is slower than is call_rcu(), it - usually results in simpler code. So, unless update performance - is critically important or the updaters cannot block, - synchronize_rcu() should be used in preference to call_rcu(). + usually results in simpler code. So, unless update performance is + critically important, the updaters cannot block, or the latency of + synchronize_rcu() is visible from userspace, synchronize_rcu() + should be used in preference to call_rcu(). Furthermore, + kfree_rcu() usually results in even simpler code than does + synchronize_rcu() without synchronize_rcu()'s multi-millisecond + latency. So please take advantage of kfree_rcu()'s "fire and + forget" memory-freeing capabilities where it applies. An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods @@ -268,7 +273,8 @@ over a rather long period of time, but improvements are always welcome! e. Periodically invoke synchronize_rcu(), permitting a limited number of updates per grace period. - The same cautions apply to call_rcu_bh() and call_rcu_sched(). + The same cautions apply to call_rcu_bh(), call_rcu_sched(), + call_srcu(), and kfree_rcu(). 9. All RCU list-traversal primitives, which include rcu_dereference(), list_for_each_entry_rcu(), and @@ -296,9 +302,9 @@ over a rather long period of time, but improvements are always welcome! all currently executing rcu_read_lock()-protected RCU read-side critical sections complete. It does -not- necessarily guarantee that all currently running interrupts, NMIs, preempt_disable() - code, or idle loops will complete. Therefore, if you do not have - rcu_read_lock()-protected read-side critical sections, do -not- - use synchronize_rcu(). + code, or idle loops will complete. Therefore, if your + read-side critical sections are protected by something other + than rcu_read_lock(), do -not- use synchronize_rcu(). Similarly, disabling preemption is not an acceptable substitute for rcu_read_lock(). Code that attempts to use preemption @@ -401,9 +407,9 @@ over a rather long period of time, but improvements are always welcome! read-side critical sections. It is the responsibility of the RCU update-side primitives to deal with this. -17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and - the __rcu sparse checks to validate your RCU code. These - can help find problems as follows: +17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the + __rcu sparse checks (enabled by CONFIG_SPARSE_RCU_POINTER) to + validate your RCU code. These can help find problems as follows: CONFIG_PROVE_RCU: check that accesses to RCU-protected data structures are carried out under the proper RCU diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt index a102d4b3724b..cd83d2348fef 100644 --- a/Documentation/RCU/lockdep.txt +++ b/Documentation/RCU/lockdep.txt @@ -64,6 +64,11 @@ checking of rcu_dereference() primitives: but retain the compiler constraints that prevent duplicating or coalescsing. This is useful when when testing the value of the pointer itself, for example, against NULL. + rcu_access_index(idx): + Return the value of the index and omit all barriers, but + retain the compiler constraints that prevent duplicating + or coalescsing. This is useful when when testing the + value of the index itself, for example, against -1. The rcu_dereference_check() check expression can be any boolean expression, but would normally include a lockdep expression. However, diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt index 38428c125135..2e319d1b9ef2 100644 --- a/Documentation/RCU/rcubarrier.txt +++ b/Documentation/RCU/rcubarrier.txt @@ -79,7 +79,20 @@ complete. Pseudo-code using rcu_barrier() is as follows: 2. Execute rcu_barrier(). 3. Allow the module to be unloaded. -The rcutorture module makes use of rcu_barrier in its exit function +There are also rcu_barrier_bh(), rcu_barrier_sched(), and srcu_barrier() +functions for the other flavors of RCU, and you of course must match +the flavor of rcu_barrier() with that of call_rcu(). If your module +uses multiple flavors of call_rcu(), then it must also use multiple +flavors of rcu_barrier() when unloading that module. For example, if +it uses call_rcu_bh(), call_srcu() on srcu_struct_1, and call_srcu() on +srcu_struct_2(), then the following three lines of code will be required +when unloading: + + 1 rcu_barrier_bh(); + 2 srcu_barrier(&srcu_struct_1); + 3 srcu_barrier(&srcu_struct_2); + +The rcutorture module makes use of rcu_barrier() in its exit function as follows: 1 static void diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index b336755b71ed..8e9359de1d28 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -92,14 +92,14 @@ If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set, more information is printed with the stall-warning message, for example: INFO: rcu_preempt detected stall on CPU - 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 + 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543 (t=65000 jiffies) In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is printed: INFO: rcu_preempt detected stall on CPU - 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 drain=0 . timer not pending + 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D (t=65000 jiffies) The "(64628 ticks this GP)" indicates that this CPU has taken more @@ -116,13 +116,28 @@ number between the two "/"s is the value of the nesting, which will be a small positive number if in the idle loop and a very large positive number (as shown above) otherwise. -For CONFIG_RCU_FAST_NO_HZ kernels, the "drain=0" indicates that the CPU is -not in the process of trying to force itself into dyntick-idle state, the -"." indicates that the CPU has not given up forcing RCU into dyntick-idle -mode (it would be "H" otherwise), and the "timer not pending" indicates -that the CPU has not recently forced RCU into dyntick-idle mode (it -would otherwise indicate the number of microseconds remaining in this -forced state). +The "softirq=" portion of the message tracks the number of RCU softirq +handlers that the stalled CPU has executed. The number before the "/" +is the number that had executed since boot at the time that this CPU +last noted the beginning of a grace period, which might be the current +(stalled) grace period, or it might be some earlier grace period (for +example, if the CPU might have been in dyntick-idle mode for an extended +time period. The number after the "/" is the number that have executed +since boot until the current time. If this latter number stays constant +across repeated stall-warning messages, it is possible that RCU's softirq +handlers are no longer able to execute on this CPU. This can happen if +the stalled CPU is spinning with interrupts are disabled, or, in -rt +kernels, if a high-priority process is starving RCU's softirq handler. + +For CONFIG_RCU_FAST_NO_HZ kernels, the "last_accelerate:" prints the +low-order 16 bits (in hex) of the jiffies counter when this CPU last +invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked +rcu_accelerate_cbs() from rcu_prepare_for_idle(). The "nonlazy_posted:" +prints the number of non-lazy callbacks posted since the last call to +rcu_needs_cpu(). Finally, an "L" indicates that there are currently +no non-lazy callbacks ("." is printed otherwise, as shown above) and +"D" indicates that dyntick-idle processing is enabled ("." is printed +otherwise, for example, if disabled via the "nohz=" kernel boot parameter). Multiple Warnings From One Stall diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 0cc7820967f4..10df0b82f459 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -265,9 +265,9 @@ rcu_dereference() rcu_read_lock(); p = rcu_dereference(head.next); rcu_read_unlock(); - x = p->address; + x = p->address; /* BUG!!! */ rcu_read_lock(); - y = p->data; + y = p->data; /* BUG!!! */ rcu_read_unlock(); Holding a reference from one RCU read-side critical section diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index c379a2a6949f..6e97e73d87b5 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -60,8 +60,7 @@ own source tree. For example: "dontdiff" is a list of files which are generated by the kernel during the build process, and should be ignored in any diff(1)-generated patch. The "dontdiff" file is included in the kernel tree in -2.6.12 and later. For earlier kernel versions, you can get it -from <http://www.xenotime.net/linux/doc/dontdiff>. +2.6.12 and later. Make sure your patch does not include any extra files which do not belong in a patch submission. Make sure to review your patch -after- @@ -421,7 +420,7 @@ person it names. This tag documents that potentially interested parties have been included in the discussion -14) Using Reported-by:, Tested-by: and Reviewed-by: +14) Using Reported-by:, Tested-by:, Reviewed-by: and Suggested-by: If this patch fixes a problem reported by somebody else, consider adding a Reported-by: tag to credit the reporter for their contribution. Please @@ -469,6 +468,13 @@ done on the patch. Reviewed-by: tags, when supplied by reviewers known to understand the subject area and to perform thorough reviews, will normally increase the likelihood of your patch getting into the kernel. +A Suggested-by: tag indicates that the patch idea is suggested by the person +named and ensures credit to the person for the idea. Please note that this +tag should not be added without the reporter's permission, especially if the +idea was not posted in a public forum. That said, if we diligently credit our +idea reporters, they will, hopefully, be inspired to help us again in the +future. + 15) The canonical patch format diff --git a/Documentation/arm/sunxi/clocks.txt b/Documentation/arm/sunxi/clocks.txt new file mode 100644 index 000000000000..e09a88aa3136 --- /dev/null +++ b/Documentation/arm/sunxi/clocks.txt @@ -0,0 +1,56 @@ +Frequently asked questions about the sunxi clock system +======================================================= + +This document contains useful bits of information that people tend to ask +about the sunxi clock system, as well as accompanying ASCII art when adequate. + +Q: Why is the main 24MHz oscillator gatable? Wouldn't that break the + system? + +A: The 24MHz oscillator allows gating to save power. Indeed, if gated + carelessly the system would stop functioning, but with the right + steps, one can gate it and keep the system running. Consider this + simplified suspend example: + + While the system is operational, you would see something like + + 24MHz 32kHz + | + PLL1 + \ + \_ CPU Mux + | + [CPU] + + When you are about to suspend, you switch the CPU Mux to the 32kHz + oscillator: + + 24Mhz 32kHz + | | + PLL1 | + / + CPU Mux _/ + | + [CPU] + + Finally you can gate the main oscillator + + 32kHz + | + | + / + CPU Mux _/ + | + [CPU] + +Q: Were can I learn more about the sunxi clocks? + +A: The linux-sunxi wiki contains a page documenting the clock registers, + you can find it at + + http://linux-sunxi.org/A10/CCM + + The authoritative source for information at this time is the ccmu driver + released by Allwinner, you can find it at + + https://github.com/linux-sunxi/linux-sunxi/tree/sunxi-3.0/arch/arm/mach-sun4i/clock/ccmu diff --git a/Documentation/backlight/lp855x-driver.txt b/Documentation/backlight/lp855x-driver.txt index 18b06ca038ea..1c732f0c6758 100644 --- a/Documentation/backlight/lp855x-driver.txt +++ b/Documentation/backlight/lp855x-driver.txt @@ -32,14 +32,10 @@ Platform data for lp855x For supporting platform specific data, the lp855x platform data can be used. * name : Backlight driver name. If it is not defined, default name is set. -* mode : Brightness control mode. PWM or register based. * device_control : Value of DEVICE CONTROL register. * initial_brightness : Initial value of backlight brightness. * period_ns : Platform specific PWM period value. unit is nano. Only valid when brightness is pwm input mode. -* load_new_rom_data : - 0 : use default configuration data - 1 : update values of eeprom or eprom registers on loading driver * size_program : Total size of lp855x_rom_data. * rom_data : List of new eeprom/eprom registers. @@ -54,10 +50,8 @@ static struct lp855x_rom_data lp8552_eeprom_arr[] = { static struct lp855x_platform_data lp8552_pdata = { .name = "lcd-bl", - .mode = REGISTER_BASED, .device_control = I2C_CONFIG(LP8552), .initial_brightness = INITIAL_BRT, - .load_new_rom_data = 1, .size_program = ARRAY_SIZE(lp8552_eeprom_arr), .rom_data = lp8552_eeprom_arr, }; @@ -65,7 +59,6 @@ static struct lp855x_platform_data lp8552_pdata = { example 2) lp8556 platform data : pwm input mode with default rom data static struct lp855x_platform_data lp8556_pdata = { - .mode = PWM_BASED, .device_control = PWM_CONFIG(LP8556), .initial_brightness = INITIAL_BRT, .period_ns = 1000000, diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index bcf1a00b06a1..638bf17ff869 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -442,7 +442,7 @@ You can attach the current shell task by echoing 0: You can use the cgroup.procs file instead of the tasks file to move all threads in a threadgroup at once. Echoing the PID of any task in a threadgroup to cgroup.procs causes all tasks in that threadgroup to be -be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks +attached to the cgroup. Writing 0 to cgroup.procs moves all tasks in the writing task's threadgroup. Note: Since every task is always a member of exactly one cgroup in each @@ -580,6 +580,7 @@ propagation along the hierarchy. See the comment on cgroup_for_each_descendant_pre() for details. void css_offline(struct cgroup *cgrp); +(cgroup_mutex held by caller) This is the counterpart of css_online() and called iff css_online() has succeeded on @cgrp. This signifies the beginning of the end of diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt index 16624a7f8222..3c1095ca02ea 100644 --- a/Documentation/cgroups/devices.txt +++ b/Documentation/cgroups/devices.txt @@ -13,9 +13,7 @@ either an integer or * for all. Access is a composition of r The root device cgroup starts with rwm to 'all'. A child device cgroup gets a copy of the parent. Administrators can then remove devices from the whitelist or add new entries. A child cgroup can -never receive a device access which is denied by its parent. However -when a device access is removed from a parent it will not also be -removed from the child(ren). +never receive a device access which is denied by its parent. 2. User Interface @@ -50,3 +48,69 @@ task to a new cgroup. (Again we'll probably want to change that). A cgroup may not be granted more permissions than the cgroup's parent has. + +4. Hierarchy + +device cgroups maintain hierarchy by making sure a cgroup never has more +access permissions than its parent. Every time an entry is written to +a cgroup's devices.deny file, all its children will have that entry removed +from their whitelist and all the locally set whitelist entries will be +re-evaluated. In case one of the locally set whitelist entries would provide +more access than the cgroup's parent, it'll be removed from the whitelist. + +Example: + A + / \ + B + + group behavior exceptions + A allow "b 8:* rwm", "c 116:1 rw" + B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" + +If a device is denied in group A: + # echo "c 116:* r" > A/devices.deny +it'll propagate down and after revalidating B's entries, the whitelist entry +"c 116:2 rwm" will be removed: + + group whitelist entries denied devices + A all "b 8:* rwm", "c 116:* rw" + B "c 1:3 rwm", "b 3:* rwm" all the rest + +In case parent's exceptions change and local exceptions are not allowed +anymore, they'll be deleted. + +Notice that new whitelist entries will not be propagated: + A + / \ + B + + group whitelist entries denied devices + A "c 1:3 rwm", "c 1:5 r" all the rest + B "c 1:3 rwm", "c 1:5 r" all the rest + +when adding "c *:3 rwm": + # echo "c *:3 rwm" >A/devices.allow + +the result: + group whitelist entries denied devices + A "c *:3 rwm", "c 1:5 r" all the rest + B "c 1:3 rwm", "c 1:5 r" all the rest + +but now it'll be possible to add new entries to B: + # echo "c 2:3 rwm" >B/devices.allow + # echo "c 50:3 r" >B/devices.allow +or even + # echo "c *:3 rwm" >B/devices.allow + +Allowing or denying all by writing 'a' to devices.allow or devices.deny will +not be possible once the device cgroups has children. + +4.1 Hierarchy (internal implementation) + +device cgroups is implemented internally using a behavior (ALLOW, DENY) and a +list of exceptions. The internal state is controlled using the same user +interface to preserve compatibility with the previous whitelist-only +implementation. Removal or addition of exceptions that will reduce the access +to devices will be propagated down the hierarchy. +For every propagated exception, the effective rules will be re-evaluated based +on current parent's access rules. diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 8b8c28b9864c..f336ede58e62 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -40,6 +40,7 @@ Features: - soft limit - moving (recharging) account at moving a task is selectable. - usage threshold notifier + - memory pressure notifier - oom-killer disable knob and oom-notifier - Root cgroup has no limit controls. @@ -65,6 +66,7 @@ Brief summary of control files. memory.stat # show various statistics memory.use_hierarchy # set/show hierarchical account enabled memory.force_empty # trigger forced move charge to parent + memory.pressure_level # set memory pressure notifications memory.swappiness # set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges @@ -762,7 +764,73 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) -11. TODO +11. Memory Pressure + +The pressure level notifications can be used to monitor the memory +allocation cost; based on the pressure, applications can implement +different strategies of managing their memory resources. The pressure +levels are defined as following: + +The "low" level means that the system is reclaiming memory for new +allocations. Monitoring this reclaiming activity might be useful for +maintaining cache level. Upon notification, the program (typically +"Activity Manager") might analyze vmstat and act in advance (i.e. +prematurely shutdown unimportant services). + +The "medium" level means that the system is experiencing medium memory +pressure, the system might be making swap, paging out active file caches, +etc. Upon this event applications may decide to further analyze +vmstat/zoneinfo/memcg or internal memory usage statistics and free any +resources that can be easily reconstructed or re-read from a disk. + +The "critical" level means that the system is actively thrashing, it is +about to out of memory (OOM) or even the in-kernel OOM killer is on its +way to trigger. Applications should do whatever they can to help the +system. It might be too late to consult with vmstat or any other +statistics, so it's advisable to take an immediate action. + +The events are propagated upward until the event is handled, i.e. the +events are not pass-through. Here is what this means: for example you have +three cgroups: A->B->C. Now you set up an event listener on cgroups A, B +and C, and suppose group C experiences some pressure. In this situation, +only group C will receive the notification, i.e. groups A and B will not +receive it. This is done to avoid excessive "broadcasting" of messages, +which disturbs the system and which is especially bad if we are low on +memory or thrashing. So, organize the cgroups wisely, or propagate the +events manually (or, ask us to implement the pass-through events, +explaining why would you need them.) + +The file memory.pressure_level is only used to setup an eventfd. To +register a notification, an application must: + +- create an eventfd using eventfd(2); +- open memory.pressure_level; +- write string like "<event_fd> <fd of memory.pressure_level> <level>" + to cgroup.event_control. + +Application will be notified through eventfd when memory pressure is at +the specific level (or higher). Read/write operations to +memory.pressure_level are no implemented. + +Test: + + Here is a small script example that makes a new cgroup, sets up a + memory limit, sets up a notification in the cgroup and then makes child + cgroup experience a critical pressure: + + # cd /sys/fs/cgroup/memory/ + # mkdir foo + # cd foo + # cgroup_event_listener memory.pressure_level low & + # echo 8000000 > memory.limit_in_bytes + # echo 8000000 > memory.memsw.limit_in_bytes + # echo $$ > tasks + # dd if=/dev/zero | read x + + (Expect a bunch of notifications, and eventually, the oom-killer will + trigger.) + +12. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first diff --git a/Documentation/clk.txt b/Documentation/clk.txt index 1943fae014fd..b9911c27f496 100644 --- a/Documentation/clk.txt +++ b/Documentation/clk.txt @@ -174,9 +174,9 @@ int clk_foo_enable(struct clk_hw *hw) }; Below is a matrix detailing which clk_ops are mandatory based upon the -hardware capbilities of that clock. A cell marked as "y" means +hardware capabilities of that clock. A cell marked as "y" means mandatory, a cell marked as "n" implies that either including that -callback is invalid or otherwise uneccesary. Empty cells are either +callback is invalid or otherwise unnecessary. Empty cells are either optional or must be evaluated on a case-by-case basis. clock hardware characteristics @@ -231,3 +231,14 @@ To better enforce this policy, always follow this simple rule: any statically initialized clock data MUST be defined in a separate file from the logic that implements its ops. Basically separate the logic from the data and all is well. + + Part 6 - Disabling clock gating of unused clocks + +Sometimes during development it can be useful to be able to bypass the +default disabling of unused clocks. For example, if drivers aren't enabling +clocks properly but rely on them being on from the bootloader, bypassing +the disabling means that the driver will remain functional while the issues +are sorted out. + +To bypass this disabling, include "clk_ignore_unused" in the bootargs to the +kernel. diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt index 56fb62b09fc5..b428556197c9 100644 --- a/Documentation/device-mapper/dm-raid.txt +++ b/Documentation/device-mapper/dm-raid.txt @@ -30,6 +30,7 @@ The target is named "raid" and it accepts the following parameters: raid10 Various RAID10 inspired algorithms chosen by additional params - RAID10: Striped Mirrors (aka 'Striping on top of mirrors') - RAID1E: Integrated Adjacent Stripe Mirroring + - RAID1E: Integrated Offset Stripe Mirroring - and other similar RAID10 variants Reference: Chapter 4 of @@ -64,15 +65,15 @@ The target is named "raid" and it accepts the following parameters: synchronisation state for each region. [raid10_copies <# copies>] - [raid10_format near] + [raid10_format <near|far|offset>] These two options are used to alter the default layout of a RAID10 configuration. The number of copies is can be - specified, but the default is 2. There are other variations - to how the copies are laid down - the default and only current - option is "near". Near copies are what most people think of - with respect to mirroring. If these options are left - unspecified, or 'raid10_copies 2' and/or 'raid10_format near' - are given, then the layouts for 2, 3 and 4 devices are: + specified, but the default is 2. There are also three + variations to how the copies are laid down - the default + is "near". Near copies are what most people think of with + respect to mirroring. If these options are left unspecified, + or 'raid10_copies 2' and/or 'raid10_format near' are given, + then the layouts for 2, 3 and 4 devices are: 2 drives 3 drives 4 drives -------- ---------- -------------- A1 A1 A1 A1 A2 A1 A1 A2 A2 @@ -85,6 +86,33 @@ The target is named "raid" and it accepts the following parameters: 3-device layout is what might be called a 'RAID1E - Integrated Adjacent Stripe Mirroring'. + If 'raid10_copies 2' and 'raid10_format far', then the layouts + for 2, 3 and 4 devices are: + 2 drives 3 drives 4 drives + -------- -------------- -------------------- + A1 A2 A1 A2 A3 A1 A2 A3 A4 + A3 A4 A4 A5 A6 A5 A6 A7 A8 + A5 A6 A7 A8 A9 A9 A10 A11 A12 + .. .. .. .. .. .. .. .. .. + A2 A1 A3 A1 A2 A2 A1 A4 A3 + A4 A3 A6 A4 A5 A6 A5 A8 A7 + A6 A5 A9 A7 A8 A10 A9 A12 A11 + .. .. .. .. .. .. .. .. .. + + If 'raid10_copies 2' and 'raid10_format offset', then the + layouts for 2, 3 and 4 devices are: + 2 drives 3 drives 4 drives + -------- ------------ ----------------- + A1 A2 A1 A2 A3 A1 A2 A3 A4 + A2 A1 A3 A1 A2 A2 A1 A4 A3 + A3 A4 A4 A5 A6 A5 A6 A7 A8 + A4 A3 A6 A4 A5 A6 A5 A8 A7 + A5 A6 A7 A8 A9 A9 A10 A11 A12 + A6 A5 A9 A7 A8 A10 A9 A12 A11 + .. .. .. .. .. .. .. .. .. + Here we see layouts closely akin to 'RAID1E - Integrated + Offset Stripe Mirroring'. + <#raid_devs>: The number of devices composing the array. Each device consists of two entries. The first is the device containing the metadata (if any); the second is the one containing the @@ -142,3 +170,5 @@ Version History 1.3.0 Added support for RAID 10 1.3.1 Allow device replacement/rebuild for RAID 10 1.3.2 Fix/improve redundancy checking for RAID10 +1.4.0 Non-functional change. Removes arg from mapping function. +1.4.1 Add RAID10 "far" and "offset" algorithm support. diff --git a/Documentation/devicetree/bindings/arm/atmel-adc.txt b/Documentation/devicetree/bindings/arm/atmel-adc.txt index c63097d6afeb..16769d9cedd6 100644 --- a/Documentation/devicetree/bindings/arm/atmel-adc.txt +++ b/Documentation/devicetree/bindings/arm/atmel-adc.txt @@ -14,9 +14,19 @@ Required properties: - atmel,adc-status-register: Offset of the Interrupt Status Register - atmel,adc-trigger-register: Offset of the Trigger Register - atmel,adc-vref: Reference voltage in millivolts for the conversions + - atmel,adc-res: List of resolution in bits supported by the ADC. List size + must be two at least. + - atmel,adc-res-names: Contains one identifier string for each resolution + in atmel,adc-res property. "lowres" and "highres" + identifiers are required. Optional properties: - atmel,adc-use-external: Boolean to enable of external triggers + - atmel,adc-use-res: String corresponding to an identifier from + atmel,adc-res-names property. If not specified, the highest + resolution will be used. + - atmel,adc-sleep-mode: Boolean to enable sleep mode when no conversion + - atmel,adc-sample-hold-time: Sample and Hold Time in microseconds Optional trigger Nodes: - Required properties: @@ -40,6 +50,9 @@ adc0: adc@fffb0000 { atmel,adc-trigger-register = <0x08>; atmel,adc-use-external; atmel,adc-vref = <3300>; + atmel,adc-res = <8 10>; + atmel,adc-res-names = "lowres", "highres"; + atmel,adc-use-res = "lowres"; trigger@0 { trigger-name = "external-rising"; diff --git a/Documentation/devicetree/bindings/arm/msm/ssbi.txt b/Documentation/devicetree/bindings/arm/msm/ssbi.txt new file mode 100644 index 000000000000..54fd5ced3401 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/msm/ssbi.txt @@ -0,0 +1,18 @@ +* Qualcomm SSBI + +Some Qualcomm MSM devices contain a point-to-point serial bus used to +communicate with a limited range of devices (mostly power management +chips). + +These require the following properties: + +- compatible: "qcom,ssbi" + +- qcom,controller-type + indicates the SSBI bus variant the controller should use to talk + with the slave device. This should be one of "ssbi", "ssbi2", or + "pmic-arbiter". The type chosen is determined by the attached + slave. + +The slave device should be the single child node of the ssbi device +with a compatible field. diff --git a/Documentation/devicetree/bindings/arm/samsung/exynos-adc.txt b/Documentation/devicetree/bindings/arm/samsung/exynos-adc.txt new file mode 100644 index 000000000000..47ada1dff216 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/samsung/exynos-adc.txt @@ -0,0 +1,60 @@ +Samsung Exynos Analog to Digital Converter bindings + +The devicetree bindings are for the new ADC driver written for +Exynos4 and upward SoCs from Samsung. + +New driver handles the following +1. Supports ADC IF found on EXYNOS4412/EXYNOS5250 + and future SoCs from Samsung +2. Add ADC driver under iio/adc framework +3. Also adds the Documentation for device tree bindings + +Required properties: +- compatible: Must be "samsung,exynos-adc-v1" + for exynos4412/5250 controllers. + Must be "samsung,exynos-adc-v2" for + future controllers. +- reg: Contains ADC register address range (base address and + length) and the address of the phy enable register. +- interrupts: Contains the interrupt information for the timer. The + format is being dependent on which interrupt controller + the Samsung device uses. +- #io-channel-cells = <1>; As ADC has multiple outputs +- clocks From common clock binding: handle to adc clock. +- clock-names From common clock binding: Shall be "adc". +- vdd-supply VDD input supply. + +Note: child nodes can be added for auto probing from device tree. + +Example: adding device info in dtsi file + +adc: adc@12D10000 { + compatible = "samsung,exynos-adc-v1"; + reg = <0x12D10000 0x100>, <0x10040718 0x4>; + interrupts = <0 106 0>; + #io-channel-cells = <1>; + io-channel-ranges; + + clocks = <&clock 303>; + clock-names = "adc"; + + vdd-supply = <&buck5_reg>; +}; + + +Example: Adding child nodes in dts file + +adc@12D10000 { + + /* NTC thermistor is a hwmon device */ + ncp15wb473@0 { + compatible = "ntc,ncp15wb473"; + pullup-uV = <1800000>; + pullup-ohm = <47000>; + pulldown-ohm = <0>; + io-channels = <&adc 4>; + }; +}; + +Note: Does not apply to ADC driver under arch/arm/plat-samsung/ +Note: The child node can be added under the adc node or separately. diff --git a/Documentation/devicetree/bindings/clock/axi-clkgen.txt b/Documentation/devicetree/bindings/clock/axi-clkgen.txt new file mode 100644 index 000000000000..028b493e97ff --- /dev/null +++ b/Documentation/devicetree/bindings/clock/axi-clkgen.txt @@ -0,0 +1,22 @@ +Binding for the axi-clkgen clock generator + +This binding uses the common clock binding[1]. + +[1] Documentation/devicetree/bindings/clock/clock-bindings.txt + +Required properties: +- compatible : shall be "adi,axi-clkgen". +- #clock-cells : from common clock binding; Should always be set to 0. +- reg : Address and length of the axi-clkgen register set. +- clocks : Phandle and clock specifier for the parent clock. + +Optional properties: +- clock-output-names : From common clock binding. + +Example: + clock@0xff000000 { + compatible = "adi,axi-clkgen"; + #clock-cells = <0>; + reg = <0xff000000 0x1000>; + clocks = <&osc 1>; + }; diff --git a/Documentation/devicetree/bindings/clock/fixed-factor-clock.txt b/Documentation/devicetree/bindings/clock/fixed-factor-clock.txt new file mode 100644 index 000000000000..5757f9abfc26 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/fixed-factor-clock.txt @@ -0,0 +1,24 @@ +Binding for simple fixed factor rate clock sources. + +This binding uses the common clock binding[1]. + +[1] Documentation/devicetree/bindings/clock/clock-bindings.txt + +Required properties: +- compatible : shall be "fixed-factor-clock". +- #clock-cells : from common clock binding; shall be set to 0. +- clock-div: fixed divider. +- clock-mult: fixed multiplier. +- clocks: parent clock. + +Optional properties: +- clock-output-names : From common clock binding. + +Example: + clock { + compatible = "fixed-factor-clock"; + clocks = <&parentclk>; + #clock-cells = <0>; + div = <2>; + mult = <1>; + }; diff --git a/Documentation/devicetree/bindings/clock/silabs,si5351.txt b/Documentation/devicetree/bindings/clock/silabs,si5351.txt new file mode 100644 index 000000000000..cc374651662c --- /dev/null +++ b/Documentation/devicetree/bindings/clock/silabs,si5351.txt @@ -0,0 +1,114 @@ +Binding for Silicon Labs Si5351a/b/c programmable i2c clock generator. + +Reference +[1] Si5351A/B/C Data Sheet + http://www.silabs.com/Support%20Documents/TechnicalDocs/Si5351.pdf + +The Si5351a/b/c are programmable i2c clock generators with upto 8 output +clocks. Si5351a also has a reduced pin-count package (MSOP10) where only +3 output clocks are accessible. The internal structure of the clock +generators can be found in [1]. + +==I2C device node== + +Required properties: +- compatible: shall be one of "silabs,si5351{a,a-msop,b,c}". +- reg: i2c device address, shall be 0x60 or 0x61. +- #clock-cells: from common clock binding; shall be set to 1. +- clocks: from common clock binding; list of parent clock + handles, shall be xtal reference clock or xtal and clkin for + si5351c only. +- #address-cells: shall be set to 1. +- #size-cells: shall be set to 0. + +Optional properties: +- silabs,pll-source: pair of (number, source) for each pll. Allows + to overwrite clock source of pll A (number=0) or B (number=1). + +==Child nodes== + +Each of the clock outputs can be overwritten individually by +using a child node to the I2C device node. If a child node for a clock +output is not set, the eeprom configuration is not overwritten. + +Required child node properties: +- reg: number of clock output. + +Optional child node properties: +- silabs,clock-source: source clock of the output divider stage N, shall be + 0 = multisynth N + 1 = multisynth 0 for output clocks 0-3, else multisynth4 + 2 = xtal + 3 = clkin (si5351c only) +- silabs,drive-strength: output drive strength in mA, shall be one of {2,4,6,8}. +- silabs,multisynth-source: source pll A(0) or B(1) of corresponding multisynth + divider. +- silabs,pll-master: boolean, multisynth can change pll frequency. + +==Example== + +/* 25MHz reference crystal */ +ref25: ref25M { + compatible = "fixed-clock"; + #clock-cells = <0>; + clock-frequency = <25000000>; +}; + +i2c-master-node { + + /* Si5351a msop10 i2c clock generator */ + si5351a: clock-generator@60 { + compatible = "silabs,si5351a-msop"; + reg = <0x60>; + #address-cells = <1>; + #size-cells = <0>; + #clock-cells = <1>; + + /* connect xtal input to 25MHz reference */ + clocks = <&ref25>; + + /* connect xtal input as source of pll0 and pll1 */ + silabs,pll-source = <0 0>, <1 0>; + + /* + * overwrite clkout0 configuration with: + * - 8mA output drive strength + * - pll0 as clock source of multisynth0 + * - multisynth0 as clock source of output divider + * - multisynth0 can change pll0 + * - set initial clock frequency of 74.25MHz + */ + clkout0 { + reg = <0>; + silabs,drive-strength = <8>; + silabs,multisynth-source = <0>; + silabs,clock-source = <0>; + silabs,pll-master; + clock-frequency = <74250000>; + }; + + /* + * overwrite clkout1 configuration with: + * - 4mA output drive strength + * - pll1 as clock source of multisynth1 + * - multisynth1 as clock source of output divider + * - multisynth1 can change pll1 + */ + clkout1 { + reg = <1>; + silabs,drive-strength = <4>; + silabs,multisynth-source = <1>; + silabs,clock-source = <0>; + pll-master; + }; + + /* + * overwrite clkout2 configuration with: + * - xtal as clock source of output divider + */ + clkout2 { + reg = <2>; + silabs,clock-source = <2>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/clock/sunxi.txt b/Documentation/devicetree/bindings/clock/sunxi.txt new file mode 100644 index 000000000000..729f52426fe1 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/sunxi.txt @@ -0,0 +1,151 @@ +Device Tree Clock bindings for arch-sunxi + +This binding uses the common clock binding[1]. + +[1] Documentation/devicetree/bindings/clock/clock-bindings.txt + +Required properties: +- compatible : shall be one of the following: + "allwinner,sun4i-osc-clk" - for a gatable oscillator + "allwinner,sun4i-pll1-clk" - for the main PLL clock + "allwinner,sun4i-cpu-clk" - for the CPU multiplexer clock + "allwinner,sun4i-axi-clk" - for the AXI clock + "allwinner,sun4i-axi-gates-clk" - for the AXI gates + "allwinner,sun4i-ahb-clk" - for the AHB clock + "allwinner,sun4i-ahb-gates-clk" - for the AHB gates + "allwinner,sun4i-apb0-clk" - for the APB0 clock + "allwinner,sun4i-apb0-gates-clk" - for the APB0 gates + "allwinner,sun4i-apb1-clk" - for the APB1 clock + "allwinner,sun4i-apb1-mux-clk" - for the APB1 clock muxing + "allwinner,sun4i-apb1-gates-clk" - for the APB1 gates + +Required properties for all clocks: +- reg : shall be the control register address for the clock. +- clocks : shall be the input parent clock(s) phandle for the clock +- #clock-cells : from common clock binding; shall be set to 0 except for + "allwinner,sun4i-*-gates-clk" where it shall be set to 1 + +Additionally, "allwinner,sun4i-*-gates-clk" clocks require: +- clock-output-names : the corresponding gate names that the clock controls + +For example: + +osc24M: osc24M@01c20050 { + #clock-cells = <0>; + compatible = "allwinner,sun4i-osc-clk"; + reg = <0x01c20050 0x4>; + clocks = <&osc24M_fixed>; +}; + +pll1: pll1@01c20000 { + #clock-cells = <0>; + compatible = "allwinner,sun4i-pll1-clk"; + reg = <0x01c20000 0x4>; + clocks = <&osc24M>; +}; + +cpu: cpu@01c20054 { + #clock-cells = <0>; + compatible = "allwinner,sun4i-cpu-clk"; + reg = <0x01c20054 0x4>; + clocks = <&osc32k>, <&osc24M>, <&pll1>; +}; + + + +Gate clock outputs + +The "allwinner,sun4i-*-gates-clk" clocks provide several gatable outputs; +their corresponding offsets as present on sun4i are listed below. Note that +some of these gates are not present on sun5i. + + * AXI gates ("allwinner,sun4i-axi-gates-clk") + + DRAM 0 + + * AHB gates ("allwinner,sun4i-ahb-gates-clk") + + USB0 0 + EHCI0 1 + OHCI0 2* + EHCI1 3 + OHCI1 4* + SS 5 + DMA 6 + BIST 7 + MMC0 8 + MMC1 9 + MMC2 10 + MMC3 11 + MS 12** + NAND 13 + SDRAM 14 + + ACE 16 + EMAC 17 + TS 18 + + SPI0 20 + SPI1 21 + SPI2 22 + SPI3 23 + PATA 24 + SATA 25** + GPS 26* + + VE 32 + TVD 33 + TVE0 34 + TVE1 35 + LCD0 36 + LCD1 37 + + CSI0 40 + CSI1 41 + + HDMI 43 + DE_BE0 44 + DE_BE1 45 + DE_FE0 46 + DE_FE1 47 + + MP 50 + + MALI400 52 + + * APB0 gates ("allwinner,sun4i-apb0-gates-clk") + + CODEC 0 + SPDIF 1* + AC97 2 + IIS 3 + + PIO 5 + IR0 6 + IR1 7 + + KEYPAD 10 + + * APB1 gates ("allwinner,sun4i-apb1-gates-clk") + + I2C0 0 + I2C1 1 + I2C2 2 + + CAN 4 + SCR 5 + PS20 6 + PS21 7 + + UART0 16 + UART1 17 + UART2 18 + UART3 19 + UART4 20 + UART5 21 + UART6 22 + UART7 23 + +Notation: + [*]: The datasheet didn't mention these, but they are present on AW code + [**]: The datasheet had this marked as "NC" but they are used on AW code diff --git a/Documentation/devicetree/bindings/gpio/gpio.txt b/Documentation/devicetree/bindings/gpio/gpio.txt index a33628759d36..d933af370697 100644 --- a/Documentation/devicetree/bindings/gpio/gpio.txt +++ b/Documentation/devicetree/bindings/gpio/gpio.txt @@ -98,7 +98,7 @@ announce the pinrange to the pin ctrl subsystem. For example, compatible = "fsl,qe-pario-bank-e", "fsl,qe-pario-bank"; reg = <0x1460 0x18>; gpio-controller; - gpio-ranges = <&pinctrl1 20 10>, <&pinctrl2 50 20>; + gpio-ranges = <&pinctrl1 0 20 10>, <&pinctrl2 10 50 20>; } @@ -107,8 +107,8 @@ where, Next values specify the base pin and number of pins for the range handled by 'qe_pio_e' gpio. In the given example from base pin 20 to - pin 29 under pinctrl1 and pin 50 to pin 69 under pinctrl2 is handled - by this gpio controller. + pin 29 under pinctrl1 with gpio offset 0 and pin 50 to pin 69 under + pinctrl2 with gpio offset 10 is handled by this gpio controller. The pinctrl node must have "#gpio-range-cells" property to show number of arguments to pass with phandle from gpio controllers node. diff --git a/Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt b/Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt new file mode 100644 index 000000000000..c6f66674f19c --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt @@ -0,0 +1,29 @@ +NTC Thermistor hwmon sensors +------------------------------- + +Requires node properties: +- "compatible" value : one of + "ntc,ncp15wb473" + "ntc,ncp18wb473" + "ntc,ncp21wb473" + "ntc,ncp03wb473" + "ntc,ncp15wl333" +- "pullup-uv" Pull up voltage in micro volts +- "pullup-ohm" Pull up resistor value in ohms +- "pulldown-ohm" Pull down resistor value in ohms +- "connected-positive" Always ON, If not specified. + Status change is possible. +- "io-channels" Channel node of ADC to be used for + conversion. + +Read more about iio bindings at + Documentation/devicetree/bindings/iio/iio-bindings.txt + +Example: + ncp15wb473@0 { + compatible = "ntc,ncp15wb473"; + pullup-uv = <1800000>; + pullup-ohm = <47000>; + pulldown-ohm = <0>; + io-channels = <&adc 3>; + }; diff --git a/Documentation/devicetree/bindings/iio/iio-bindings.txt b/Documentation/devicetree/bindings/iio/iio-bindings.txt new file mode 100644 index 000000000000..0b447d9ad196 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/iio-bindings.txt @@ -0,0 +1,97 @@ +This binding is derived from clock bindings, and based on suggestions +from Lars-Peter Clausen [1]. + +Sources of IIO channels can be represented by any node in the device +tree. Those nodes are designated as IIO providers. IIO consumer +nodes use a phandle and IIO specifier pair to connect IIO provider +outputs to IIO inputs. Similar to the gpio specifiers, an IIO +specifier is an array of one or more cells identifying the IIO +output on a device. The length of an IIO specifier is defined by the +value of a #io-channel-cells property in the IIO provider node. + +[1] http://marc.info/?l=linux-iio&m=135902119507483&w=2 + +==IIO providers== + +Required properties: +#io-channel-cells: Number of cells in an IIO specifier; Typically 0 for nodes + with a single IIO output and 1 for nodes with multiple + IIO outputs. + +Example for a simple configuration with no trigger: + + adc: voltage-sensor@35 { + compatible = "maxim,max1139"; + reg = <0x35>; + #io-channel-cells = <1>; + }; + +Example for a configuration with trigger: + + adc@35 { + compatible = "some-vendor,some-adc"; + reg = <0x35>; + + adc1: iio-device@0 { + #io-channel-cells = <1>; + /* other properties */ + }; + adc2: iio-device@1 { + #io-channel-cells = <1>; + /* other properties */ + }; + }; + +==IIO consumers== + +Required properties: +io-channels: List of phandle and IIO specifier pairs, one pair + for each IIO input to the device. Note: if the + IIO provider specifies '0' for #io-channel-cells, + then only the phandle portion of the pair will appear. + +Optional properties: +io-channel-names: + List of IIO input name strings sorted in the same + order as the io-channels property. Consumers drivers + will use io-channel-names to match IIO input names + with IIO specifiers. +io-channel-ranges: + Empty property indicating that child nodes can inherit named + IIO channels from this node. Useful for bus nodes to provide + and IIO channel to their children. + +For example: + + device { + io-channels = <&adc 1>, <&ref 0>; + io-channel-names = "vcc", "vdd"; + }; + +This represents a device with two IIO inputs, named "vcc" and "vdd". +The vcc channel is connected to output 1 of the &adc device, and the +vdd channel is connected to output 0 of the &ref device. + +==Example== + + adc: max1139@35 { + compatible = "maxim,max1139"; + reg = <0x35>; + #io-channel-cells = <1>; + }; + + ... + + iio_hwmon { + compatible = "iio-hwmon"; + io-channels = <&adc 0>, <&adc 1>, <&adc 2>, + <&adc 3>, <&adc 4>, <&adc 5>, + <&adc 6>, <&adc 7>, <&adc 8>, + <&adc 9>; + }; + + some_consumer { + compatible = "some-consumer"; + io-channels = <&adc 10>, <&adc 11>; + io-channel-names = "adc1", "adc2"; + }; diff --git a/Documentation/devicetree/bindings/media/coda.txt b/Documentation/devicetree/bindings/media/coda.txt new file mode 100644 index 000000000000..2865d04e4030 --- /dev/null +++ b/Documentation/devicetree/bindings/media/coda.txt @@ -0,0 +1,30 @@ +Chips&Media Coda multi-standard codec IP +======================================== + +Coda codec IPs are present in i.MX SoCs in various versions, +called VPU (Video Processing Unit). + +Required properties: +- compatible : should be "fsl,<chip>-src" for i.MX SoCs: + (a) "fsl,imx27-vpu" for CodaDx6 present in i.MX27 + (b) "fsl,imx53-vpu" for CODA7541 present in i.MX53 + (c) "fsl,imx6q-vpu" for CODA960 present in i.MX6q +- reg: should be register base and length as documented in the + SoC reference manual +- interrupts : Should contain the VPU interrupt. For CODA960, + a second interrupt is needed for the MJPEG unit. +- clocks : Should contain the ahb and per clocks, in the order + determined by the clock-names property. +- clock-names : Should be "ahb", "per" +- iram : phandle pointing to the SRAM device node + +Example: + +vpu: vpu@63ff4000 { + compatible = "fsl,imx53-vpu"; + reg = <0x63ff4000 0x1000>; + interrupts = <9>; + clocks = <&clks 63>, <&clks 63>; + clock-names = "ahb", "per"; + iram = <&ocram>; +}; diff --git a/Documentation/devicetree/bindings/mfd/ab8500.txt b/Documentation/devicetree/bindings/mfd/ab8500.txt index 13b707b7355c..c3a14e0ad0ad 100644 --- a/Documentation/devicetree/bindings/mfd/ab8500.txt +++ b/Documentation/devicetree/bindings/mfd/ab8500.txt @@ -13,9 +13,6 @@ Required parent device properties: 4 = active high level-sensitive 8 = active low level-sensitive -Optional parent device properties: -- reg : contains the PRCMU mailbox address for the AB8500 i2c port - The AB8500 consists of a large and varied group of sub-devices: Device IRQ Names Supply Names Description @@ -86,9 +83,8 @@ Non-standard child device properties: - stericsson,amic2-bias-vamic1 : Analoge Mic wishes to use a non-standard Vamic - stericsson,earpeice-cmv : Earpeice voltage (only: 950 | 1100 | 1270 | 1580) -ab8500@5 { +ab8500 { compatible = "stericsson,ab8500"; - reg = <5>; /* mailbox 5 is i2c */ interrupts = <0 40 0x4>; interrupt-controller; #interrupt-cells = <2>; diff --git a/Documentation/devicetree/bindings/mfd/mc13xxx.txt b/Documentation/devicetree/bindings/mfd/mc13xxx.txt index baf07987ae68..abd9e3cb2db7 100644 --- a/Documentation/devicetree/bindings/mfd/mc13xxx.txt +++ b/Documentation/devicetree/bindings/mfd/mc13xxx.txt @@ -10,10 +10,40 @@ Optional properties: - fsl,mc13xxx-uses-touch : Indicate the touchscreen controller is being used Sub-nodes: -- regulators : Contain the regulator nodes. The MC13892 regulators are - bound using their names as listed below with their registers and bits - for enabling. +- regulators : Contain the regulator nodes. The regulators are bound using + their names as listed below with their registers and bits for enabling. +MC13783 regulators: + sw1a : regulator SW1A (register 24, bit 0) + sw1b : regulator SW1B (register 25, bit 0) + sw2a : regulator SW2A (register 26, bit 0) + sw2b : regulator SW2B (register 27, bit 0) + sw3 : regulator SW3 (register 29, bit 20) + vaudio : regulator VAUDIO (register 32, bit 0) + viohi : regulator VIOHI (register 32, bit 3) + violo : regulator VIOLO (register 32, bit 6) + vdig : regulator VDIG (register 32, bit 9) + vgen : regulator VGEN (register 32, bit 12) + vrfdig : regulator VRFDIG (register 32, bit 15) + vrfref : regulator VRFREF (register 32, bit 18) + vrfcp : regulator VRFCP (register 32, bit 21) + vsim : regulator VSIM (register 33, bit 0) + vesim : regulator VESIM (register 33, bit 3) + vcam : regulator VCAM (register 33, bit 6) + vrfbg : regulator VRFBG (register 33, bit 9) + vvib : regulator VVIB (register 33, bit 11) + vrf1 : regulator VRF1 (register 33, bit 12) + vrf2 : regulator VRF2 (register 33, bit 15) + vmmc1 : regulator VMMC1 (register 33, bit 18) + vmmc2 : regulator VMMC2 (register 33, bit 21) + gpo1 : regulator GPO1 (register 34, bit 6) + gpo2 : regulator GPO2 (register 34, bit 8) + gpo3 : regulator GPO3 (register 34, bit 10) + gpo4 : regulator GPO4 (register 34, bit 12) + pwgt1spi : regulator PWGT1SPI (register 34, bit 15) + pwgt2spi : regulator PWGT2SPI (register 34, bit 16) + +MC13892 regulators: vcoincell : regulator VCOINCELL (register 13, bit 23) sw1 : regulator SW1 (register 24, bit 0) sw2 : regulator SW2 (register 25, bit 0) diff --git a/Documentation/devicetree/bindings/misc/sram.txt b/Documentation/devicetree/bindings/misc/sram.txt new file mode 100644 index 000000000000..4d0a00e453a8 --- /dev/null +++ b/Documentation/devicetree/bindings/misc/sram.txt @@ -0,0 +1,16 @@ +Generic on-chip SRAM + +Simple IO memory regions to be managed by the genalloc API. + +Required properties: + +- compatible : mmio-sram + +- reg : SRAM iomem address range + +Example: + +sram: sram@5c000000 { + compatible = "mmio-sram"; + reg = <0x5c000000 0x40000>; /* 256 KiB SRAM at address 0x5c000000 */ +}; diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-single.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-single.txt index 2c81e45f1374..08f0c3d01575 100644 --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-single.txt +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-single.txt @@ -1,7 +1,9 @@ One-register-per-pin type device tree based pinctrl driver Required properties: -- compatible : "pinctrl-single" +- compatible : "pinctrl-single" or "pinconf-single". + "pinctrl-single" means that pinconf isn't supported. + "pinconf-single" means that generic pinconf is supported. - reg : offset and length of the register set for the mux registers @@ -14,9 +16,61 @@ Optional properties: - pinctrl-single,function-off : function off mode for disabled state if available and same for all registers; if not specified, disabling of pin functions is ignored + - pinctrl-single,bit-per-mux : boolean to indicate that one register controls more than one pin +- pinctrl-single,drive-strength : array of value that are used to configure + drive strength in the pinmux register. They're value of drive strength + current and drive strength mask. + + /* drive strength current, mask */ + pinctrl-single,power-source = <0x30 0xf0>; + +- pinctrl-single,bias-pullup : array of value that are used to configure the + input bias pullup in the pinmux register. + + /* input, enabled pullup bits, disabled pullup bits, mask */ + pinctrl-single,bias-pullup = <0 1 0 1>; + +- pinctrl-single,bias-pulldown : array of value that are used to configure the + input bias pulldown in the pinmux register. + + /* input, enabled pulldown bits, disabled pulldown bits, mask */ + pinctrl-single,bias-pulldown = <2 2 0 2>; + + * Two bits to control input bias pullup and pulldown: User should use + pinctrl-single,bias-pullup & pinctrl-single,bias-pulldown. One bit means + pullup, and the other one bit means pulldown. + * Three bits to control input bias enable, pullup and pulldown. User should + use pinctrl-single,bias-pullup & pinctrl-single,bias-pulldown. Input bias + enable bit should be included in pullup or pulldown bits. + * Although driver could set PIN_CONFIG_BIAS_DISABLE, there's no property as + pinctrl-single,bias-disable. Because pinctrl single driver could implement + it by calling pulldown, pullup disabled. + +- pinctrl-single,input-schmitt : array of value that are used to configure + input schmitt in the pinmux register. In some silicons, there're two input + schmitt value (rising-edge & falling-edge) in the pinmux register. + + /* input schmitt value, mask */ + pinctrl-single,input-schmitt = <0x30 0x70>; + +- pinctrl-single,input-schmitt-enable : array of value that are used to + configure input schmitt enable or disable in the pinmux register. + + /* input, enable bits, disable bits, mask */ + pinctrl-single,input-schmitt-enable = <0x30 0x40 0 0x70>; + +- pinctrl-single,gpio-range : list of value that are used to configure a GPIO + range. They're value of subnode phandle, pin base in pinctrl device, pin + number in this range, GPIO function value of this GPIO range. + The number of parameters is depend on #pinctrl-single,gpio-range-cells + property. + + /* pin base, nr pins & gpio function */ + pinctrl-single,gpio-range = <&range 0 3 0 &range 3 9 1>; + This driver assumes that there is only one register for each pin (unless the pinctrl-single,bit-per-mux is set), and uses the common pinctrl bindings as specified in the pinctrl-bindings.txt document in this directory. @@ -42,6 +96,20 @@ Where 0xdc is the offset from the pinctrl register base address for the device pinctrl register, 0x18 is the desired value, and 0xff is the sub mask to be used when applying this change to the register. + +Optional sub-node: In case some pins could be configured as GPIO in the pinmux +register, those pins could be defined as a GPIO range. This sub-node is required +by pinctrl-single,gpio-range property. + +Required properties in sub-node: +- #pinctrl-single,gpio-range-cells : the number of parameters after phandle in + pinctrl-single,gpio-range property. + + range: gpio-range { + #pinctrl-single,gpio-range-cells = <3>; + }; + + Example: /* SoC common file */ @@ -58,7 +126,7 @@ pmx_core: pinmux@4a100040 { /* second controller instance for pins in wkup domain */ pmx_wkup: pinmux@4a31e040 { - compatible = "pinctrl-single; + compatible = "pinctrl-single"; reg = <0x4a31e040 0x0038>; #address-cells = <1>; #size-cells = <0>; @@ -76,6 +144,29 @@ control_devconf0: pinmux@48002274 { pinctrl-single,function-mask = <0x5F>; }; +/* third controller instance for pins in gpio domain */ +pmx_gpio: pinmux@d401e000 { + compatible = "pinconf-single"; + reg = <0xd401e000 0x0330>; + #address-cells = <1>; + #size-cells = <1>; + ranges; + + pinctrl-single,register-width = <32>; + pinctrl-single,function-mask = <7>; + + /* sparse GPIO range could be supported */ + pinctrl-single,gpio-range = <&range 0 3 0 &range 3 9 1 + &range 12 1 0 &range 13 29 1 + &range 43 1 0 &range 44 49 1 + &range 94 1 1 &range 96 2 1>; + + range: gpio-range { + #pinctrl-single,gpio-range-cells = <3>; + }; +}; + + /* board specific .dts file */ &pmx_core { @@ -96,6 +187,15 @@ control_devconf0: pinmux@48002274 { >; }; + uart0_pins: pinmux_uart0_pins { + pinctrl-single,pins = < + 0x208 0 /* UART0_RXD (IOCFG138) */ + 0x20c 0 /* UART0_TXD (IOCFG139) */ + >; + pinctrl-single,bias-pulldown = <0 2 2>; + pinctrl-single,bias-pullup = <0 1 1>; + }; + /* map uart2 pins */ uart2_pins: pinmux_uart2_pins { pinctrl-single,pins = < @@ -122,6 +222,11 @@ control_devconf0: pinmux@48002274 { }; +&uart1 { + pinctrl-names = "default"; + pinctrl-0 = <&uart0_pins>; +}; + &uart2 { pinctrl-names = "default"; pinctrl-0 = <&uart2_pins>; diff --git a/Documentation/devicetree/bindings/pinctrl/samsung-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/samsung-pinctrl.txt index 4598a47aa0cd..c70fca146e91 100644 --- a/Documentation/devicetree/bindings/pinctrl/samsung-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/samsung-pinctrl.txt @@ -7,6 +7,7 @@ on-chip controllers onto these pads. Required Properties: - compatible: should be one of the following. + - "samsung,s3c64xx-pinctrl": for S3C64xx-compatible pin-controller, - "samsung,exynos4210-pinctrl": for Exynos4210 compatible pin-controller. - "samsung,exynos4x12-pinctrl": for Exynos4x12 compatible pin-controller. - "samsung,exynos5250-pinctrl": for Exynos5250 compatible pin-controller. @@ -105,6 +106,8 @@ B. External Wakeup Interrupts: For supporting external wakeup interrupts, a - compatible: identifies the type of the external wakeup interrupt controller The possible values are: + - samsung,s3c64xx-wakeup-eint: represents wakeup interrupt controller + found on Samsung S3C64xx SoCs, - samsung,exynos4210-wakeup-eint: represents wakeup interrupt controller found on Samsung Exynos4210 SoC. - interrupt-parent: phandle of the interrupt parent to which the external diff --git a/Documentation/devicetree/bindings/regulator/max8952.txt b/Documentation/devicetree/bindings/regulator/max8952.txt new file mode 100644 index 000000000000..866fcdd0f4eb --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/max8952.txt @@ -0,0 +1,52 @@ +Maxim MAX8952 voltage regulator + +Required properties: +- compatible: must be equal to "maxim,max8952" +- reg: I2C slave address, usually 0x60 +- max8952,dvs-mode-microvolt: array of 4 integer values defining DVS voltages + in microvolts. All values must be from range <770000, 1400000> +- any required generic properties defined in regulator.txt + +Optional properties: +- max8952,vid-gpios: array of two GPIO pins used for DVS voltage selection +- max8952,en-gpio: GPIO used to control enable status of regulator +- max8952,default-mode: index of default DVS voltage, from <0, 3> range +- max8952,sync-freq: sync frequency, must be one of following values: + - 0: 26 MHz + - 1: 13 MHz + - 2: 19.2 MHz + Defaults to 26 MHz if not specified. +- max8952,ramp-speed: voltage ramp speed, must be one of following values: + - 0: 32mV/us + - 1: 16mV/us + - 2: 8mV/us + - 3: 4mV/us + - 4: 2mV/us + - 5: 1mV/us + - 6: 0.5mV/us + - 7: 0.25mV/us + Defaults to 32mV/us if not specified. +- any available generic properties defined in regulator.txt + +Example: + + vdd_arm_reg: pmic@60 { + compatible = "maxim,max8952"; + reg = <0x60>; + + /* max8952-specific properties */ + max8952,vid-gpios = <&gpx0 3 0>, <&gpx0 4 0>; + max8952,en-gpio = <&gpx0 1 0>; + max8952,default-mode = <0>; + max8952,dvs-mode-microvolt = <1250000>, <1200000>, + <1050000>, <950000>; + max8952,sync-freq = <0>; + max8952,ramp-speed = <0>; + + /* generic regulator properties */ + regulator-name = "vdd_arm"; + regulator-min-microvolt = <770000>; + regulator-max-microvolt = <1400000>; + regulator-always-on; + regulator-boot-on; + }; diff --git a/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt b/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt new file mode 100644 index 000000000000..2a3feabd3b22 --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt @@ -0,0 +1,15 @@ +Atmel AT91RM9200 Real Time Clock + +Required properties: +- compatible: should be: "atmel,at91rm9200-rtc" +- reg: physical base address of the controller and length of memory mapped + region. +- interrupts: rtc alarm/event interrupt + +Example: + +rtc@fffffe00 { + compatible = "atmel,at91rm9200-rtc"; + reg = <0xfffffe00 0x100>; + interrupts = <1 4 7>; +}; diff --git a/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt b/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt new file mode 100644 index 000000000000..8bf89c643640 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt @@ -0,0 +1,22 @@ +Broadcom BCM2835 SPI0 controller + +The BCM2835 contains two forms of SPI master controller, one known simply as +SPI0, and the other known as the "Universal SPI Master"; part of the +auxilliary block. This binding applies to the SPI0 controller. + +Required properties: +- compatible: Should be "brcm,bcm2835-spi". +- reg: Should contain register location and length. +- interrupts: Should contain interrupt. +- clocks: The clock feeding the SPI controller. + +Example: + +spi@20204000 { + compatible = "brcm,bcm2835-spi"; + reg = <0x7e204000 0x1000>; + interrupts = <2 22>; + clocks = <&clk_spi>; + #address-cells = <1>; + #size-cells = <0>; +}; diff --git a/Documentation/devicetree/bindings/spi/fsl-spi.txt b/Documentation/devicetree/bindings/spi/fsl-spi.txt index 777abd7399d5..b032dd76e9d2 100644 --- a/Documentation/devicetree/bindings/spi/fsl-spi.txt +++ b/Documentation/devicetree/bindings/spi/fsl-spi.txt @@ -4,7 +4,7 @@ Required properties: - cell-index : QE SPI subblock index. 0: QE subblock SPI1 1: QE subblock SPI2 -- compatible : should be "fsl,spi". +- compatible : should be "fsl,spi" or "aeroflexgaisler,spictrl". - mode : the SPI operation mode, it can be "cpu" or "cpu-qe". - reg : Offset and length of the register set for the device - interrupts : <a b> where a is the interrupt number and b is a @@ -14,6 +14,7 @@ Required properties: controller you have. - interrupt-parent : the phandle for the interrupt controller that services interrupts for this device. +- clock-frequency : input clock frequency to non FSL_SOC cores Optional properties: - gpios : specifies the gpio pins to be used for chipselects. diff --git a/Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt b/Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt new file mode 100644 index 000000000000..91ff771c7e77 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt @@ -0,0 +1,26 @@ +NVIDIA Tegra114 SPI controller. + +Required properties: +- compatible : should be "nvidia,tegra114-spi". +- reg: Should contain SPI registers location and length. +- interrupts: Should contain SPI interrupts. +- nvidia,dma-request-selector : The Tegra DMA controller's phandle and + request selector for this SPI controller. +- This is also require clock named "spi" as per binding document + Documentation/devicetree/bindings/clock/clock-bindings.txt + +Recommended properties: +- spi-max-frequency: Definition as per + Documentation/devicetree/bindings/spi/spi-bus.txt +Example: + +spi@7000d600 { + compatible = "nvidia,tegra114-spi"; + reg = <0x7000d600 0x200>; + interrupts = <0 82 0x04>; + nvidia,dma-request-selector = <&apbdma 16>; + spi-max-frequency = <25000000>; + #address-cells = <1>; + #size-cells = <0>; + status = "disabled"; +}; diff --git a/Documentation/devicetree/bindings/spi/spi-samsung.txt b/Documentation/devicetree/bindings/spi/spi-samsung.txt index a15ffeddfba4..86aa061f069f 100644 --- a/Documentation/devicetree/bindings/spi/spi-samsung.txt +++ b/Documentation/devicetree/bindings/spi/spi-samsung.txt @@ -31,9 +31,6 @@ Required Board Specific Properties: - #address-cells: should be 1. - #size-cells: should be 0. -- gpios: The gpio specifier for clock, mosi and miso interface lines (in the - order specified). The format of the gpio specifier depends on the gpio - controller. Optional Board Specific Properties: @@ -86,9 +83,8 @@ Example: spi_0: spi@12d20000 { #address-cells = <1>; #size-cells = <0>; - gpios = <&gpa2 4 2 3 0>, - <&gpa2 6 2 3 0>, - <&gpa2 7 2 3 0>; + pinctrl-names = "default"; + pinctrl-0 = <&spi0_bus>; w25q80bw@0 { #address-cells = <1>; diff --git a/Documentation/devicetree/bindings/staging/dwc2.txt b/Documentation/devicetree/bindings/staging/dwc2.txt new file mode 100644 index 000000000000..1a1b7cfa4845 --- /dev/null +++ b/Documentation/devicetree/bindings/staging/dwc2.txt @@ -0,0 +1,15 @@ +Platform DesignWare HS OTG USB 2.0 controller +----------------------------------------------------- + +Required properties: +- compatible : "snps,dwc2" +- reg : Should contain 1 register range (address and length) +- interrupts : Should contain 1 interrupt + +Example: + + usb@101c0000 { + compatible = "ralink,rt3050-usb, snps,dwc2"; + reg = <0x101c0000 40000>; + interrupts = <18>; + }; diff --git a/Documentation/devicetree/bindings/staging/imx-drm/fsl-imx-drm.txt b/Documentation/devicetree/bindings/staging/imx-drm/fsl-imx-drm.txt index 07654f0338b6..8071ac20d4b3 100644 --- a/Documentation/devicetree/bindings/staging/imx-drm/fsl-imx-drm.txt +++ b/Documentation/devicetree/bindings/staging/imx-drm/fsl-imx-drm.txt @@ -26,7 +26,7 @@ Required properties: - crtc: the crtc this display is connected to, see below Optional properties: - interface_pix_fmt: How this display is connected to the - crtc. Currently supported types: "rgb24", "rgb565" + crtc. Currently supported types: "rgb24", "rgb565", "bgr666" - edid: verbatim EDID data block describing attached display. - ddc: phandle describing the i2c bus handling the display data channel diff --git a/Documentation/devicetree/bindings/tty/serial/of-serial.txt b/Documentation/devicetree/bindings/tty/serial/of-serial.txt index 1e1145ca4f3c..1928a3e83cd0 100644 --- a/Documentation/devicetree/bindings/tty/serial/of-serial.txt +++ b/Documentation/devicetree/bindings/tty/serial/of-serial.txt @@ -11,6 +11,9 @@ Required properties: - "nvidia,tegra20-uart" - "nxp,lpc3220-uart" - "ibm,qpace-nwp-serial" + - "altr,16550-FIFO32" + - "altr,16550-FIFO64" + - "altr,16550-FIFO128" - "serial" if the port type is unknown. - reg : offset and length of the register set for the device. - interrupts : should contain uart interrupt. @@ -30,6 +33,10 @@ Optional properties: RTAS and should not be registered. - no-loopback-test: set to indicate that the port does not implements loopback test mode +- fifo-size: the fifo size of the UART. +- auto-flow-control: one way to enable automatic flow control support. The + driver is allowed to detect support for the capability even without this + property. Example: diff --git a/Documentation/devicetree/bindings/usb/ci13xxx-imx.txt b/Documentation/devicetree/bindings/usb/ci13xxx-imx.txt index 5778b9c83bd8..1c04a4c9515f 100644 --- a/Documentation/devicetree/bindings/usb/ci13xxx-imx.txt +++ b/Documentation/devicetree/bindings/usb/ci13xxx-imx.txt @@ -11,6 +11,7 @@ Optional properties: that indicate usb controller index - vbus-supply: regulator for vbus - disable-over-current: disable over current detect +- external-vbus-divider: enables off-chip resistor divider for Vbus Examples: usb@02184000 { /* USB OTG */ @@ -20,4 +21,5 @@ usb@02184000 { /* USB OTG */ fsl,usbphy = <&usbphy1>; fsl,usbmisc = <&usbmisc 0>; disable-over-current; + external-vbus-divider; }; diff --git a/Documentation/devicetree/bindings/usb/ehci-omap.txt b/Documentation/devicetree/bindings/usb/ehci-omap.txt new file mode 100644 index 000000000000..485a9a1efa7a --- /dev/null +++ b/Documentation/devicetree/bindings/usb/ehci-omap.txt @@ -0,0 +1,32 @@ +OMAP HS USB EHCI controller + +This device is usually the child of the omap-usb-host +Documentation/devicetree/bindings/mfd/omap-usb-host.txt + +Required properties: + +- compatible: should be "ti,ehci-omap" +- reg: should contain one register range i.e. start and length +- interrupts: description of the interrupt line + +Optional properties: + +- phys: list of phandles to PHY nodes. + This property is required if at least one of the ports are in + PHY mode i.e. OMAP_EHCI_PORT_MODE_PHY + +To specify the port mode, see +Documentation/devicetree/bindings/mfd/omap-usb-host.txt + +Example for OMAP4: + +usbhsehci: ehci@4a064c00 { + compatible = "ti,ehci-omap", "usb-ehci"; + reg = <0x4a064c00 0x400>; + interrupts = <0 77 0x4>; +}; + +&usbhsehci { + phys = <&hsusb1_phy 0 &hsusb3_phy>; +}; + diff --git a/Documentation/devicetree/bindings/usb/ohci-omap3.txt b/Documentation/devicetree/bindings/usb/ohci-omap3.txt new file mode 100644 index 000000000000..14ab42812a8e --- /dev/null +++ b/Documentation/devicetree/bindings/usb/ohci-omap3.txt @@ -0,0 +1,15 @@ +OMAP HS USB OHCI controller (OMAP3 and later) + +Required properties: + +- compatible: should be "ti,ohci-omap3" +- reg: should contain one register range i.e. start and length +- interrupts: description of the interrupt line + +Example for OMAP4: + +usbhsohci: ohci@4a064800 { + compatible = "ti,ohci-omap3", "usb-ohci"; + reg = <0x4a064800 0x400>; + interrupts = <0 76 0x4>; +}; diff --git a/Documentation/devicetree/bindings/usb/omap-usb.txt b/Documentation/devicetree/bindings/usb/omap-usb.txt index 1ef0ce71f8fa..662f0f1d2315 100644 --- a/Documentation/devicetree/bindings/usb/omap-usb.txt +++ b/Documentation/devicetree/bindings/usb/omap-usb.txt @@ -8,10 +8,10 @@ OMAP MUSB GLUE and disconnect. - multipoint : Should be "1" indicating the musb controller supports multipoint. This is a MUSB configuration-specific setting. - - num_eps : Specifies the number of endpoints. This is also a + - num-eps : Specifies the number of endpoints. This is also a MUSB configuration-specific setting. Should be set to "16" - - ram_bits : Specifies the ram address size. Should be set to "12" - - interface_type : This is a board specific setting to describe the type of + - ram-bits : Specifies the ram address size. Should be set to "12" + - interface-type : This is a board specific setting to describe the type of interface between the controller and the phy. It should be "0" or "1" specifying ULPI and UTMI respectively. - mode : Should be "3" to represent OTG. "1" signifies HOST and "2" @@ -29,18 +29,46 @@ usb_otg_hs: usb_otg_hs@4a0ab000 { ti,hwmods = "usb_otg_hs"; ti,has-mailbox; multipoint = <1>; - num_eps = <16>; - ram_bits = <12>; + num-eps = <16>; + ram-bits = <12>; ctrl-module = <&omap_control_usb>; }; Board specific device node entry &usb_otg_hs { - interface_type = <1>; + interface-type = <1>; mode = <3>; power = <50>; }; +OMAP DWC3 GLUE + - compatible : Should be "ti,dwc3" + - ti,hwmods : Should be "usb_otg_ss" + - reg : Address and length of the register set for the device. + - interrupts : The irq number of this device that is used to interrupt the + MPU + - #address-cells, #size-cells : Must be present if the device has sub-nodes + - utmi-mode : controls the source of UTMI/PIPE status for VBUS and OTG ID. + It should be set to "1" for HW mode and "2" for SW mode. + - ranges: the child address space are mapped 1:1 onto the parent address space + +Sub-nodes: +The dwc3 core should be added as subnode to omap dwc3 glue. +- dwc3 : + The binding details of dwc3 can be found in: + Documentation/devicetree/bindings/usb/dwc3.txt + +omap_dwc3 { + compatible = "ti,dwc3"; + ti,hwmods = "usb_otg_ss"; + reg = <0x4a020000 0x1ff>; + interrupts = <0 93 4>; + #address-cells = <1>; + #size-cells = <1>; + utmi-mode = <2>; + ranges; +}; + OMAP CONTROL USB Required properties: diff --git a/Documentation/devicetree/bindings/usb/samsung-usbphy.txt b/Documentation/devicetree/bindings/usb/samsung-usbphy.txt index 033194934f64..f575302e5173 100644 --- a/Documentation/devicetree/bindings/usb/samsung-usbphy.txt +++ b/Documentation/devicetree/bindings/usb/samsung-usbphy.txt @@ -1,20 +1,25 @@ -* Samsung's usb phy transceiver +SAMSUNG USB-PHY controllers -The Samsung's phy transceiver is used for controlling usb phy for -s3c-hsotg as well as ehci-s5p and ohci-exynos usb controllers -across Samsung SOCs. +** Samsung's usb 2.0 phy transceiver + +The Samsung's usb 2.0 phy transceiver is used for controlling +usb 2.0 phy for s3c-hsotg as well as ehci-s5p and ohci-exynos +usb controllers across Samsung SOCs. TODO: Adding the PHY binding with controller(s) according to the under developement generic PHY driver. Required properties: Exynos4210: -- compatible : should be "samsung,exynos4210-usbphy" +- compatible : should be "samsung,exynos4210-usb2phy" - reg : base physical address of the phy registers and length of memory mapped region. +- clocks: Clock IDs array as required by the controller. +- clock-names: names of clock correseponding IDs clock property as requested + by the controller driver. Exynos5250: -- compatible : should be "samsung,exynos5250-usbphy" +- compatible : should be "samsung,exynos5250-usb2phy" - reg : base physical address of the phy registers and length of memory mapped region. @@ -44,12 +49,69 @@ Example: usbphy@125B0000 { #address-cells = <1>; #size-cells = <1>; - compatible = "samsung,exynos4210-usbphy"; + compatible = "samsung,exynos4210-usb2phy"; reg = <0x125B0000 0x100>; ranges; + clocks = <&clock 2>, <&clock 305>; + clock-names = "xusbxti", "otg"; + usbphy-sys { /* USB device and host PHY_CONTROL registers */ reg = <0x10020704 0x8>; }; }; + + +** Samsung's usb 3.0 phy transceiver + +Starting exynso5250, Samsung's SoC have usb 3.0 phy transceiver +which is used for controlling usb 3.0 phy for dwc3-exynos usb 3.0 +controllers across Samsung SOCs. + +Required properties: + +Exynos5250: +- compatible : should be "samsung,exynos5250-usb3phy" +- reg : base physical address of the phy registers and length of memory mapped + region. +- clocks: Clock IDs array as required by the controller. +- clock-names: names of clocks correseponding to IDs in the clock property + as requested by the controller driver. + +Optional properties: +- #address-cells: should be '1' when usbphy node has a child node with 'reg' + property. +- #size-cells: should be '1' when usbphy node has a child node with 'reg' + property. +- ranges: allows valid translation between child's address space and parent's + address space. + +- The child node 'usbphy-sys' to the node 'usbphy' is for the system controller + interface for usb-phy. It should provide the following information required by + usb-phy controller to control phy. + - reg : base physical address of PHY_CONTROL registers. + The size of this register is the total sum of size of all PHY_CONTROL + registers that the SoC has. For example, the size will be + '0x4' in case we have only one PHY_CONTROL register (e.g. + OTHERS register in S3C64XX or USB_PHY_CONTROL register in S5PV210) + and, '0x8' in case we have two PHY_CONTROL registers (e.g. + USBDEVICE_PHY_CONTROL and USBHOST_PHY_CONTROL registers in exynos4x). + and so on. + +Example: + usbphy@12100000 { + compatible = "samsung,exynos5250-usb3phy"; + reg = <0x12100000 0x100>; + #address-cells = <1>; + #size-cells = <1>; + ranges; + + clocks = <&clock 1>, <&clock 286>; + clock-names = "ext_xtal", "usbdrd30"; + + usbphy-sys { + /* USB device and host PHY_CONTROL registers */ + reg = <0x10040704 0x8>; + }; + }; diff --git a/Documentation/devicetree/bindings/usb/usb-nop-xceiv.txt b/Documentation/devicetree/bindings/usb/usb-nop-xceiv.txt new file mode 100644 index 000000000000..d7e272671c7e --- /dev/null +++ b/Documentation/devicetree/bindings/usb/usb-nop-xceiv.txt @@ -0,0 +1,34 @@ +USB NOP PHY + +Required properties: +- compatible: should be usb-nop-xceiv + +Optional properties: +- clocks: phandle to the PHY clock. Use as per Documentation/devicetree + /bindings/clock/clock-bindings.txt + This property is required if clock-frequency is specified. + +- clock-names: Should be "main_clk" + +- clock-frequency: the clock frequency (in Hz) that the PHY clock must + be configured to. + +- vcc-supply: phandle to the regulator that provides RESET to the PHY. + +- reset-supply: phandle to the regulator that provides power to the PHY. + +Example: + + hsusb1_phy { + compatible = "usb-nop-xceiv"; + clock-frequency = <19200000>; + clocks = <&osc 0>; + clock-names = "main_clk"; + vcc-supply = <&hsusb1_vcc_regulator>; + reset-supply = <&hsusb1_reset_regulator>; + }; + +hsusb1_phy is a NOP USB PHY device that gets its clock from an oscillator +and expects that clock to be configured to 19.2MHz by the NOP PHY driver. +hsusb1_vcc_regulator provides power to the PHY and hsusb1_reset_regulator +controls RESET. diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index 19e1ef73ab0d..4d1919bf2332 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -5,6 +5,7 @@ using them to avoid name-space collisions. ad Avionic Design GmbH adi Analog Devices, Inc. +aeroflexgaisler Aeroflex Gaisler AB ak Asahi Kasei Corp. amcc Applied Micro Circuits Corporation (APM, formally AMCC) apm Applied Micro Circuits Corporation (APM) @@ -48,6 +49,7 @@ samsung Samsung Semiconductor sbs Smart Battery System schindler Schindler sil Silicon Image +silabs Silicon Laboratories simtek sirf SiRF Technology, Inc. snps Synopsys, Inc. diff --git a/Documentation/devicetree/bindings/video/backlight/lp855x.txt b/Documentation/devicetree/bindings/video/backlight/lp855x.txt new file mode 100644 index 000000000000..1482103d288f --- /dev/null +++ b/Documentation/devicetree/bindings/video/backlight/lp855x.txt @@ -0,0 +1,41 @@ +lp855x bindings + +Required properties: + - compatible: "ti,lp8550", "ti,lp8551", "ti,lp8552", "ti,lp8553", + "ti,lp8556", "ti,lp8557" + - reg: I2C slave address (u8) + - dev-ctrl: Value of DEVICE CONTROL register (u8). It depends on the device. + +Optional properties: + - bl-name: Backlight device name (string) + - init-brt: Initial value of backlight brightness (u8) + - pwm-period: PWM period value. Set only PWM input mode used (u32) + - rom-addr: Register address of ROM area to be updated (u8) + - rom-val: Register value to be updated (u8) + +Example: + + /* LP8556 */ + backlight@2c { + compatible = "ti,lp8556"; + reg = <0x2c>; + + bl-name = "lcd-bl"; + dev-ctrl = /bits/ 8 <0x85>; + init-brt = /bits/ 8 <0x10>; + }; + + /* LP8557 */ + backlight@2c { + compatible = "ti,lp8557"; + reg = <0x2c>; + + dev-ctrl = /bits/ 8 <0x41>; + init-brt = /bits/ 8 <0x0a>; + + /* 4V OV, 4 output LED string enabled */ + rom_14h { + rom-addr = /bits/ 8 <0x14>; + rom-val = /bits/ 8 <0xcf>; + }; + }; diff --git a/Documentation/devicetree/bindings/video/backlight/tps65217-backlight.txt b/Documentation/devicetree/bindings/video/backlight/tps65217-backlight.txt new file mode 100644 index 000000000000..5fb9279ac287 --- /dev/null +++ b/Documentation/devicetree/bindings/video/backlight/tps65217-backlight.txt @@ -0,0 +1,27 @@ +TPS65217 family of regulators + +The TPS65217 chip contains a boost converter and current sinks which can be +used to drive LEDs for use as backlights. + +Required properties: +- compatible: "ti,tps65217" +- reg: I2C slave address +- backlight: node for specifying WLED1 and WLED2 lines in TPS65217 +- isel: selection bit, valid values: 1 for ISEL1 (low-level) and 2 for ISEL2 (high-level) +- fdim: PWM dimming frequency, valid values: 100, 200, 500, 1000 +- default-brightness: valid values: 0-100 + +Each regulator is defined using the standard binding for regulators. + +Example: + + tps: tps@24 { + reg = <0x24>; + compatible = "ti,tps65217"; + backlight { + isel = <1>; /* 1 - ISET1, 2 ISET2 */ + fdim = <100>; /* TPS65217_BL_FDIM_100HZ */ + default-brightness = <50>; + }; + }; + diff --git a/Documentation/devicetree/bindings/video/via,vt8500-fb.txt b/Documentation/devicetree/bindings/video/via,vt8500-fb.txt index c870b6478ec8..2871e218a0fb 100644 --- a/Documentation/devicetree/bindings/video/via,vt8500-fb.txt +++ b/Documentation/devicetree/bindings/video/via,vt8500-fb.txt @@ -5,58 +5,32 @@ Required properties: - compatible : "via,vt8500-fb" - reg : Should contain 1 register ranges(address and length) - interrupts : framebuffer controller interrupt -- display: a phandle pointing to the display node +- bits-per-pixel : bit depth of framebuffer (16 or 32) -Required nodes: -- display: a display node is required to initialize the lcd panel - This should be in the board dts. -- default-mode: a videomode within the display with timing parameters - as specified below. +Required subnodes: +- display-timings: see display-timing.txt for information Example: - fb@d800e400 { + fb@d8050800 { compatible = "via,vt8500-fb"; reg = <0xd800e400 0x400>; interrupts = <12>; - display = <&display>; - default-mode = <&mode0>; - }; - -VIA VT8500 Display ------------------------------------------------------ -Required properties (as per of_videomode_helper): - - - hactive, vactive: Display resolution - - hfront-porch, hback-porch, hsync-len: Horizontal Display timing parameters - in pixels - vfront-porch, vback-porch, vsync-len: Vertical display timing parameters in - lines - - clock: displayclock in Hz - - bpp: lcd panel bit-depth. - <16> for RGB565, <32> for RGB888 - -Optional properties (as per of_videomode_helper): - - width-mm, height-mm: Display dimensions in mm - - hsync-active-high (bool): Hsync pulse is active high - - vsync-active-high (bool): Vsync pulse is active high - - interlaced (bool): This is an interlaced mode - - doublescan (bool): This is a doublescan mode + bits-per-pixel = <16>; -Example: - display: display@0 { - modes { - mode0: mode@0 { + display-timings { + native-mode = <&timing0>; + timing0: 800x480 { + clock-frequency = <0>; /* unused but required */ hactive = <800>; vactive = <480>; - hback-porch = <88>; hfront-porch = <40>; + hback-porch = <88>; hsync-len = <0>; vback-porch = <32>; vfront-porch = <11>; vsync-len = <1>; - clock = <0>; /* unused but required */ - bpp = <16>; /* non-standard but required */ }; }; }; + diff --git a/Documentation/devicetree/bindings/video/wm,wm8505-fb.txt b/Documentation/devicetree/bindings/video/wm,wm8505-fb.txt index 3d325e1d11ee..0bcadb2840a5 100644 --- a/Documentation/devicetree/bindings/video/wm,wm8505-fb.txt +++ b/Documentation/devicetree/bindings/video/wm,wm8505-fb.txt @@ -4,20 +4,30 @@ Wondermedia WM8505 Framebuffer Required properties: - compatible : "wm,wm8505-fb" - reg : Should contain 1 register ranges(address and length) -- via,display: a phandle pointing to the display node +- bits-per-pixel : bit depth of framebuffer (16 or 32) -Required nodes: -- display: a display node is required to initialize the lcd panel - This should be in the board dts. See definition in - Documentation/devicetree/bindings/video/via,vt8500-fb.txt -- default-mode: a videomode node as specified in - Documentation/devicetree/bindings/video/via,vt8500-fb.txt +Required subnodes: +- display-timings: see display-timing.txt for information Example: - fb@d8050800 { + fb@d8051700 { compatible = "wm,wm8505-fb"; - reg = <0xd8050800 0x200>; - display = <&display>; - default-mode = <&mode0>; + reg = <0xd8051700 0x200>; + bits-per-pixel = <16>; + + display-timings { + native-mode = <&timing0>; + timing0: 800x480 { + clock-frequency = <0>; /* unused but required */ + hactive = <800>; + vactive = <480>; + hfront-porch = <40>; + hback-porch = <88>; + hsync-len = <0>; + vback-porch = <32>; + vfront-porch = <11>; + vsync-len = <1>; + }; + }; }; diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index d230dd9c99b0..4a93e98b290a 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -150,12 +150,28 @@ discard -- If set, issues discard/TRIM commands to the block device when blocks are freed. This is useful for SSD devices and sparse/thinly-provisoned LUNs. -nfs -- This option maintains an index (cache) of directory - inodes by i_logstart which is used by the nfs-related code to - improve look-ups. +nfs=stale_rw|nostale_ro + Enable this only if you want to export the FAT filesystem + over NFS. + + stale_rw: This option maintains an index (cache) of directory + inodes by i_logstart which is used by the nfs-related code to + improve look-ups. Full file operations (read/write) over NFS is + supported but with cache eviction at NFS server, this could + result in ESTALE issues. + + nostale_ro: This option bases the inode number and filehandle + on the on-disk location of a file in the MS-DOS directory entry. + This ensures that ESTALE will not be returned after a file is + evicted from the inode cache. However, it means that operations + such as rename, create and unlink could cause filehandles that + previously pointed at one file to point at a different file, + potentially causing data corruption. For this reason, this + option also mounts the filesystem readonly. + + To maintain backward compatibility, '-o nfs' is also accepted, + defaulting to stale_rw - Enable this only if you want to export the FAT filesystem - over NFS <bool>: 0,1,yes,no,true,false diff --git a/Documentation/hwmon/adm1275 b/Documentation/hwmon/adm1275 index 2cfa25667123..15b4a20d5062 100644 --- a/Documentation/hwmon/adm1275 +++ b/Documentation/hwmon/adm1275 @@ -15,7 +15,7 @@ Supported chips: Addresses scanned: - Datasheet: www.analog.com/static/imported-files/data_sheets/ADM1276.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/adt7410 b/Documentation/hwmon/adt7410 index 96004000dc2a..9817941e5f19 100644 --- a/Documentation/hwmon/adt7410 +++ b/Documentation/hwmon/adt7410 @@ -4,28 +4,50 @@ Kernel driver adt7410 Supported chips: * Analog Devices ADT7410 Prefix: 'adt7410' - Addresses scanned: I2C 0x48 - 0x4B + Addresses scanned: None Datasheet: Publicly available at the Analog Devices website http://www.analog.com/static/imported-files/data_sheets/ADT7410.pdf + * Analog Devices ADT7420 + Prefix: 'adt7420' + Addresses scanned: None + Datasheet: Publicly available at the Analog Devices website + http://www.analog.com/static/imported-files/data_sheets/ADT7420.pdf + * Analog Devices ADT7310 + Prefix: 'adt7310' + Addresses scanned: None + Datasheet: Publicly available at the Analog Devices website + http://www.analog.com/static/imported-files/data_sheets/ADT7310.pdf + * Analog Devices ADT7320 + Prefix: 'adt7320' + Addresses scanned: None + Datasheet: Publicly available at the Analog Devices website + http://www.analog.com/static/imported-files/data_sheets/ADT7320.pdf Author: Hartmut Knaack <knaack.h@gmx.de> Description ----------- -The ADT7410 is a temperature sensor with rated temperature range of -55°C to -+150°C. It has a high accuracy of +/-0.5°C and can be operated at a resolution -of 13 bits (0.0625°C) or 16 bits (0.0078°C). The sensor provides an INT pin to -indicate that a minimum or maximum temperature set point has been exceeded, as -well as a critical temperature (CT) pin to indicate that the critical -temperature set point has been exceeded. Both pins can be set up with a common -hysteresis of 0°C - 15°C and a fault queue, ranging from 1 to 4 events. Both -pins can individually set to be active-low or active-high, while the whole -device can either run in comparator mode or interrupt mode. The ADT7410 -supports continous temperature sampling, as well as sampling one temperature -value per second or even justget one sample on demand for power saving. -Besides, it can completely power down its ADC, if power management is -required. +The ADT7310/ADT7410 is a temperature sensor with rated temperature range of +-55°C to +150°C. It has a high accuracy of +/-0.5°C and can be operated at a +resolution of 13 bits (0.0625°C) or 16 bits (0.0078°C). The sensor provides an +INT pin to indicate that a minimum or maximum temperature set point has been +exceeded, as well as a critical temperature (CT) pin to indicate that the +critical temperature set point has been exceeded. Both pins can be set up with a +common hysteresis of 0°C - 15°C and a fault queue, ranging from 1 to 4 events. +Both pins can individually set to be active-low or active-high, while the whole +device can either run in comparator mode or interrupt mode. The ADT7410 supports +continuous temperature sampling, as well as sampling one temperature value per +second or even just get one sample on demand for power saving. Besides, it can +completely power down its ADC, if power management is required. + +The ADT7320/ADT7420 is register compatible, the only differences being the +package, a slightly narrower operating temperature range (-40°C to +150°C), and +a better accuracy (0.25°C instead of 0.50°C.) + +The difference between the ADT7310/ADT7320 and ADT7410/ADT7420 is the control +interface, the ADT7310 and ADT7320 use SPI while the ADT7410 and ADT7420 use +I2C. Configuration Notes ------------------- diff --git a/Documentation/hwmon/jc42 b/Documentation/hwmon/jc42 index 165077121238..868d74d6b773 100644 --- a/Documentation/hwmon/jc42 +++ b/Documentation/hwmon/jc42 @@ -49,7 +49,7 @@ Supported chips: Addresses scanned: I2C 0x18 - 0x1f Author: - Guenter Roeck <guenter.roeck@ericsson.com> + Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/lineage-pem b/Documentation/hwmon/lineage-pem index 2ba5ed126858..83b2ddc160c8 100644 --- a/Documentation/hwmon/lineage-pem +++ b/Documentation/hwmon/lineage-pem @@ -8,7 +8,7 @@ Supported devices: Documentation: http://www.lineagepower.com/oem/pdf/CPLI2C.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/lm25066 b/Documentation/hwmon/lm25066 index a21db81c4591..c1b57d72efc3 100644 --- a/Documentation/hwmon/lm25066 +++ b/Documentation/hwmon/lm25066 @@ -1,7 +1,13 @@ -Kernel driver max8688 +Kernel driver lm25066 ===================== Supported chips: + * TI LM25056 + Prefix: 'lm25056' + Addresses scanned: - + Datasheets: + http://www.ti.com/lit/gpn/lm25056 + http://www.ti.com/lit/gpn/lm25056a * National Semiconductor LM25066 Prefix: 'lm25066' Addresses scanned: - @@ -19,14 +25,15 @@ Supported chips: Datasheet: http://www.national.com/pf/LM/LM5066.html -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description ----------- -This driver supports hardware montoring for National Semiconductor LM25066, -LM5064, and LM5064 Power Management, Monitoring, Control, and Protection ICs. +This driver supports hardware montoring for National Semiconductor / TI LM25056, +LM25066, LM5064, and LM5064 Power Management, Monitoring, Control, and +Protection ICs. The driver is a client driver to the core PMBus driver. Please see Documentation/hwmon/pmbus for details on PMBus client drivers. @@ -60,14 +67,19 @@ in1_max Maximum input voltage. in1_min_alarm Input voltage low alarm. in1_max_alarm Input voltage high alarm. -in2_label "vout1" -in2_input Measured output voltage. -in2_average Average measured output voltage. -in2_min Minimum output voltage. -in2_min_alarm Output voltage low alarm. - -in3_label "vout2" -in3_input Measured voltage on vaux pin +in2_label "vmon" +in2_input Measured voltage on VAUX pin +in2_min Minimum VAUX voltage (LM25056 only). +in2_max Maximum VAUX voltage (LM25056 only). +in2_min_alarm VAUX voltage low alarm (LM25056 only). +in2_max_alarm VAUX voltage high alarm (LM25056 only). + +in3_label "vout1" + Not supported on LM25056. +in3_input Measured output voltage. +in3_average Average measured output voltage. +in3_min Minimum output voltage. +in3_min_alarm Output voltage low alarm. curr1_label "iin" curr1_input Measured input current. diff --git a/Documentation/hwmon/lm75 b/Documentation/hwmon/lm75 index c91a1d15fa28..69af1c7db6b7 100644 --- a/Documentation/hwmon/lm75 +++ b/Documentation/hwmon/lm75 @@ -23,7 +23,7 @@ Supported chips: Datasheet: Publicly available at the Maxim website http://www.maxim-ic.com/ * Microchip (TelCom) TCN75 - Prefix: 'lm75' + Prefix: 'tcn75' Addresses scanned: none Datasheet: Publicly available at the Microchip website http://www.microchip.com/ diff --git a/Documentation/hwmon/lm95234 b/Documentation/hwmon/lm95234 new file mode 100644 index 000000000000..a0e95ddfd372 --- /dev/null +++ b/Documentation/hwmon/lm95234 @@ -0,0 +1,36 @@ +Kernel driver lm95234 +===================== + +Supported chips: + * National Semiconductor / Texas Instruments LM95234 + Addresses scanned: I2C 0x18, 0x4d, 0x4e + Datasheet: Publicly available at the Texas Instruments website + http://www.ti.com/product/lm95234 + + +Author: Guenter Roeck <linux@roeck-us.net> + +Description +----------- + +LM95234 is an 11-bit digital temperature sensor with a 2-wire System Management +Bus (SMBus) interface and TrueTherm technology that can very accurately monitor +the temperature of four remote diodes as well as its own temperature. +The four remote diodes can be external devices such as microprocessors, +graphics processors or diode-connected 2N3904s. The LM95234's TruTherm +beta compensation technology allows sensing of 90 nm or 65 nm process +thermal diodes accurately. + +All temperature values are given in millidegrees Celsius. Temperature +is provided within a range of -127 to +255 degrees (+127.875 degrees for +the internal sensor). Resolution depends on temperature input and range. + +Each sensor has its own maximum limit, but the hysteresis is common to all +channels. The hysteresis is configurable with the tem1_max_hyst attribute and +affects the hysteresis on all channels. The first two external sensors also +have a critical limit. + +The lm95234 driver can change its update interval to a fixed set of values. +It will round up to the next selectable interval. See the datasheet for exact +values. Reading sensor values more often will do no harm, but will return +'old' values. diff --git a/Documentation/hwmon/ltc2978 b/Documentation/hwmon/ltc2978 index c365f9beb5dd..dc0d08c61305 100644 --- a/Documentation/hwmon/ltc2978 +++ b/Documentation/hwmon/ltc2978 @@ -2,24 +2,32 @@ Kernel driver ltc2978 ===================== Supported chips: + * Linear Technology LTC2974 + Prefix: 'ltc2974' + Addresses scanned: - + Datasheet: http://www.linear.com/product/ltc2974 * Linear Technology LTC2978 Prefix: 'ltc2978' Addresses scanned: - - Datasheet: http://cds.linear.com/docs/Datasheet/2978fa.pdf + Datasheet: http://www.linear.com/product/ltc2978 * Linear Technology LTC3880 Prefix: 'ltc3880' Addresses scanned: - - Datasheet: http://cds.linear.com/docs/Datasheet/3880f.pdf + Datasheet: http://www.linear.com/product/ltc3880 + * Linear Technology LTC3883 + Prefix: 'ltc3883' + Addresses scanned: - + Datasheet: http://www.linear.com/product/ltc3883 -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description ----------- -The LTC2978 is an octal power supply monitor, supervisor, sequencer and -margin controller. The LTC3880 is a dual, PolyPhase DC/DC synchronous -step-down switching regulator controller. +LTC2974 is a quad digital power supply manager. LTC2978 is an octal power supply +monitor. LTC3880 is a dual output poly-phase step-down DC/DC controller. LTC3883 +is a single phase step-down DC/DC controller. Usage Notes @@ -41,63 +49,90 @@ Sysfs attributes in1_label "vin" in1_input Measured input voltage. in1_min Minimum input voltage. -in1_max Maximum input voltage. -in1_lcrit Critical minimum input voltage. +in1_max Maximum input voltage. LTC2974 and LTC2978 only. +in1_lcrit Critical minimum input voltage. LTC2974 and LTC2978 + only. in1_crit Critical maximum input voltage. in1_min_alarm Input voltage low alarm. -in1_max_alarm Input voltage high alarm. -in1_lcrit_alarm Input voltage critical low alarm. +in1_max_alarm Input voltage high alarm. LTC2974 and LTC2978 only. +in1_lcrit_alarm Input voltage critical low alarm. LTC2974 and LTC2978 + only. in1_crit_alarm Input voltage critical high alarm. -in1_lowest Lowest input voltage. LTC2978 only. +in1_lowest Lowest input voltage. LTC2974 and LTC2978 only. in1_highest Highest input voltage. -in1_reset_history Reset history. Writing into this attribute will reset - history for all attributes. - -in[2-9]_label "vout[1-8]". Channels 3 to 9 on LTC2978 only. -in[2-9]_input Measured output voltage. -in[2-9]_min Minimum output voltage. -in[2-9]_max Maximum output voltage. -in[2-9]_lcrit Critical minimum output voltage. -in[2-9]_crit Critical maximum output voltage. -in[2-9]_min_alarm Output voltage low alarm. -in[2-9]_max_alarm Output voltage high alarm. -in[2-9]_lcrit_alarm Output voltage critical low alarm. -in[2-9]_crit_alarm Output voltage critical high alarm. -in[2-9]_lowest Lowest output voltage. LTC2978 only. -in[2-9]_highest Lowest output voltage. -in[2-9]_reset_history Reset history. Writing into this attribute will reset - history for all attributes. - -temp[1-3]_input Measured temperature. +in1_reset_history Reset input voltage history. + +in[N]_label "vout[1-8]". + LTC2974: N=2-5 + LTC2978: N=2-9 + LTC3880: N=2-3 + LTC3883: N=2 +in[N]_input Measured output voltage. +in[N]_min Minimum output voltage. +in[N]_max Maximum output voltage. +in[N]_lcrit Critical minimum output voltage. +in[N]_crit Critical maximum output voltage. +in[N]_min_alarm Output voltage low alarm. +in[N]_max_alarm Output voltage high alarm. +in[N]_lcrit_alarm Output voltage critical low alarm. +in[N]_crit_alarm Output voltage critical high alarm. +in[N]_lowest Lowest output voltage. LTC2974 and LTC2978 only. +in[N]_highest Highest output voltage. +in[N]_reset_history Reset output voltage history. + +temp[N]_input Measured temperature. + On LTC2974, temp[1-4] report external temperatures, + and temp5 reports the chip temperature. On LTC2978, only one temperature measurement is - supported and reflects the internal temperature. + supported and reports the chip temperature. On LTC3880, temp1 and temp2 report external - temperatures, and temp3 reports the internal - temperature. -temp[1-3]_min Mimimum temperature. -temp[1-3]_max Maximum temperature. -temp[1-3]_lcrit Critical low temperature. -temp[1-3]_crit Critical high temperature. -temp[1-3]_min_alarm Chip temperature low alarm. -temp[1-3]_max_alarm Chip temperature high alarm. -temp[1-3]_lcrit_alarm Chip temperature critical low alarm. -temp[1-3]_crit_alarm Chip temperature critical high alarm. -temp[1-3]_lowest Lowest measured temperature. LTC2978 only. -temp[1-3]_highest Highest measured temperature. -temp[1-3]_reset_history Reset history. Writing into this attribute will reset - history for all attributes. - -power[1-2]_label "pout[1-2]". LTC3880 only. -power[1-2]_input Measured power. - -curr1_label "iin". LTC3880 only. + temperatures, and temp3 reports the chip temperature. + On LTC3883, temp1 reports an external temperature, + and temp2 reports the chip temperature. +temp[N]_min Mimimum temperature. LTC2974 and LTC2978 only. +temp[N]_max Maximum temperature. +temp[N]_lcrit Critical low temperature. +temp[N]_crit Critical high temperature. +temp[N]_min_alarm Temperature low alarm. LTC2974 and LTC2978 only. +temp[N]_max_alarm Temperature high alarm. +temp[N]_lcrit_alarm Temperature critical low alarm. +temp[N]_crit_alarm Temperature critical high alarm. +temp[N]_lowest Lowest measured temperature. LTC2974 and LTC2978 only. + Not supported for chip temperature sensor on LTC2974. +temp[N]_highest Highest measured temperature. Not supported for chip + temperature sensor on LTC2974. +temp[N]_reset_history Reset temperature history. Not supported for chip + temperature sensor on LTC2974. + +power1_label "pin". LTC3883 only. +power1_input Measured input power. + +power[N]_label "pout[1-4]". + LTC2974: N=1-4 + LTC2978: Not supported + LTC3880: N=1-2 + LTC3883: N=2 +power[N]_input Measured output power. + +curr1_label "iin". LTC3880 and LTC3883 only. curr1_input Measured input current. curr1_max Maximum input current. curr1_max_alarm Input current high alarm. - -curr[2-3]_label "iout[1-2]". LTC3880 only. -curr[2-3]_input Measured input current. -curr[2-3]_max Maximum input current. -curr[2-3]_crit Critical input current. -curr[2-3]_max_alarm Input current high alarm. -curr[2-3]_crit_alarm Input current critical high alarm. +curr1_highest Highest input current. LTC3883 only. +curr1_reset_history Reset input current history. LTC3883 only. + +curr[N]_label "iout[1-4]". + LTC2974: N=1-4 + LTC2978: not supported + LTC3880: N=2-3 + LTC3883: N=2 +curr[N]_input Measured output current. +curr[N]_max Maximum output current. +curr[N]_crit Critical high output current. +curr[N]_lcrit Critical low output current. LTC2974 only. +curr[N]_max_alarm Output current high alarm. +curr[N]_crit_alarm Output current critical high alarm. +curr[N]_lcrit_alarm Output current critical low alarm. LTC2974 only. +curr[N]_lowest Lowest output current. LTC2974 only. +curr[N]_highest Highest output current. +curr[N]_reset_history Reset output current history. diff --git a/Documentation/hwmon/ltc4261 b/Documentation/hwmon/ltc4261 index eba2e2c4b94d..9378a75c6134 100644 --- a/Documentation/hwmon/ltc4261 +++ b/Documentation/hwmon/ltc4261 @@ -8,7 +8,7 @@ Supported chips: Datasheet: http://cds.linear.com/docs/Datasheet/42612fb.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/max16064 b/Documentation/hwmon/max16064 index f8b478076f6d..d59cc7829bec 100644 --- a/Documentation/hwmon/max16064 +++ b/Documentation/hwmon/max16064 @@ -7,7 +7,7 @@ Supported chips: Addresses scanned: - Datasheet: http://datasheets.maxim-ic.com/en/ds/MAX16064.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/max16065 b/Documentation/hwmon/max16065 index c11f64a1f2ad..208a29e43010 100644 --- a/Documentation/hwmon/max16065 +++ b/Documentation/hwmon/max16065 @@ -24,7 +24,7 @@ Supported chips: http://datasheets.maxim-ic.com/en/ds/MAX16070-MAX16071.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/max34440 b/Documentation/hwmon/max34440 index 47651ff341ae..37cbf472a19d 100644 --- a/Documentation/hwmon/max34440 +++ b/Documentation/hwmon/max34440 @@ -27,7 +27,7 @@ Supported chips: Addresses scanned: - Datasheet: http://datasheets.maximintegrated.com/en/ds/MAX34461.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/max8688 b/Documentation/hwmon/max8688 index fe849871df32..e78078638b91 100644 --- a/Documentation/hwmon/max8688 +++ b/Documentation/hwmon/max8688 @@ -7,7 +7,7 @@ Supported chips: Addresses scanned: - Datasheet: http://datasheets.maxim-ic.com/en/ds/MAX8688.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/nct6775 b/Documentation/hwmon/nct6775 new file mode 100644 index 000000000000..4e9ef60e8c6c --- /dev/null +++ b/Documentation/hwmon/nct6775 @@ -0,0 +1,188 @@ +Note +==== + +This driver supersedes the NCT6775F and NCT6776F support in the W83627EHF +driver. + +Kernel driver NCT6775 +===================== + +Supported chips: + * Nuvoton NCT5572D/NCT6771F/NCT6772F/NCT6775F/W83677HG-I + Prefix: 'nct6775' + Addresses scanned: ISA address retrieved from Super I/O registers + Datasheet: Available from Nuvoton upon request + * Nuvoton NCT5577D/NCT6776D/NCT6776F + Prefix: 'nct6776' + Addresses scanned: ISA address retrieved from Super I/O registers + Datasheet: Available from Nuvoton upon request + * Nuvoton NCT5532D/NCT6779D + Prefix: 'nct6779' + Addresses scanned: ISA address retrieved from Super I/O registers + Datasheet: Available from Nuvoton upon request + +Authors: + Guenter Roeck <linux@roeck-us.net> + +Description +----------- + +This driver implements support for the Nuvoton NCT6775F, NCT6776F, and NCT6779D +and compatible super I/O chips. + +The chips support up to 25 temperature monitoring sources. Up to 6 of those are +direct temperature sensor inputs, the others are special sources such as PECI, +PCH, and SMBUS. Depending on the chip type, 2 to 6 of the temperature sources +can be monitored and compared against minimum, maximum, and critical +temperatures. The driver reports up to 10 of the temperatures to the user. +There are 4 to 5 fan rotation speed sensors, 8 to 15 analog voltage sensors, +one VID, alarms with beep warnings (control unimplemented), and some automatic +fan regulation strategies (plus manual fan control mode). + +The temperature sensor sources on all chips are configurable. The configured +source for each of the temperature sensors is provided in tempX_label. + +Temperatures are measured in degrees Celsius and measurement resolution is +either 1 degC or 0.5 degC, depending on the temperature source and +configuration. An alarm is triggered when the temperature gets higher than +the high limit; it stays on until the temperature falls below the hysteresis +value. Alarms are only supported for temp1 to temp6, depending on the chip type. + +Fan rotation speeds are reported in RPM (rotations per minute). An alarm is +triggered if the rotation speed has dropped below a programmable limit. On +NCT6775F, fan readings can be divided by a programmable divider (1, 2, 4, 8, +16, 32, 64 or 128) to give the readings more range or accuracy; the other chips +do not have a fan speed divider. The driver sets the most suitable fan divisor +itself; specifically, it increases the divider value each time a fan speed +reading returns an invalid value, and it reduces it if the fan speed reading +is lower than optimal. Some fans might not be present because they share pins +with other functions. + +Voltage sensors (also known as IN sensors) report their values in millivolts. +An alarm is triggered if the voltage has crossed a programmable minimum +or maximum limit. + +The driver supports automatic fan control mode known as Thermal Cruise. +In this mode, the chip attempts to keep the measured temperature in a +predefined temperature range. If the temperature goes out of range, fan +is driven slower/faster to reach the predefined range again. + +The mode works for fan1-fan5. + +sysfs attributes +---------------- + +pwm[1-5] - this file stores PWM duty cycle or DC value (fan speed) in range: + 0 (lowest speed) to 255 (full) + +pwm[1-5]_enable - this file controls mode of fan/temperature control: + * 0 Fan control disabled (fans set to maximum speed) + * 1 Manual mode, write to pwm[0-5] any value 0-255 + * 2 "Thermal Cruise" mode + * 3 "Fan Speed Cruise" mode + * 4 "Smart Fan III" mode (NCT6775F only) + * 5 "Smart Fan IV" mode + +pwm[1-5]_mode - controls if output is PWM or DC level + * 0 DC output + * 1 PWM output + +Common fan control attributes +----------------------------- + +pwm[1-5]_temp_sel Temperature source. Value is temperature sensor index. + For example, select '1' for temp1_input. +pwm[1-5]_weight_temp_sel + Secondary temperature source. Value is temperature + sensor index. For example, select '1' for temp1_input. + Set to 0 to disable secondary temperature control. + +If secondary temperature functionality is enabled, it is controlled with the +following attributes. + +pwm[1-5]_weight_duty_step + Duty step size. +pwm[1-5]_weight_temp_step + Temperature step size. With each step over + temp_step_base, the value of weight_duty_step is added + to the current pwm value. +pwm[1-5]_weight_temp_step_base + Temperature at which secondary temperature control kicks + in. +pwm[1-5]_weight_temp_step_tol + Temperature step tolerance. + +Thermal Cruise mode (2) +----------------------- + +If the temperature is in the range defined by: + +pwm[1-5]_target_temp Target temperature, unit millidegree Celsius + (range 0 - 127000) +pwm[1-5]_temp_tolerance + Target temperature tolerance, unit millidegree Celsius + +there are no changes to fan speed. Once the temperature leaves the interval, fan +speed increases (if temperature is higher that desired) or decreases (if +temperature is lower than desired), using the following limits and time +intervals. + +pwm[1-5]_start fan pwm start value (range 1 - 255), to start fan + when the temperature is above defined range. +pwm[1-5]_floor lowest fan pwm (range 0 - 255) if temperature is below + the defined range. If set to 0, the fan is expected to + stop if the temperature is below the defined range. +pwm[1-5]_step_up_time milliseconds before fan speed is increased +pwm[1-5]_step_down_time milliseconds before fan speed is decreased +pwm[1-5]_stop_time how many milliseconds must elapse to switch + corresponding fan off (when the temperature was below + defined range). + +Speed Cruise mode (3) +--------------------- + +This modes tries to keep the fan speed constant. + +fan[1-5]_target Target fan speed +fan[1-5]_tolerance + Target speed tolerance + + +Untested; use at your own risk. + +Smart Fan IV mode (5) +--------------------- + +This mode offers multiple slopes to control the fan speed. The slopes can be +controlled by setting the pwm and temperature attributes. When the temperature +rises, the chip will calculate the DC/PWM output based on the current slope. +There are up to seven data points depending on the chip type. Subsequent data +points should be set to higher temperatures and higher pwm values to achieve +higher fan speeds with increasing temperature. The last data point reflects +critical temperature mode, in which the fans should run at full speed. + +pwm[1-5]_auto_point[1-7]_pwm + pwm value to be set if temperature reaches matching + temperature range. +pwm[1-5]_auto_point[1-7]_temp + Temperature over which the matching pwm is enabled. +pwm[1-5]_temp_tolerance + Temperature tolerance, unit millidegree Celsius +pwm[1-5]_crit_temp_tolerance + Temperature tolerance for critical temperature, + unit millidegree Celsius + +pwm[1-5]_step_up_time milliseconds before fan speed is increased +pwm[1-5]_step_down_time milliseconds before fan speed is decreased + +Usage Notes +----------- + +On various ASUS boards with NCT6776F, it appears that CPUTIN is not really +connected to anything and floats, or that it is connected to some non-standard +temperature measurement device. As a result, the temperature reported on CPUTIN +will not reflect a usable value. It often reports unreasonably high +temperatures, and in some cases the reported temperature declines if the actual +temperature increases (similar to the raw PECI temperature value - see PECI +specification for details). CPUTIN should therefore be be ignored on ASUS +boards. The CPU temperature on ASUS boards is reported from PECI 0. diff --git a/Documentation/hwmon/pmbus b/Documentation/hwmon/pmbus index 3d3a0f97f966..cf756ed48ff9 100644 --- a/Documentation/hwmon/pmbus +++ b/Documentation/hwmon/pmbus @@ -34,7 +34,7 @@ Supported chips: Addresses scanned: - Datasheet: n.a. -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/sht15 b/Documentation/hwmon/sht15 index 02850bdfac18..778987d1856f 100644 --- a/Documentation/hwmon/sht15 +++ b/Documentation/hwmon/sht15 @@ -40,7 +40,7 @@ bits for humidity, or 12 bits for temperature and 8 bits for humidity. The humidity calibration coefficients are programmed into an OTP memory on the chip. These coefficients are used to internally calibrate the signals from the sensors. Disabling the reload of those coefficients allows saving 10ms for each -measurement and decrease power consumption, while loosing on precision. +measurement and decrease power consumption, while losing on precision. Some options may be set directly in the sht15_platform_data structure or via sysfs attributes. diff --git a/Documentation/hwmon/smm665 b/Documentation/hwmon/smm665 index 59e316140542..a341eeedab75 100644 --- a/Documentation/hwmon/smm665 +++ b/Documentation/hwmon/smm665 @@ -29,7 +29,7 @@ Supported chips: http://www.summitmicro.com/prod_select/summary/SMM766/SMM766_2086.pdf http://www.summitmicro.com/prod_select/summary/SMM766B/SMM766B_2122.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Module Parameters diff --git a/Documentation/hwmon/tmp401 b/Documentation/hwmon/tmp401 index 9fc447249212..f91e3fa7e5ec 100644 --- a/Documentation/hwmon/tmp401 +++ b/Documentation/hwmon/tmp401 @@ -8,8 +8,16 @@ Supported chips: Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp401.html * Texas Instruments TMP411 Prefix: 'tmp411' - Addresses scanned: I2C 0x4c + Addresses scanned: I2C 0x4c, 0x4d, 0x4e Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp411.html + * Texas Instruments TMP431 + Prefix: 'tmp431' + Addresses scanned: I2C 0x4c, 0x4d + Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp431.html + * Texas Instruments TMP432 + Prefix: 'tmp432' + Addresses scanned: I2C 0x4c, 0x4d + Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp432.html Authors: Hans de Goede <hdegoede@redhat.com> @@ -18,19 +26,19 @@ Authors: Description ----------- -This driver implements support for Texas Instruments TMP401 and -TMP411 chips. These chips implements one remote and one local -temperature sensor. Temperature is measured in degrees +This driver implements support for Texas Instruments TMP401, TMP411, +TMP431, and TMP432 chips. These chips implement one or two remote and +one local temperature sensors. Temperature is measured in degrees Celsius. Resolution of the remote sensor is 0.0625 degree. Local sensor resolution can be set to 0.5, 0.25, 0.125 or 0.0625 degree (not supported by the driver so far, so using the default resolution of 0.5 degree). The driver provides the common sysfs-interface for temperatures (see -/Documentation/hwmon/sysfs-interface under Temperatures). +Documentation/hwmon/sysfs-interface under Temperatures). -The TMP411 chip is compatible with TMP401. It provides some additional -features. +The TMP411 and TMP431 chips are compatible with TMP401. TMP411 provides +some additional features. * Minimum and Maximum temperature measured since power-on, chip-reset @@ -40,3 +48,6 @@ features. Exported via sysfs attribute temp_reset_history. Writing 1 to this file triggers a reset. + +TMP432 is compatible with TMP401 and TMP431. It supports two external +temperature sensors. diff --git a/Documentation/hwmon/ucd9000 b/Documentation/hwmon/ucd9000 index 0df5f276505b..805e33edb978 100644 --- a/Documentation/hwmon/ucd9000 +++ b/Documentation/hwmon/ucd9000 @@ -11,7 +11,7 @@ Supported chips: http://focus.ti.com/lit/ds/symlink/ucd9090.pdf http://focus.ti.com/lit/ds/symlink/ucd90910.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/ucd9200 b/Documentation/hwmon/ucd9200 index fd7d07b1908a..1e8060e631bd 100644 --- a/Documentation/hwmon/ucd9200 +++ b/Documentation/hwmon/ucd9200 @@ -15,7 +15,7 @@ Supported chips: http://focus.ti.com/lit/ds/symlink/ucd9246.pdf http://focus.ti.com/lit/ds/symlink/ucd9248.pdf -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description diff --git a/Documentation/hwmon/zl6100 b/Documentation/hwmon/zl6100 index 3d924b6b59e9..33908a4d68ff 100644 --- a/Documentation/hwmon/zl6100 +++ b/Documentation/hwmon/zl6100 @@ -54,7 +54,7 @@ http://archive.ericsson.net/service/internet/picov/get?DocNo=28701-EN/LZT146401 http://archive.ericsson.net/service/internet/picov/get?DocNo=28701-EN/LZT146256 -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description @@ -125,7 +125,7 @@ in2_label "vmon" in2_input Measured voltage on VMON (ZL2004) or VDRV (ZL9101M, ZL9117M) pin. Reported voltage is 16x the voltage on the pin (adjusted internally by the chip). -in2_lcrit Critical minumum VMON/VDRV Voltage. +in2_lcrit Critical minimum VMON/VDRV Voltage. in2_crit Critical maximum VMON/VDRV voltage. in2_lcrit_alarm VMON/VDRV voltage critical low alarm. in2_crit_alarm VMON/VDRV voltage critical high alarm. diff --git a/Documentation/i2c/busses/i2c-diolan-u2c b/Documentation/i2c/busses/i2c-diolan-u2c index 30fe4bb9a069..0d6018c316c7 100644 --- a/Documentation/i2c/busses/i2c-diolan-u2c +++ b/Documentation/i2c/busses/i2c-diolan-u2c @@ -5,7 +5,7 @@ Supported adapters: Documentation: http://www.diolan.com/i2c/u2c12.html -Author: Guenter Roeck <guenter.roeck@ericsson.com> +Author: Guenter Roeck <linux@roeck-us.net> Description ----------- diff --git a/Documentation/ia64/err_inject.txt b/Documentation/ia64/err_inject.txt index 223e4f0582d0..9f651c181429 100644 --- a/Documentation/ia64/err_inject.txt +++ b/Documentation/ia64/err_inject.txt @@ -882,7 +882,7 @@ int err_inj() cpu=parameters[i].cpu; k = cpu%64; j = cpu/64; - mask[j]=1<<k; + mask[j] = 1UL << k; if (sched_setaffinity(0, MASK_SIZE*8, mask)==-1) { perror("Error sched_setaffinity:"); diff --git a/Documentation/input/alps.txt b/Documentation/input/alps.txt index 3262b6e4d686..e544c7ff8cfa 100644 --- a/Documentation/input/alps.txt +++ b/Documentation/input/alps.txt @@ -3,10 +3,26 @@ ALPS Touchpad Protocol Introduction ------------ - -Currently the ALPS touchpad driver supports four protocol versions in use by -ALPS touchpads, called versions 1, 2, 3, and 4. Information about the various -protocol versions is contained in the following sections. +Currently the ALPS touchpad driver supports five protocol versions in use by +ALPS touchpads, called versions 1, 2, 3, 4 and 5. + +Since roughly mid-2010 several new ALPS touchpads have been released and +integrated into a variety of laptops and netbooks. These new touchpads +have enough behavior differences that the alps_model_data definition +table, describing the properties of the different versions, is no longer +adequate. The design choices were to re-define the alps_model_data +table, with the risk of regression testing existing devices, or isolate +the new devices outside of the alps_model_data table. The latter design +choice was made. The new touchpad signatures are named: "Rushmore", +"Pinnacle", and "Dolphin", which you will see in the alps.c code. +For the purposes of this document, this group of ALPS touchpads will +generically be called "new ALPS touchpads". + +We experimented with probing the ACPI interface _HID (Hardware ID)/_CID +(Compatibility ID) definition as a way to uniquely identify the +different ALPS variants but there did not appear to be a 1:1 mapping. +In fact, it appeared to be an m:n mapping between the _HID and actual +hardware type. Detection --------- @@ -20,9 +36,13 @@ If the E6 report is successful, the touchpad model is identified using the "E7 report" sequence: E8-E7-E7-E7-E9. The response is the model signature and is matched against known models in the alps_model_data_array. -With protocol versions 3 and 4, the E7 report model signature is always -73-02-64. To differentiate between these versions, the response from the -"Enter Command Mode" sequence must be inspected as described below. +For older touchpads supporting protocol versions 3 and 4, the E7 report +model signature is always 73-02-64. To differentiate between these +versions, the response from the "Enter Command Mode" sequence must be +inspected as described below. + +The new ALPS touchpads have an E7 signature of 73-03-50 or 73-03-0A but +seem to be better differentiated by the EC Command Mode response. Command Mode ------------ @@ -47,6 +67,14 @@ address of the register being read, and the third contains the value of the register. Registers are written by writing the value one nibble at a time using the same encoding used for addresses. +For the new ALPS touchpads, the EC command is used to enter command +mode. The response in the new ALPS touchpads is significantly different, +and more important in determining the behavior. This code has been +separated from the original alps_model_data table and put in the +alps_identify function. For example, there seem to be two hardware init +sequences for the "Dolphin" touchpads as determined by the second byte +of the EC response. + Packet Format ------------- @@ -187,3 +215,28 @@ There are several things worth noting here. well. So far no v4 devices with tracksticks have been encountered. + +ALPS Absolute Mode - Protocol Version 5 +--------------------------------------- +This is basically Protocol Version 3 but with different logic for packet +decode. It uses the same alps_process_touchpad_packet_v3 call with a +specialized decode_fields function pointer to correctly interpret the +packets. This appears to only be used by the Dolphin devices. + +For single-touch, the 6-byte packet format is: + + byte 0: 1 1 0 0 1 0 0 0 + byte 1: 0 x6 x5 x4 x3 x2 x1 x0 + byte 2: 0 y6 y5 y4 y3 y2 y1 y0 + byte 3: 0 M R L 1 m r l + byte 4: y10 y9 y8 y7 x10 x9 x8 x7 + byte 5: 0 z6 z5 z4 z3 z2 z1 z0 + +For mt, the format is: + + byte 0: 1 1 1 n3 1 n2 n1 x24 + byte 1: 1 y7 y6 y5 y4 y3 y2 y1 + byte 2: ? x2 x1 y12 y11 y10 y9 y8 + byte 3: 0 x23 x22 x21 x20 x19 x18 x17 + byte 4: 0 x9 x8 x7 x6 x5 x4 x3 + byte 5: 0 x16 x15 x14 x13 x12 x11 x10 diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 3210540f8bd3..237acab169dd 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -131,6 +131,7 @@ Code Seq#(hex) Include File Comments 'H' 40-4F sound/hdspm.h conflict! 'H' 40-4F sound/hdsp.h conflict! 'H' 90 sound/usb/usx2y/usb_stream.h +'H' A0 uapi/linux/usb/cdc-wdm.h 'H' C0-F0 net/bluetooth/hci.h conflict! 'H' C0-DF net/bluetooth/hidp/hidp.h conflict! 'H' C0-DF net/bluetooth/cmtp/cmtp.h conflict! diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 13f1aa09b938..9c7fd988e299 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -297,6 +297,7 @@ Boot into System Kernel On ia64, 256M@256M is a generous value that typically works. The region may be automatically placed on ia64, see the dump-capture kernel config option notes above. + If use sparse memory, the size should be rounded to GRANULE boundaries. On s390x, typically use "crashkernel=xxM". The value of xx is dependent on the memory consumption of the kdump system. In general this is not diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4865e9bfd08d..7d55ebb5660c 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -44,6 +44,7 @@ parameter is applicable: AVR32 AVR32 architecture is enabled. AX25 Appropriate AX.25 support is enabled. BLACKFIN Blackfin architecture is enabled. + CLK Common clock infrastructure is enabled. DRM Direct Rendering Management support is enabled. DYNAMIC_DEBUG Build in debug messages and enable them at runtime EDD BIOS Enhanced Disk Drive Services (EDD) is enabled @@ -320,6 +321,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. on: enable for both 32- and 64-bit processes off: disable for both 32- and 64-bit processes + alloc_snapshot [FTRACE] + Allocate the ftrace snapshot buffer on boot up when the + main buffer is allocated. This is handy if debugging + and you need to use tracing_snapshot() on boot up, and + do not want to use tracing_snapshot_alloc() as it needs + to be done where GFP_KERNEL allocations are allowed. + amd_iommu= [HW,X86-64] Pass parameters to the AMD IOMMU driver in the system. Possible values are: @@ -465,6 +473,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. cio_ignore= [S390] See Documentation/s390/CommonIO for details. + clk_ignore_unused + [CLK] + Keep all clocks already enabled by bootloader on, + even if no driver has claimed them. This is useful + for debug and development, but should not be + needed on a platform with proper driver support. + For more information, see Documentation/clk.txt. clock= [BUGS=X86-32, HW] gettimeofday clocksource override. [Deprecated] @@ -596,9 +611,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted. is selected automatically. Check Documentation/kdump/kdump.txt for further details. - crashkernel_low=size[KMG] - [KNL, x86] parts under 4G. - crashkernel=range1:size1[,range2:size2,...][@offset] [KNL] Same as above, but depends on the memory in the running system. The syntax of range is @@ -606,6 +618,26 @@ bytes respectively. Such letter suffixes can also be entirely omitted. a memory unit (amount[KMG]). See also Documentation/kdump/kdump.txt for an example. + crashkernel=size[KMG],high + [KNL, x86_64] range could be above 4G. Allow kernel + to allocate physical memory region from top, so could + be above 4G if system have more than 4G ram installed. + Otherwise memory region will be allocated below 4G, if + available. + It will be ignored if crashkernel=X is specified. + crashkernel=size[KMG],low + [KNL, x86_64] range under 4G. When crashkernel=X,high + is passed, kernel could allocate physical memory region + above 4G, that cause second kernel crash on system + that require some amount of low memory, e.g. swiotlb + requires at least 64M+32K low memory. Kernel would + try to allocate 72M below 4G automatically. + This one let user to specify own low range under 4G + for second kernel instead. + 0: to disable low allocation. + It will be ignored when crashkernel=X,high is not used + or memory reserved is below 4G. + cs89x0_dma= [HW,NET] Format: <dma> @@ -788,6 +820,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. edd= [EDD] Format: {"off" | "on" | "skip[mbr]"} + efi_no_storage_paranoia [EFI; X86] + Using this parameter you can use more than 50% of + your efi variable storage. Use this parameter only if + you are really sure that your UEFI does sane gc and + fulfills the spec otherwise your board may brick. + eisa_irq_edge= [PARISC,HW] See header of drivers/parisc/eisa.c. @@ -2469,9 +2507,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs. Invocation of these CPUs' RCU callbacks will - be offloaded to "rcuoN" kthreads created for - that purpose. This reduces OS jitter on the + be offloaded to "rcuox/N" kthreads created for + that purpose, where "x" is "b" for RCU-bh, "p" + for RCU-preempt, and "s" for RCU-sched, and "N" + is the CPU number. This reduces OS jitter on the offloaded CPUs, which can be useful for HPC and + real-time workloads. It can also improve energy efficiency for asymmetric multiprocessors. @@ -2495,6 +2536,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted. leaf rcu_node structure. Useful for very large systems. + rcutree.jiffies_till_first_fqs= [KNL,BOOT] + Set delay from grace-period initialization to + first attempt to force quiescent states. + Units are jiffies, minimum value is zero, + and maximum value is HZ. + + rcutree.jiffies_till_next_fqs= [KNL,BOOT] + Set delay between subsequent attempts to force + quiescent states. Units are jiffies, minimum + value is one, and maximum value is HZ. + rcutree.qhimark= [KNL,BOOT] Set threshold of queued RCU callbacks over which batch limiting is disabled. @@ -2509,16 +2561,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. rcutree.rcu_cpu_stall_timeout= [KNL,BOOT] Set timeout for RCU CPU stall warning messages. - rcutree.jiffies_till_first_fqs= [KNL,BOOT] - Set delay from grace-period initialization to - first attempt to force quiescent states. - Units are jiffies, minimum value is zero, - and maximum value is HZ. + rcutree.rcu_idle_gp_delay= [KNL,BOOT] + Set wakeup interval for idle CPUs that have + RCU callbacks (RCU_FAST_NO_HZ=y). - rcutree.jiffies_till_next_fqs= [KNL,BOOT] - Set delay between subsequent attempts to force - quiescent states. Units are jiffies, minimum - value is one, and maximum value is HZ. + rcutree.rcu_idle_lazy_gp_delay= [KNL,BOOT] + Set wakeup interval for idle CPUs that have + only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y). + Lazy RCU callbacks are those which RCU can + prove do nothing more than free memory. rcutorture.fqs_duration= [KNL,BOOT] Set duration of force_quiescent_state bursts. @@ -3230,6 +3281,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. or other driver-specific files in the Documentation/watchdog/ directory. + workqueue.disable_numa + By default, all work items queued to unbound + workqueues are affine to the NUMA nodes they're + issued on, which results in better behavior in + general. If NUMA affinity needs to be disabled for + whatever reason, this option can be used. Note + that this also can be controlled per-workqueue for + workqueues visible under /sys/bus/workqueue/. + x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of default x2apic cluster mode on platforms supporting x2apic. diff --git a/Documentation/misc-devices/mei/mei-client-bus.txt b/Documentation/misc-devices/mei/mei-client-bus.txt new file mode 100644 index 000000000000..f83910a8ce76 --- /dev/null +++ b/Documentation/misc-devices/mei/mei-client-bus.txt @@ -0,0 +1,138 @@ +Intel(R) Management Engine (ME) Client bus API +=============================================== + + +Rationale +========= +MEI misc character device is useful for dedicated applications to send and receive +data to the many FW appliance found in Intel's ME from the user space. +However for some of the ME functionalities it make sense to leverage existing software +stack and expose them through existing kernel subsystems. + +In order to plug seamlessly into the kernel device driver model we add kernel virtual +bus abstraction on top of the MEI driver. This allows implementing linux kernel drivers +for the various MEI features as a stand alone entities found in their respective subsystem. +Existing device drivers can even potentially be re-used by adding an MEI CL bus layer to +the existing code. + + +MEI CL bus API +=========== +A driver implementation for an MEI Client is very similar to existing bus +based device drivers. The driver registers itself as an MEI CL bus driver through +the mei_cl_driver structure: + +struct mei_cl_driver { + struct device_driver driver; + const char *name; + + const struct mei_cl_device_id *id_table; + + int (*probe)(struct mei_cl_device *dev, const struct mei_cl_id *id); + int (*remove)(struct mei_cl_device *dev); +}; + +struct mei_cl_id { + char name[MEI_NAME_SIZE]; + kernel_ulong_t driver_info; +}; + +The mei_cl_id structure allows the driver to bind itself against a device name. + +To actually register a driver on the ME Client bus one must call the mei_cl_add_driver() +API. This is typically called at module init time. + +Once registered on the ME Client bus, a driver will typically try to do some I/O on +this bus and this should be done through the mei_cl_send() and mei_cl_recv() +routines. The latter is synchronous (blocks and sleeps until data shows up). +In order for drivers to be notified of pending events waiting for them (e.g. +an Rx event) they can register an event handler through the +mei_cl_register_event_cb() routine. Currently only the MEI_EVENT_RX event +will trigger an event handler call and the driver implementation is supposed +to call mei_recv() from the event handler in order to fetch the pending +received buffers. + + +Example +======= +As a theoretical example let's pretend the ME comes with a "contact" NFC IP. +The driver init and exit routines for this device would look like: + +#define CONTACT_DRIVER_NAME "contact" + +static struct mei_cl_device_id contact_mei_cl_tbl[] = { + { CONTACT_DRIVER_NAME, }, + + /* required last entry */ + { } +}; +MODULE_DEVICE_TABLE(mei_cl, contact_mei_cl_tbl); + +static struct mei_cl_driver contact_driver = { + .id_table = contact_mei_tbl, + .name = CONTACT_DRIVER_NAME, + + .probe = contact_probe, + .remove = contact_remove, +}; + +static int contact_init(void) +{ + int r; + + r = mei_cl_driver_register(&contact_driver); + if (r) { + pr_err(CONTACT_DRIVER_NAME ": driver registration failed\n"); + return r; + } + + return 0; +} + +static void __exit contact_exit(void) +{ + mei_cl_driver_unregister(&contact_driver); +} + +module_init(contact_init); +module_exit(contact_exit); + +And the driver's simplified probe routine would look like that: + +int contact_probe(struct mei_cl_device *dev, struct mei_cl_device_id *id) +{ + struct contact_driver *contact; + + [...] + mei_cl_enable_device(dev); + + mei_cl_register_event_cb(dev, contact_event_cb, contact); + + return 0; + } + +In the probe routine the driver first enable the MEI device and then registers +an ME bus event handler which is as close as it can get to registering a +threaded IRQ handler. +The handler implementation will typically call some I/O routine depending on +the pending events: + +#define MAX_NFC_PAYLOAD 128 + +static void contact_event_cb(struct mei_cl_device *dev, u32 events, + void *context) +{ + struct contact_driver *contact = context; + + if (events & BIT(MEI_EVENT_RX)) { + u8 payload[MAX_NFC_PAYLOAD]; + int payload_size; + + payload_size = mei_recv(dev, payload, MAX_NFC_PAYLOAD); + if (payload_size <= 0) + return; + + /* Hook to the NFC subsystem */ + nfc_hci_recv_frame(contact->hdev, payload, payload_size); + } +} diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt index f2a2488f1bf3..9573d0c48c6e 100644 --- a/Documentation/networking/ipvs-sysctl.txt +++ b/Documentation/networking/ipvs-sysctl.txt @@ -15,6 +15,13 @@ amemthresh - INTEGER enabled and the variable is automatically set to 2, otherwise the strategy is disabled and the variable is set to 1. +backup_only - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + conntrack - BOOLEAN 0 - disabled (default) not 0 - enabled diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt index c0aab985bad9..949d5dcdd9a3 100644 --- a/Documentation/networking/tuntap.txt +++ b/Documentation/networking/tuntap.txt @@ -105,6 +105,83 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com> Proto [2 bytes] Raw protocol(IP, IPv6, etc) frame. + 3.3 Multiqueue tuntap interface: + + From version 3.8, Linux supports multiqueue tuntap which can uses multiple + file descriptors (queues) to parallelize packets sending or receiving. The + device allocation is the same as before, and if user wants to create multiple + queues, TUNSETIFF with the same device name must be called many times with + IFF_MULTI_QUEUE flag. + + char *dev should be the name of the device, queues is the number of queues to + be created, fds is used to store and return the file descriptors (queues) + created to the caller. Each file descriptor were served as the interface of a + queue which could be accessed by userspace. + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_alloc_mq(char *dev, int queues, int *fds) + { + struct ifreq ifr; + int fd, err, i; + + if (!dev) + return -1; + + memset(&ifr, 0, sizeof(ifr)); + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + * IFF_MULTI_QUEUE - Create a queue of multiqueue device + */ + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE; + strcpy(ifr.ifr_name, dev); + + for (i = 0; i < queues; i++) { + if ((fd = open("/dev/net/tun", O_RDWR)) < 0) + goto err; + err = ioctl(fd, TUNSETIFF, (void *)&ifr); + if (err) { + close(fd); + goto err; + } + fds[i] = fd; + } + + return 0; + err: + for (--i; i >= 0; i--) + close(fds[i]); + return err; + } + + A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When + calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when + calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were + enabled by default after it was created through TUNSETIFF. + + fd is the file descriptor (queue) that we want to enable or disable, when + enable is true we enable it, otherwise we disable it + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_set_queue(int fd, int enable) + { + struct ifreq ifr; + + memset(&ifr, 0, sizeof(ifr)); + + if (enable) + ifr.ifr_flags = IFF_ATTACH_QUEUE; + else + ifr.ifr_flags = IFF_DETACH_QUEUE; + + return ioctl(fd, TUNSETQUEUE, (void *)&ifr); + } + Universal TUN/TAP device driver Frequently Asked Question. 1. What platforms are supported by TUN/TAP driver ? diff --git a/Documentation/pinctrl.txt b/Documentation/pinctrl.txt index a2b57e0a1db0..447fd4cd54ec 100644 --- a/Documentation/pinctrl.txt +++ b/Documentation/pinctrl.txt @@ -736,6 +736,13 @@ All the above functions are mandatory to implement for a pinmux driver. Pin control interaction with the GPIO subsystem =============================================== +Note that the following implies that the use case is to use a certain pin +from the Linux kernel using the API in <linux/gpio.h> with gpio_request() +and similar functions. There are cases where you may be using something +that your datasheet calls "GPIO mode" but actually is just an electrical +configuration for a certain device. See the section below named +"GPIO mode pitfalls" for more details on this scenario. + The public pinmux API contains two functions named pinctrl_request_gpio() and pinctrl_free_gpio(). These two functions shall *ONLY* be called from gpiolib-based drivers as part of their gpio_request() and @@ -774,6 +781,111 @@ obtain the function "gpioN" where "N" is the global GPIO pin number if no special GPIO-handler is registered. +GPIO mode pitfalls +================== + +Sometime the developer may be confused by a datasheet talking about a pin +being possible to set into "GPIO mode". It appears that what hardware +engineers mean with "GPIO mode" is not necessarily the use case that is +implied in the kernel interface <linux/gpio.h>: a pin that you grab from +kernel code and then either listen for input or drive high/low to +assert/deassert some external line. + +Rather hardware engineers think that "GPIO mode" means that you can +software-control a few electrical properties of the pin that you would +not be able to control if the pin was in some other mode, such as muxed in +for a device. + +Example: a pin is usually muxed in to be used as a UART TX line. But during +system sleep, we need to put this pin into "GPIO mode" and ground it. + +If you make a 1-to-1 map to the GPIO subsystem for this pin, you may start +to think that you need to come up with something real complex, that the +pin shall be used for UART TX and GPIO at the same time, that you will grab +a pin control handle and set it to a certain state to enable UART TX to be +muxed in, then twist it over to GPIO mode and use gpio_direction_output() +to drive it low during sleep, then mux it over to UART TX again when you +wake up and maybe even gpio_request/gpio_free as part of this cycle. This +all gets very complicated. + +The solution is to not think that what the datasheet calls "GPIO mode" +has to be handled by the <linux/gpio.h> interface. Instead view this as +a certain pin config setting. Look in e.g. <linux/pinctrl/pinconf-generic.h> +and you find this in the documentation: + + PIN_CONFIG_OUTPUT: this will configure the pin in output, use argument + 1 to indicate high level, argument 0 to indicate low level. + +So it is perfectly possible to push a pin into "GPIO mode" and drive the +line low as part of the usual pin control map. So for example your UART +driver may look like this: + +#include <linux/pinctrl/consumer.h> + +struct pinctrl *pinctrl; +struct pinctrl_state *pins_default; +struct pinctrl_state *pins_sleep; + +pins_default = pinctrl_lookup_state(uap->pinctrl, PINCTRL_STATE_DEFAULT); +pins_sleep = pinctrl_lookup_state(uap->pinctrl, PINCTRL_STATE_SLEEP); + +/* Normal mode */ +retval = pinctrl_select_state(pinctrl, pins_default); +/* Sleep mode */ +retval = pinctrl_select_state(pinctrl, pins_sleep); + +And your machine configuration may look like this: +-------------------------------------------------- + +static unsigned long uart_default_mode[] = { + PIN_CONF_PACKED(PIN_CONFIG_DRIVE_PUSH_PULL, 0), +}; + +static unsigned long uart_sleep_mode[] = { + PIN_CONF_PACKED(PIN_CONFIG_OUTPUT, 0), +}; + +static struct pinctrl_map __initdata pinmap[] = { + PIN_MAP_MUX_GROUP("uart", PINCTRL_STATE_DEFAULT, "pinctrl-foo", + "u0_group", "u0"), + PIN_MAP_CONFIGS_PIN("uart", PINCTRL_STATE_DEFAULT, "pinctrl-foo", + "UART_TX_PIN", uart_default_mode), + PIN_MAP_MUX_GROUP("uart", PINCTRL_STATE_SLEEP, "pinctrl-foo", + "u0_group", "gpio-mode"), + PIN_MAP_CONFIGS_PIN("uart", PINCTRL_STATE_SLEEP, "pinctrl-foo", + "UART_TX_PIN", uart_sleep_mode), +}; + +foo_init(void) { + pinctrl_register_mappings(pinmap, ARRAY_SIZE(pinmap)); +} + +Here the pins we want to control are in the "u0_group" and there is some +function called "u0" that can be enabled on this group of pins, and then +everything is UART business as usual. But there is also some function +named "gpio-mode" that can be mapped onto the same pins to move them into +GPIO mode. + +This will give the desired effect without any bogus interaction with the +GPIO subsystem. It is just an electrical configuration used by that device +when going to sleep, it might imply that the pin is set into something the +datasheet calls "GPIO mode" but that is not the point: it is still used +by that UART device to control the pins that pertain to that very UART +driver, putting them into modes needed by the UART. GPIO in the Linux +kernel sense are just some 1-bit line, and is a different use case. + +How the registers are poked to attain the push/pull and output low +configuration and the muxing of the "u0" or "gpio-mode" group onto these +pins is a question for the driver. + +Some datasheets will be more helpful and refer to the "GPIO mode" as +"low power mode" rather than anything to do with GPIO. This often means +the same thing electrically speaking, but in this latter case the +software engineers will usually quickly identify that this is some +specific muxing/configuration rather than anything related to the GPIO +API. + + Board/machine configuration ================================== diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.txt index 3035d00757ad..425c51d56aef 100644 --- a/Documentation/power/opp.txt +++ b/Documentation/power/opp.txt @@ -1,6 +1,5 @@ -*=============* -* OPP Library * -*=============* +Operating Performance Points (OPP) Library +========================================== (C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated @@ -16,15 +15,31 @@ Contents 1. Introduction =============== +1.1 What is an Operating Performance Point (OPP)? + Complex SoCs of today consists of a multiple sub-modules working in conjunction. In an operational system executing varied use cases, not all modules in the SoC need to function at their highest performing frequency all the time. To facilitate this, sub-modules in a SoC are grouped into domains, allowing some -domains to run at lower voltage and frequency while other domains are loaded -more. The set of discrete tuples consisting of frequency and voltage pairs that +domains to run at lower voltage and frequency while other domains run at +voltage/frequency pairs that are higher. + +The set of discrete tuples consisting of frequency and voltage pairs that the device will support per domain are called Operating Performance Points or OPPs. +As an example: +Let us consider an MPU device which supports the following: +{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V}, +{1GHz at minimum voltage of 1.3V} + +We can represent these as three OPPs as the following {Hz, uV} tuples: +{300000000, 1000000} +{800000000, 1200000} +{1000000000, 1300000} + +1.2 Operating Performance Points Library + OPP library provides a set of helper functions to organize and query the OPP information. The library is located in drivers/base/power/opp.c and the header is located in include/linux/opp.h. OPP library can be enabled by enabling diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index e8a6aa473bab..6e953564de03 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -170,5 +170,5 @@ Reminder: sizeof() result is of type size_t. Thank you for your cooperation and attention. -By Randy Dunlap <rdunlap@xenotime.net> and +By Randy Dunlap <rdunlap@infradead.org> and Andrew Murray <amurray@mpc-data.co.uk> diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt index ae66f9b90a25..fcaf0b4efba2 100644 --- a/Documentation/s390/s390dbf.txt +++ b/Documentation/s390/s390dbf.txt @@ -143,7 +143,8 @@ Parameter: id: handle for debug log Return Value: none -Description: frees memory for a debug log +Description: frees memory for a debug log and removes all registered debug + views. Must not be called within an interrupt handler --------------------------------------------------------------------------- diff --git a/Documentation/scsi/LICENSE.qla2xxx b/Documentation/scsi/LICENSE.qla2xxx index 27a91cf43d6d..5020b7b5a244 100644 --- a/Documentation/scsi/LICENSE.qla2xxx +++ b/Documentation/scsi/LICENSE.qla2xxx @@ -1,4 +1,4 @@ -Copyright (c) 2003-2012 QLogic Corporation +Copyright (c) 2003-2013 QLogic Corporation QLogic Linux FC-FCoE Driver This program includes a device driver for Linux 3.x. diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index ce6581c8ca26..95731a08f257 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -890,9 +890,8 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. enable_msi - Enable Message Signaled Interrupt (MSI) (default = off) power_save - Automatic power-saving timeout (in second, 0 = disable) - power_save_controller - Support runtime D3 of HD-audio controller - (-1 = on for supported chip (default), false = off, - true = force to on even for unsupported hardware) + power_save_controller - Reset HD-audio controller in power-saving mode + (default = on) align_buffer_size - Force rounding of buffer/period sizes to multiples of 128 bytes. This is more efficient in terms of memory access but isn't required by the HDA spec and prevents @@ -912,7 +911,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. models depending on the codec chip. The list of available models is found in HD-Audio-Models.txt - The model name "genric" is treated as a special case. When this + The model name "generic" is treated as a special case. When this model is given, the driver uses the generic codec parser without "codec-patch". It's sometimes good for testing and debugging. diff --git a/Documentation/sound/alsa/seq_oss.html b/Documentation/sound/alsa/seq_oss.html index d9776cf60c07..9663b45f6fde 100644 --- a/Documentation/sound/alsa/seq_oss.html +++ b/Documentation/sound/alsa/seq_oss.html @@ -285,7 +285,7 @@ sample data. <H4> 7.2.4 Close Callback</H4> The <TT>close</TT> callback is called when this device is closed by the -applicaion. If any private data was allocated in open callback, it must +application. If any private data was allocated in open callback, it must be released in the close callback. The deletion of ALSA port should be done here, too. This callback must not be NULL. <H4> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 078701fdbd4d..dcc75a9ed919 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -18,6 +18,7 @@ files can be found in mm/swap.c. Currently, these files are in /proc/sys/vm: +- admin_reserve_kbytes - block_dump - compact_memory - dirty_background_bytes @@ -53,11 +54,41 @@ Currently, these files are in /proc/sys/vm: - percpu_pagelist_fraction - stat_interval - swappiness +- user_reserve_kbytes - vfs_cache_pressure - zone_reclaim_mode ============================================================== +admin_reserve_kbytes + +The amount of free memory in the system that should be reserved for users +with the capability cap_sys_admin. + +admin_reserve_kbytes defaults to min(3% of free pages, 8MB) + +That should provide enough for the admin to log in and kill a process, +if necessary, under the default overcommit 'guess' mode. + +Systems running under overcommit 'never' should increase this to account +for the full Virtual Memory Size of programs used to recover. Otherwise, +root may not be able to log in to recover the system. + +How do you calculate a minimum useful reserve? + +sshd or login + bash (or some other shell) + top (or ps, kill, etc.) + +For overcommit 'guess', we can sum resident set sizes (RSS). +On x86_64 this is about 8MB. + +For overcommit 'never', we can take the max of their virtual sizes (VSZ) +and add the sum of their RSS. +On x86_64 this is about 128MB. + +Changing this takes effect whenever an application requests memory. + +============================================================== + block_dump block_dump enables block I/O debugging when set to a nonzero value. More @@ -542,6 +573,7 @@ memory until it actually runs out. When this flag is 2, the kernel uses a "never overcommit" policy that attempts to prevent any overcommit of memory. +Note that user_reserve_kbytes affects this policy. This feature can be very useful because there are a lot of programs that malloc() huge amounts of memory "just-in-case" @@ -645,6 +677,24 @@ The default value is 60. ============================================================== +- user_reserve_kbytes + +When overcommit_memory is set to 2, "never overommit" mode, reserve +min(3% of current process size, user_reserve_kbytes) of free memory. +This is intended to prevent a user from starting a single memory hogging +process, such that they cannot recover (kill the hog). + +user_reserve_kbytes defaults to min(3% of the current process size, 128MB). + +If this is reduced to zero, then the user will be allowed to allocate +all free memory with a single process, minus admin_reserve_kbytes. +Any subsequent attempts to execute a command will result in +"fork: Cannot allocate memory". + +Changing this takes effect whenever an application requests memory. + +============================================================== + vfs_cache_pressure ------------------ diff --git a/Documentation/this_cpu_ops.txt b/Documentation/this_cpu_ops.txt new file mode 100644 index 000000000000..1a4ce7e3e05f --- /dev/null +++ b/Documentation/this_cpu_ops.txt @@ -0,0 +1,205 @@ +this_cpu operations +------------------- + +this_cpu operations are a way of optimizing access to per cpu +variables associated with the *currently* executing processor through +the use of segment registers (or a dedicated register where the cpu +permanently stored the beginning of the per cpu area for a specific +processor). + +The this_cpu operations add a per cpu variable offset to the processor +specific percpu base and encode that operation in the instruction +operating on the per cpu variable. + +This means there are no atomicity issues between the calculation of +the offset and the operation on the data. Therefore it is not +necessary to disable preempt or interrupts to ensure that the +processor is not changed between the calculation of the address and +the operation on the data. + +Read-modify-write operations are of particular interest. Frequently +processors have special lower latency instructions that can operate +without the typical synchronization overhead but still provide some +sort of relaxed atomicity guarantee. The x86 for example can execute +RMV (Read Modify Write) instructions like inc/dec/cmpxchg without the +lock prefix and the associated latency penalty. + +Access to the variable without the lock prefix is not synchronized but +synchronization is not necessary since we are dealing with per cpu +data specific to the currently executing processor. Only the current +processor should be accessing that variable and therefore there are no +concurrency issues with other processors in the system. + +On x86 the fs: or the gs: segment registers contain the base of the +per cpu area. It is then possible to simply use the segment override +to relocate a per cpu relative address to the proper per cpu area for +the processor. So the relocation to the per cpu base is encoded in the +instruction via a segment register prefix. + +For example: + + DEFINE_PER_CPU(int, x); + int z; + + z = this_cpu_read(x); + +results in a single instruction + + mov ax, gs:[x] + +instead of a sequence of calculation of the address and then a fetch +from that address which occurs with the percpu operations. Before +this_cpu_ops such sequence also required preempt disable/enable to +prevent the kernel from moving the thread to a different processor +while the calculation is performed. + +The main use of the this_cpu operations has been to optimize counter +operations. + + this_cpu_inc(x) + +results in the following single instruction (no lock prefix!) + + inc gs:[x] + +instead of the following operations required if there is no segment +register. + + int *y; + int cpu; + + cpu = get_cpu(); + y = per_cpu_ptr(&x, cpu); + (*y)++; + put_cpu(); + +Note that these operations can only be used on percpu data that is +reserved for a specific processor. Without disabling preemption in the +surrounding code this_cpu_inc() will only guarantee that one of the +percpu counters is correctly incremented. However, there is no +guarantee that the OS will not move the process directly before or +after the this_cpu instruction is executed. In general this means that +the value of the individual counters for each processor are +meaningless. The sum of all the per cpu counters is the only value +that is of interest. + +Per cpu variables are used for performance reasons. Bouncing cache +lines can be avoided if multiple processors concurrently go through +the same code paths. Since each processor has its own per cpu +variables no concurrent cacheline updates take place. The price that +has to be paid for this optimization is the need to add up the per cpu +counters when the value of the counter is needed. + + +Special operations: +------------------- + + y = this_cpu_ptr(&x) + +Takes the offset of a per cpu variable (&x !) and returns the address +of the per cpu variable that belongs to the currently executing +processor. this_cpu_ptr avoids multiple steps that the common +get_cpu/put_cpu sequence requires. No processor number is +available. Instead the offset of the local per cpu area is simply +added to the percpu offset. + + + +Per cpu variables and offsets +----------------------------- + +Per cpu variables have *offsets* to the beginning of the percpu +area. They do not have addresses although they look like that in the +code. Offsets cannot be directly dereferenced. The offset must be +added to a base pointer of a percpu area of a processor in order to +form a valid address. + +Therefore the use of x or &x outside of the context of per cpu +operations is invalid and will generally be treated like a NULL +pointer dereference. + +In the context of per cpu operations + + x is a per cpu variable. Most this_cpu operations take a cpu + variable. + + &x is the *offset* a per cpu variable. this_cpu_ptr() takes + the offset of a per cpu variable which makes this look a bit + strange. + + + +Operations on a field of a per cpu structure +-------------------------------------------- + +Let's say we have a percpu structure + + struct s { + int n,m; + }; + + DEFINE_PER_CPU(struct s, p); + + +Operations on these fields are straightforward + + this_cpu_inc(p.m) + + z = this_cpu_cmpxchg(p.m, 0, 1); + + +If we have an offset to struct s: + + struct s __percpu *ps = &p; + + z = this_cpu_dec(ps->m); + + z = this_cpu_inc_return(ps->n); + + +The calculation of the pointer may require the use of this_cpu_ptr() +if we do not make use of this_cpu ops later to manipulate fields: + + struct s *pp; + + pp = this_cpu_ptr(&p); + + pp->m--; + + z = pp->n++; + + +Variants of this_cpu ops +------------------------- + +this_cpu ops are interrupt safe. Some architecture do not support +these per cpu local operations. In that case the operation must be +replaced by code that disables interrupts, then does the operations +that are guaranteed to be atomic and then reenable interrupts. Doing +so is expensive. If there are other reasons why the scheduler cannot +change the processor we are executing on then there is no reason to +disable interrupts. For that purpose the __this_cpu operations are +provided. For example. + + __this_cpu_inc(x); + +Will increment x and will not fallback to code that disables +interrupts on platforms that cannot accomplish atomicity through +address relocation and a Read-Modify-Write operation in the same +instruction. + + + +&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) +-------------------------------------------- + +The first operation takes the offset and forms an address and then +adds the offset of the n field. + +The second one first adds the two offsets and then does the +relocation. IMHO the second form looks cleaner and has an easier time +with (). The second form also is consistent with the way +this_cpu_read() and friends are used. + + +Christoph Lameter, April 3rd, 2013 diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 53d6a3c51d87..bfe8c29b1f1d 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -8,6 +8,7 @@ Copyright 2008 Red Hat Inc. Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, John Kacur, and David Teigland. Written for: 2.6.28-rc2 +Updated for: 3.10 Introduction ------------ @@ -17,13 +18,16 @@ designers of systems to find what is going on inside the kernel. It can be used for debugging or analyzing latencies and performance issues that take place outside of user-space. -Although ftrace is the function tracer, it also includes an -infrastructure that allows for other types of tracing. Some of -the tracers that are currently in ftrace include a tracer to -trace context switches, the time it takes for a high priority -task to run after it was woken up, the time interrupts are -disabled, and more (ftrace allows for tracer plugins, which -means that the list of tracers can always grow). +Although ftrace is typically considered the function tracer, it +is really a frame work of several assorted tracing utilities. +There's latency tracing to examine what occurs between interrupts +disabled and enabled, as well as for preemption and from a time +a task is woken to the task is actually scheduled in. + +One of the most common uses of ftrace is the event tracing. +Through out the kernel is hundreds of static event points that +can be enabled via the debugfs file system to see what is +going on in certain parts of the kernel. Implementation Details @@ -61,7 +65,7 @@ the extended "/sys/kernel/debug/tracing" path name. That's it! (assuming that you have ftrace configured into your kernel) -After mounting the debugfs, you can see a directory called +After mounting debugfs, you can see a directory called "tracing". This directory contains the control and output files of ftrace. Here is a list of some of the key files: @@ -84,7 +88,9 @@ of ftrace. Here is a list of some of the key files: This sets or displays whether writing to the trace ring buffer is enabled. Echo 0 into this file to disable - the tracer or 1 to enable it. + the tracer or 1 to enable it. Note, this only disables + writing to the ring buffer, the tracing overhead may + still be occurring. trace: @@ -109,7 +115,15 @@ of ftrace. Here is a list of some of the key files: This file lets the user control the amount of data that is displayed in one of the above output - files. + files. Options also exist to modify how a tracer + or events work (stack traces, timestamps, etc). + + options: + + This is a directory that has a file for every available + trace option (also in trace_options). Options may also be set + or cleared by writing a "1" or "0" respectively into the + corresponding file with the option name. tracing_max_latency: @@ -121,10 +135,17 @@ of ftrace. Here is a list of some of the key files: latency is greater than the value in this file. (in microseconds) + tracing_thresh: + + Some latency tracers will record a trace whenever the + latency is greater than the number in this file. + Only active when the file contains a number greater than 0. + (in microseconds) + buffer_size_kb: This sets or displays the number of kilobytes each CPU - buffer can hold. The tracer buffers are the same size + buffer holds. By default, the trace buffers are the same size for each CPU. The displayed number is the size of the CPU buffer and not total size of all buffers. The trace buffers are allocated in pages (blocks of memory @@ -133,16 +154,30 @@ of ftrace. Here is a list of some of the key files: than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size - due to buffer management overhead. ) + due to buffer management meta-data. ) - This can only be updated when the current_tracer - is set to "nop". + buffer_total_size_kb: + + This displays the total combined size of all the trace buffers. + + free_buffer: + + If a process is performing the tracing, and the ring buffer + should be shrunk "freed" when the process is finished, even + if it were to be killed by a signal, this file can be used + for that purpose. On close of this file, the ring buffer will + be resized to its minimum size. Having a process that is tracing + also open this file, when the process exits its file descriptor + for this file will be closed, and in doing so, the ring buffer + will be "freed". + + It may also stop tracing if disable_on_free option is set. tracing_cpumask: This is a mask that lets the user only trace - on specified CPUS. The format is a hex string - representing the CPUS. + on specified CPUs. The format is a hex string + representing the CPUs. set_ftrace_filter: @@ -183,6 +218,261 @@ of ftrace. Here is a list of some of the key files: "set_ftrace_notrace". (See the section "dynamic ftrace" below for more details.) + enabled_functions: + + This file is more for debugging ftrace, but can also be useful + in seeing if any function has a callback attached to it. + Not only does the trace infrastructure use ftrace function + trace utility, but other subsystems might too. This file + displays all functions that have a callback attached to them + as well as the number of callbacks that have been attached. + Note, a callback may also call multiple functions which will + not be listed in this count. + + If the callback registered to be traced by a function with + the "save regs" attribute (thus even more overhead), a 'R' + will be displayed on the same line as the function that + is returning registers. + + function_profile_enabled: + + When set it will enable all functions with either the function + tracer, or if enabled, the function graph tracer. It will + keep a histogram of the number of functions that were called + and if run with the function graph tracer, it will also keep + track of the time spent in those functions. The histogram + content can be displayed in the files: + + trace_stats/function<cpu> ( function0, function1, etc). + + trace_stats: + + A directory that holds different tracing stats. + + kprobe_events: + + Enable dynamic trace points. See kprobetrace.txt. + + kprobe_profile: + + Dynamic trace points stats. See kprobetrace.txt. + + max_graph_depth: + + Used with the function graph tracer. This is the max depth + it will trace into a function. Setting this to a value of + one will show only the first kernel function that is called + from user space. + + printk_formats: + + This is for tools that read the raw format files. If an event in + the ring buffer references a string (currently only trace_printk() + does this), only a pointer to the string is recorded into the buffer + and not the string itself. This prevents tools from knowing what + that string was. This file displays the string and address for + the string allowing tools to map the pointers to what the + strings were. + + saved_cmdlines: + + Only the pid of the task is recorded in a trace event unless + the event specifically saves the task comm as well. Ftrace + makes a cache of pid mappings to comms to try to display + comms for events. If a pid for a comm is not listed, then + "<...>" is displayed in the output. + + snapshot: + + This displays the "snapshot" buffer and also lets the user + take a snapshot of the current running trace. + See the "Snapshot" section below for more details. + + stack_max_size: + + When the stack tracer is activated, this will display the + maximum stack size it has encountered. + See the "Stack Trace" section below. + + stack_trace: + + This displays the stack back trace of the largest stack + that was encountered when the stack tracer is activated. + See the "Stack Trace" section below. + + stack_trace_filter: + + This is similar to "set_ftrace_filter" but it limits what + functions the stack tracer will check. + + trace_clock: + + Whenever an event is recorded into the ring buffer, a + "timestamp" is added. This stamp comes from a specified + clock. By default, ftrace uses the "local" clock. This + clock is very fast and strictly per cpu, but on some + systems it may not be monotonic with respect to other + CPUs. In other words, the local clocks may not be in sync + with local clocks on other CPUs. + + Usual clocks for tracing: + + # cat trace_clock + [local] global counter x86-tsc + + local: Default clock, but may not be in sync across CPUs + + global: This clock is in sync with all CPUs but may + be a bit slower than the local clock. + + counter: This is not a clock at all, but literally an atomic + counter. It counts up one by one, but is in sync + with all CPUs. This is useful when you need to + know exactly the order events occurred with respect to + each other on different CPUs. + + uptime: This uses the jiffies counter and the time stamp + is relative to the time since boot up. + + perf: This makes ftrace use the same clock that perf uses. + Eventually perf will be able to read ftrace buffers + and this will help out in interleaving the data. + + x86-tsc: Architectures may define their own clocks. For + example, x86 uses its own TSC cycle clock here. + + To set a clock, simply echo the clock name into this file. + + echo global > trace_clock + + trace_marker: + + This is a very useful file for synchronizing user space + with events happening in the kernel. Writing strings into + this file will be written into the ftrace buffer. + + It is useful in applications to open this file at the start + of the application and just reference the file descriptor + for the file. + + void trace_write(const char *fmt, ...) + { + va_list ap; + char buf[256]; + int n; + + if (trace_fd < 0) + return; + + va_start(ap, fmt); + n = vsnprintf(buf, 256, fmt, ap); + va_end(ap); + + write(trace_fd, buf, n); + } + + start: + + trace_fd = open("trace_marker", WR_ONLY); + + uprobe_events: + + Add dynamic tracepoints in programs. + See uprobetracer.txt + + uprobe_profile: + + Uprobe statistics. See uprobetrace.txt + + instances: + + This is a way to make multiple trace buffers where different + events can be recorded in different buffers. + See "Instances" section below. + + events: + + This is the trace event directory. It holds event tracepoints + (also known as static tracepoints) that have been compiled + into the kernel. It shows what event tracepoints exist + and how they are grouped by system. There are "enable" + files at various levels that can enable the tracepoints + when a "1" is written to them. + + See events.txt for more information. + + per_cpu: + + This is a directory that contains the trace per_cpu information. + + per_cpu/cpu0/buffer_size_kb: + + The ftrace buffer is defined per_cpu. That is, there's a separate + buffer for each CPU to allow writes to be done atomically, + and free from cache bouncing. These buffers may have different + size buffers. This file is similar to the buffer_size_kb + file, but it only displays or sets the buffer size for the + specific CPU. (here cpu0). + + per_cpu/cpu0/trace: + + This is similar to the "trace" file, but it will only display + the data specific for the CPU. If written to, it only clears + the specific CPU buffer. + + per_cpu/cpu0/trace_pipe + + This is similar to the "trace_pipe" file, and is a consuming + read, but it will only display (and consume) the data specific + for the CPU. + + per_cpu/cpu0/trace_pipe_raw + + For tools that can parse the ftrace ring buffer binary format, + the trace_pipe_raw file can be used to extract the data + from the ring buffer directly. With the use of the splice() + system call, the buffer data can be quickly transferred to + a file or to the network where a server is collecting the + data. + + Like trace_pipe, this is a consuming reader, where multiple + reads will always produce different data. + + per_cpu/cpu0/snapshot: + + This is similar to the main "snapshot" file, but will only + snapshot the current CPU (if supported). It only displays + the content of the snapshot for a given CPU, and if + written to, only clears this CPU buffer. + + per_cpu/cpu0/snapshot_raw: + + Similar to the trace_pipe_raw, but will read the binary format + from the snapshot buffer for the given CPU. + + per_cpu/cpu0/stats: + + This displays certain stats about the ring buffer: + + entries: The number of events that are still in the buffer. + + overrun: The number of lost events due to overwriting when + the buffer was full. + + commit overrun: Should always be zero. + This gets set if so many events happened within a nested + event (ring buffer is re-entrant), that it fills the + buffer and starts dropping events. + + bytes: Bytes actually read (not overwritten). + + oldest event ts: The oldest timestamp in the buffer + + now ts: The current timestamp + + dropped events: Events lost due to overwrite option being off. + + read events: The number of events read. The Tracers ----------- @@ -234,11 +524,6 @@ Here is the list of current tracers that may be configured. RT tasks (as the current "wakeup" does). This is useful for those interested in wake up timings of RT tasks. - "hw-branch-tracer" - - Uses the BTS CPU feature on x86 CPUs to traces all - branches executed. - "nop" This is the "trace nothing" tracer. To remove all @@ -261,70 +546,100 @@ Here is an example of the output format of the file "trace" -------- # tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4251 [01] 10152.583854: path_put <-path_walk - bash-4251 [01] 10152.583855: dput <-path_put - bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput +# entries-in-buffer/entries-written: 140080/250280 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath + bash-1977 [000] .... 17284.993653: __close_fd <-sys_close + bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd + sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify + bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock + bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd + bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock + bash-1977 [000] .... 17284.993657: filp_close <-__close_fd + bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close + sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath -------- A header is printed with the tracer name that is represented by -the trace. In this case the tracer is "function". Then a header -showing the format. Task name "bash", the task PID "4251", the -CPU that it was running on "01", the timestamp in <secs>.<usecs> -format, the function name that was traced "path_put" and the -parent function that called this function "path_walk". The -timestamp is the time at which the function was entered. +the trace. In this case the tracer is "function". Then it shows the +number of events in the buffer as well as the total number of entries +that were written. The difference is the number of entries that were +lost due to the buffer filling up (250280 - 140080 = 110200 events +lost). + +The header explains the content of the events. Task name "bash", the task +PID "1977", the CPU that it was running on "000", the latency format +(explained below), the timestamp in <secs>.<usecs> format, the +function name that was traced "sys_close" and the parent function that +called this function "system_call_fastpath". The timestamp is the time +at which the function was entered. Latency trace format -------------------- -When the latency-format option is enabled, the trace file gives -somewhat more information to see why a latency happened. -Here is a typical trace. +When the latency-format option is enabled or when one of the latency +tracers is set, the trace file gives somewhat more information to see +why a latency happened. Here is a typical trace. # tracer: irqsoff # -irqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: apic_timer_interrupt - => ended at: do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) - <idle>-0 0d.s. 97us : __do_softirq (do_softirq) - <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: __lock_task_sighand +# => ended at: _raw_spin_unlock_irqrestore +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand + ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore + ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore + ps-6143 2d..1 306us : <stack trace> + => trace_hardirqs_on_caller + => trace_hardirqs_on + => _raw_spin_unlock_irqrestore + => do_task_stat + => proc_tgid_stat + => proc_single_show + => seq_read + => vfs_read + => sys_read + => system_call_fastpath This shows that the current tracer is "irqsoff" tracing the time -for which interrupts were disabled. It gives the trace version -and the version of the kernel upon which this was executed on -(2.6.26-rc8). Then it displays the max latency in microsecs (97 -us). The number of trace entries displayed and the total number -recorded (both are three: #3/3). The type of preemption that was -used (PREEMPT). VP, KP, SP, and HP are always zero and are -reserved for later use. #P is the number of online CPUS (#P:2). +for which interrupts were disabled. It gives the trace version (which +never changes) and the version of the kernel upon which this was executed on +(3.10). Then it displays the max latency in microseconds (259 us). The number +of trace entries displayed and the total number (both are four: #4/4). +VP, KP, SP, and HP are always zero and are reserved for later use. +#P is the number of online CPUs (#P:4). The task is the process that was running when the latency -occurred. (swapper pid: 0). +occurred. (ps pid: 6143). The start and stop (the functions in which the interrupts were disabled and enabled respectively) that caused the latencies: - apic_timer_interrupt is where the interrupts were disabled. - do_softirq is where they were enabled again. + __lock_task_sighand is where the interrupts were disabled. + _raw_spin_unlock_irqrestore is where they were enabled again. The next lines after the header are the trace itself. The header explains which is which. @@ -367,16 +682,43 @@ The above is mostly meaningful for kernel developers. The rest is the same as the 'trace' file. + Note, the latency tracers will usually end with a back trace + to easily find where the latency occurred. trace_options ------------- -The trace_options file is used to control what gets printed in -the trace output. To see what is available, simply cat the file: +The trace_options file (or the options directory) is used to control +what gets printed in the trace output, or manipulate the tracers. +To see what is available, simply cat the file: cat trace_options - print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ - noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj +print-parent +nosym-offset +nosym-addr +noverbose +noraw +nohex +nobin +noblock +nostacktrace +trace_printk +noftrace_preempt +nobranch +annotate +nouserstacktrace +nosym-userobj +noprintk-msg-only +context-info +latency-format +sleep-time +graph-time +record-cmd +overwrite +nodisable_on_free +irq-info +markers +function-trace To disable one of the options, echo in the option prepended with "no". @@ -428,13 +770,34 @@ Here are the available options: bin - This will print out the formats in raw binary. - block - TBD (needs update) + block - When set, reading trace_pipe will not block when polled. stacktrace - This is one of the options that changes the trace itself. When a trace is recorded, so is the stack of functions. This allows for back traces of trace sites. + trace_printk - Can disable trace_printk() from writing into the buffer. + + branch - Enable branch tracing with the tracer. + + annotate - It is sometimes confusing when the CPU buffers are full + and one CPU buffer had a lot of events recently, thus + a shorter time frame, were another CPU may have only had + a few events, which lets it have older events. When + the trace is reported, it shows the oldest events first, + and it may look like only one CPU ran (the one with the + oldest events). When the annotate option is set, it will + display when a new CPU buffer started: + + <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on + <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on + <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore +##### CPU 2 buffer started #### + <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle + <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog + <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock + userstacktrace - This option changes the trace. It records a stacktrace of the current userspace thread. @@ -451,9 +814,13 @@ Here are the available options: a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] - sched-tree - trace all tasks that are on the runqueue, at - every scheduling event. Will add overhead if - there's a lot of tasks running at once. + + printk-msg-only - When set, trace_printk()s will only show the format + and not their parameters (if trace_bprintk() or + trace_bputs() was used to save the trace_printk()). + + context-info - Show only the event data. Hides the comm, PID, + timestamp, CPU, and other useful data. latency-format - This option changes the trace. When it is enabled, the trace displays @@ -461,31 +828,61 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] latencies, as described in "Latency trace format". + sleep-time - When running function graph tracer, to include + the time a task schedules out in its function. + When enabled, it will account time the task has been + scheduled out as part of the function call. + + graph-time - When running function graph tracer, to include the + time to call nested functions. When this is not set, + the time reported for the function will only include + the time the function itself executed for, not the time + for functions that it called. + + record-cmd - When any event or tracer is enabled, a hook is enabled + in the sched_switch trace point to fill comm cache + with mapped pids and comms. But this may cause some + overhead, and if you only care about pids, and not the + name of the task, disabling this option can lower the + impact of tracing. + overwrite - This controls what happens when the trace buffer is full. If "1" (default), the oldest events are discarded and overwritten. If "0", then the newest events are discarded. + (see per_cpu/cpu0/stats for overrun and dropped) -ftrace_enabled --------------- + disable_on_free - When the free_buffer is closed, tracing will + stop (tracing_on set to 0). -The following tracers (listed below) give different output -depending on whether or not the sysctl ftrace_enabled is set. To -set ftrace_enabled, one can either use the sysctl function or -set it via the proc file system interface. + irq-info - Shows the interrupt, preempt count, need resched data. + When disabled, the trace looks like: - sysctl kernel.ftrace_enabled=1 +# tracer: function +# +# entries-in-buffer/entries-written: 144405/9452052 #P:4 +# +# TASK-PID CPU# TIMESTAMP FUNCTION +# | | | | | + <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up + <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89 + <idle>-0 [002] 23636.756055: enqueue_task <-activate_task - or - echo 1 > /proc/sys/kernel/ftrace_enabled + markers - When set, the trace_marker is writable (only by root). + When disabled, the trace_marker will error with EINVAL + on write. + + + function-trace - The latency tracers will enable function tracing + if this option is enabled (default it is). When + it is disabled, the latency tracers do not trace + functions. This keeps the overhead of the tracer down + when performing latency tests. -To disable ftrace_enabled simply replace the '1' with '0' in the -above commands. + Note: Some tracers have their own options. They only appear + when the tracer is active. -When ftrace_enabled is set the tracers will also record the -functions that are within the trace. The descriptions of the -tracers will also show an example with ftrace enabled. irqsoff @@ -506,95 +903,133 @@ new trace is saved. To reset the maximum, echo 0 into tracing_max_latency. Here is an example: + # echo 0 > options/function-trace # echo irqsoff > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency # echo 1 > tracing_on + # echo 0 > tracing_max_latency # ls -ltr [...] # echo 0 > tracing_on # cat trace # tracer: irqsoff # -irqsoff latency trace v1.1.5 on 2.6.26 --------------------------------------------------------------------- - latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: sys_setpgid - => ended at: sys_setpgid - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - bash-3730 1d... 0us : _write_lock_irq (sys_setpgid) - bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid) - bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid) - - -Here we see that that we had a latency of 12 microsecs (which is -very good). The _write_lock_irq in sys_setpgid disabled -interrupts. The difference between the 12 and the displayed -timestamp 14us occurred because the clock was incremented +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: run_timer_softirq +# => ended at: run_timer_softirq +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq + <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq + <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq + <idle>-0 0dNs3 25us : <stack trace> + => _raw_spin_unlock_irq + => run_timer_softirq + => __do_softirq + => call_softirq + => do_softirq + => irq_exit + => smp_apic_timer_interrupt + => apic_timer_interrupt + => rcu_idle_exit + => cpu_idle + => rest_init + => start_kernel + => x86_64_start_reservations + => x86_64_start_kernel + +Here we see that that we had a latency of 16 microseconds (which is +very good). The _raw_spin_lock_irq in run_timer_softirq disabled +interrupts. The difference between the 16 and the displayed +timestamp 25us occurred because the clock was incremented between the time of recording the max latency and the time of recording the function that had that latency. -Note the above example had ftrace_enabled not set. If we set the -ftrace_enabled, we get a much larger output: +Note the above example had function-trace not set. If we set +function-trace, we get a much larger output: + + with echo 1 > options/function-trace # tracer: irqsoff # -irqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: __alloc_pages_internal - => ended at: __alloc_pages_internal - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) - ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) - ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) - ls-4339 0d..1 4us : add_preempt_count (_spin_lock) - ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue) +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: ata_scsi_queuecmd +# => ended at: ata_scsi_queuecmd +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd + bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave + bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd + bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev + bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev + bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd + bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd + bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd + bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat [...] - ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) - ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) - ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) - ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) - - - -Here we traced a 50 microsecond latency. But we also see all the + bash-2042 3d..1 67us : delay_tsc <-__delay + bash-2042 3d..1 67us : add_preempt_count <-delay_tsc + bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc + bash-2042 3d..1 67us : add_preempt_count <-delay_tsc + bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc + bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue + bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd + bash-2042 3d..1 120us : <stack trace> + => _raw_spin_unlock_irqrestore + => ata_scsi_queuecmd + => scsi_dispatch_cmd + => scsi_request_fn + => __blk_run_queue_uncond + => __blk_run_queue + => blk_queue_bio + => generic_make_request + => submit_bio + => submit_bh + => __ext3_get_inode_loc + => ext3_iget + => ext3_lookup + => lookup_real + => __lookup_hash + => walk_component + => lookup_last + => path_lookupat + => filename_lookup + => user_path_at_empty + => user_path_at + => vfs_fstatat + => vfs_stat + => sys_newstat + => system_call_fastpath + + +Here we traced a 71 microsecond latency. But we also see all the functions that were called during that time. Note that by enabling function tracing, we incur an added overhead. This overhead may extend the latency times. But nevertheless, this @@ -614,120 +1049,122 @@ Like the irqsoff tracer, it records the maximum latency for which preemption was disabled. The control of preemptoff tracer is much like the irqsoff tracer. + # echo 0 > options/function-trace # echo preemptoff > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency # echo 1 > tracing_on + # echo 0 > tracing_max_latency # ls -ltr [...] # echo 0 > tracing_on # cat trace # tracer: preemptoff # -preemptoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: do_IRQ - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) - sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) +# preemptoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: do_IRQ +# => ended at: do_IRQ +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ + sshd-1991 1d..1 46us : irq_exit <-do_IRQ + sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ + sshd-1991 1d..1 52us : <stack trace> + => sub_preempt_count + => irq_exit + => do_IRQ + => ret_from_intr This has some more changes. Preemption was disabled when an -interrupt came in (notice the 'h'), and was enabled while doing -a softirq. (notice the 's'). But we also see that interrupts -have been disabled when entering the preempt off section and -leaving it (the 'd'). We do not know if interrupts were enabled -in the mean time. +interrupt came in (notice the 'h'), and was enabled on exit. +But we also see that interrupts have been disabled when entering +the preempt off section and leaving it (the 'd'). We do not know if +interrupts were enabled in the mean time or shortly after this +was over. # tracer: preemptoff # -preemptoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: remove_wait_queue - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) - sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) - sshd-4261 0d..1 2us : do_IRQ (common_interrupt) - sshd-4261 0d..1 2us : irq_enter (do_IRQ) - sshd-4261 0d..1 2us : idle_cpu (irq_enter) - sshd-4261 0d..1 3us : add_preempt_count (irq_enter) - sshd-4261 0d.h1 3us : idle_cpu (irq_enter) - sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ) +# preemptoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: wake_up_new_task +# => ended at: task_rq_unlock +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task + bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq + bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair + bash-1994 1d..1 1us : source_load <-select_task_rq_fair + bash-1994 1d..1 1us : source_load <-select_task_rq_fair [...] - sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) - sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) - sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) - sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) - sshd-4261 0d.h1 14us : irq_exit (do_IRQ) - sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) - sshd-4261 0d..2 15us : do_softirq (irq_exit) - sshd-4261 0d... 15us : __do_softirq (do_softirq) - sshd-4261 0d... 16us : __local_bh_disable (__do_softirq) - sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable) + bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt + bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter + bash-1994 1d..1 13us : add_preempt_count <-irq_enter + bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt + bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt + bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt + bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock + bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt [...] - sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) + bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event + bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt + bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit + bash-1994 1d..2 36us : do_softirq <-irq_exit + bash-1994 1d..2 36us : __do_softirq <-call_softirq + bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq + bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq + bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq + bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock + bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq [...] - sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) + bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks + bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq + bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable + bash-1994 1dN.2 82us : idle_cpu <-irq_exit + bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit + bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit + bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock + bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock + bash-1994 1.N.1 104us : <stack trace> + => sub_preempt_count + => _raw_spin_unlock_irqrestore + => task_rq_unlock + => wake_up_new_task + => do_fork + => sys_clone + => stub_clone The above is an example of the preemptoff trace with -ftrace_enabled set. Here we see that interrupts were disabled +function-trace set. Here we see that interrupts were not disabled the entire time. The irq_enter code lets us know that we entered an interrupt 'h'. Before that, the functions being traced still show that it is not in an interrupt, but we can see from the functions themselves that this is not the case. -Notice that __do_softirq when called does not have a -preempt_count. It may seem that we missed a preempt enabling. -What really happened is that the preempt count is held on the -thread's stack and we switched to the softirq stack (4K stacks -in effect). The code does not copy the preempt count, but -because interrupts are disabled, we do not need to worry about -it. Having a tracer like this is good for letting people know -what really happens inside the kernel. - - preemptirqsoff -------------- @@ -762,38 +1199,57 @@ tracer. Again, using this trace is much like the irqsoff and preemptoff tracers. + # echo 0 > options/function-trace # echo preemptirqsoff > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency # echo 1 > tracing_on + # echo 0 > tracing_max_latency # ls -ltr [...] # echo 0 > tracing_on # cat trace # tracer: preemptirqsoff # -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: apic_timer_interrupt - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) - ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) - ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) - +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: ata_scsi_queuecmd +# => ended at: ata_scsi_queuecmd +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd + ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd + ls-2230 3...1 111us : <stack trace> + => sub_preempt_count + => _raw_spin_unlock_irqrestore + => ata_scsi_queuecmd + => scsi_dispatch_cmd + => scsi_request_fn + => __blk_run_queue_uncond + => __blk_run_queue + => blk_queue_bio + => generic_make_request + => submit_bio + => submit_bh + => ext3_bread + => ext3_dir_bread + => htree_dirblock_to_tree + => ext3_htree_fill_tree + => ext3_readdir + => vfs_readdir + => sys_getdents + => system_call_fastpath The trace_hardirqs_off_thunk is called from assembly on x86 when @@ -802,105 +1258,158 @@ function tracing, we do not know if interrupts were enabled within the preemption points. We do see that it started with preemption enabled. -Here is a trace with ftrace_enabled set: - +Here is a trace with function-trace set: # tracer: preemptirqsoff # -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: write_chan - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4473 0.N.. 0us : preempt_schedule (write_chan) - ls-4473 0dN.1 1us : _spin_lock (schedule) - ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) - ls-4473 0d..2 2us : put_prev_task_fair (schedule) -[...] - ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) - ls-4473 0d..2 13us : __switch_to (schedule) - sshd-4261 0d..2 14us : finish_task_switch (schedule) - sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) - sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) - sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) - sshd-4261 0d..2 16us : do_IRQ (common_interrupt) - sshd-4261 0d..2 17us : irq_enter (do_IRQ) - sshd-4261 0d..2 17us : idle_cpu (irq_enter) - sshd-4261 0d..2 18us : add_preempt_count (irq_enter) - sshd-4261 0d.h2 18us : idle_cpu (irq_enter) - sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) - sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) - sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) - sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) -[...] - sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) - sshd-4261 0d.h2 29us : irq_exit (do_IRQ) - sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) - sshd-4261 0d..3 30us : do_softirq (irq_exit) - sshd-4261 0d... 30us : __do_softirq (do_softirq) - sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) - sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: schedule +# => ended at: mutex_unlock +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / +kworker/-59 3...1 0us : __schedule <-schedule +kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch +kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq +kworker/-59 3d..2 1us : deactivate_task <-__schedule +kworker/-59 3d..2 1us : dequeue_task <-deactivate_task +kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task +kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task +kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair +kworker/-59 3d..2 2us : update_min_vruntime <-update_curr +kworker/-59 3d..2 3us : cpuacct_charge <-update_curr +kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge +kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge +kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair +kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair +kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair +kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair +kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair +kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair +kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule +kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping +kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule +kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task +kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair +kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair +kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity + ls-2269 3d..2 7us : finish_task_switch <-__schedule + ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch + ls-2269 3d..2 8us : do_IRQ <-ret_from_intr + ls-2269 3d..2 8us : irq_enter <-do_IRQ + ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter + ls-2269 3d..2 9us : add_preempt_count <-irq_enter + ls-2269 3d.h2 9us : exit_idle <-do_IRQ [...] - sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) - sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) - sshd-4261 0d.s3 45us : idle_cpu (irq_enter) - sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) - sshd-4261 0d.H3 46us : idle_cpu (irq_enter) - sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) - sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) + ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock + ls-2269 3d.h2 20us : irq_exit <-do_IRQ + ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit + ls-2269 3d..3 21us : do_softirq <-irq_exit + ls-2269 3d..3 21us : __do_softirq <-call_softirq + ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq + ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip + ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip + ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr + ls-2269 3d.s5 31us : irq_enter <-do_IRQ + ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter [...] - sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) - sshd-4261 0d.H3 82us : ktime_get (tick_program_event) - sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) - sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) - sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) - sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) - sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event) - sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt) - sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit) - sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit) - sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable) + ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter + ls-2269 3d.s5 32us : add_preempt_count <-irq_enter + ls-2269 3d.H5 32us : exit_idle <-do_IRQ + ls-2269 3d.H5 32us : handle_irq <-do_IRQ + ls-2269 3d.H5 32us : irq_to_desc <-handle_irq + ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq [...] - sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) - sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) - sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) - sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) - sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) - - -This is a very interesting trace. It started with the preemption -of the ls task. We see that the task had the "need_resched" bit -set via the 'N' in the trace. Interrupts were disabled before -the spin_lock at the beginning of the trace. We see that a -schedule took place to run sshd. When the interrupts were -enabled, we took an interrupt. On return from the interrupt -handler, the softirq ran. We took another interrupt while -running the softirq as we see from the capital 'H'. + ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll + ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action + ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq + ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable + ls-2269 3d..3 159us : idle_cpu <-irq_exit + ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit + ls-2269 3d..3 160us : sub_preempt_count <-irq_exit + ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock + ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock + ls-2269 3d... 186us : <stack trace> + => __mutex_unlock_slowpath + => mutex_unlock + => process_output + => n_tty_write + => tty_write + => vfs_write + => sys_write + => system_call_fastpath + +This is an interesting trace. It started with kworker running and +scheduling out and ls taking over. But as soon as ls released the +rq lock and enabled interrupts (but not preemption) an interrupt +triggered. When the interrupt finished, it started running softirqs. +But while the softirq was running, another interrupt triggered. +When an interrupt is running inside a softirq, the annotation is 'H'. wakeup ------ +One common case that people are interested in tracing is the +time it takes for a task that is woken to actually wake up. +Now for non Real-Time tasks, this can be arbitrary. But tracing +it none the less can be interesting. + +Without function tracing: + + # echo 0 > options/function-trace + # echo wakeup > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup +# +# wakeup latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H + <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d..3 15us : __schedule <-schedule + <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H + +The tracer only traces the highest priority task in the system +to avoid tracing the normal circumstances. Here we see that +the kworker with a nice priority of -20 (not very nice), took +just 15 microseconds from the time it woke up, to the time it +ran. + +Non Real-Time tasks are not that interesting. A more interesting +trace is to concentrate only on Real-Time tasks. + +wakeup_rt +--------- + In a Real-Time environment it is very important to know the wakeup time it takes for the highest priority task that is woken up to the time that it executes. This is also known as "schedule @@ -914,124 +1423,229 @@ Real-Time environments are interested in the worst case latency. That is the longest latency it takes for something to happen, and not the average. We can have a very fast scheduler that may only have a large latency once in a while, but that would not -work well with Real-Time tasks. The wakeup tracer was designed +work well with Real-Time tasks. The wakeup_rt tracer was designed to record the worst case wakeups of RT tasks. Non-RT tasks are not recorded because the tracer only records one worst case and tracing non-RT tasks that are unpredictable will overwrite the -worst case latency of RT tasks. +worst case latency of RT tasks (just run the normal wakeup +tracer for a while to see that effect). Since this tracer only deals with RT tasks, we will run this slightly differently than we did with the previous tracers. Instead of performing an 'ls', we will run 'sleep 1' under 'chrt' which changes the priority of the task. - # echo wakeup > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency + # echo 0 > options/function-trace + # echo wakeup_rt > current_tracer # echo 1 > tracing_on + # echo 0 > tracing_max_latency # chrt -f 5 sleep 1 # echo 0 > tracing_on # cat trace # tracer: wakeup # -wakeup latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) - ----------------- - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) - <idle>-0 1d..4 4us : schedule (cpu_idle) - - -Running this on an idle system, we see that it only took 4 -microseconds to perform the task switch. Note, since the trace -marker in the schedule is before the actual "switch", we stop -the tracing when the recorded task is about to schedule in. This -may change if we add a new marker at the end of the scheduler. - -Notice that the recorded task is 'sleep' with the PID of 4901 +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep + <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d..3 5us : __schedule <-schedule + <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep + + +Running this on an idle system, we see that it only took 5 microseconds +to perform the task switch. Note, since the trace point in the schedule +is before the actual "switch", we stop the tracing when the recorded task +is about to schedule in. This may change if we add a new marker at the +end of the scheduler. + +Notice that the recorded task is 'sleep' with the PID of 2389 and it has an rt_prio of 5. This priority is user-space priority and not the internal kernel priority. The policy is 1 for SCHED_FIFO and 2 for SCHED_RR. -Doing the same with chrt -r 5 and ftrace_enabled set. +Note, that the trace data shows the internal priority (99 - rtprio). -# tracer: wakeup + <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep + +The 0:120:R means idle was running with a nice priority of 0 (120 - 20) +and in the running state 'R'. The sleep task was scheduled in with +2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94) +and it too is in the running state. + +Doing the same with chrt -r 5 and function-trace set. + + echo 1 > options/function-trace + +# tracer: wakeup_rt # -wakeup latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) - ----------------- - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / -ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) -ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) -ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) -ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) -ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) -ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) -ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) -ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) -[...] -ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) -ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) -ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) -ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) -[...] -ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) -ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) -ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) -ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) -ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) -ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) -ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) -ksoftirq-7 1.N.2 33us : schedule (__cond_resched) -ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) -ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) -ksoftirq-7 1dN.3 35us : _spin_lock (schedule) -ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) -ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) -ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) -[...] -ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) -ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) -ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) -ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) -ksoftirq-7 1d..4 50us : schedule (__cond_resched) - -The interrupt went off while running ksoftirqd. This task runs -at SCHED_OTHER. Why did not we see the 'N' set early? This may -be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K -stacks configured, the interrupt and softirq run with their own -stack. Some information is held on the top of the task's stack -(need_resched and preempt_count are both stored there). The -setting of the NEED_RESCHED bit is done directly to the task's -stack, but the reading of the NEED_RESCHED is done by looking at -the current stack, which in this case is the stack for the hard -interrupt. This hides the fact that NEED_RESCHED has been set. -We do not see the 'N' until we switch back to the task's -assigned stack. +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep + <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup + <idle>-0 3d.h3 3us : resched_task <-check_preempt_curr + <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup + <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up + <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up + <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up + <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer + <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock + <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt + <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt + <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event + <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event + <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event + <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt + <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit + <idle>-0 3dN.2 9us : idle_cpu <-irq_exit + <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit + <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit + <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit + <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle + <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit + <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle + <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit + <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz + <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock + <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load + <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel + <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel + <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 + <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave + <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16 + <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer + <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram + <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event + <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event + <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event + <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel + <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit + <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward + <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward + <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 + <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns + <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns + <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 + <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave + <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns + <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns + <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns + <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event + <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event + <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event + <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns + <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit + <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks + <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle + <idle>-0 3.N.. 25us : schedule <-cpu_idle + <idle>-0 3.N.. 25us : __schedule <-preempt_schedule + <idle>-0 3.N.. 26us : add_preempt_count <-__schedule + <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule + <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch + <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch + <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule + <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq + <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule + <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task + <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task + <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt + <idle>-0 3d..3 29us : __schedule <-preempt_schedule + <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep + +This isn't that big of a trace, even with function tracing enabled, +so I included the entire trace. + +The interrupt went off while when the system was idle. Somewhere +before task_woken_rt() was called, the NEED_RESCHED flag was set, +this is indicated by the first occurrence of the 'N' flag. + +Latency tracing and events +-------------------------- +As function tracing can induce a much larger latency, but without +seeing what happens within the latency it is hard to know what +caused it. There is a middle ground, and that is with enabling +events. + + # echo 0 > options/function-trace + # echo wakeup_rt > current_tracer + # echo 1 > events/enable + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep + <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002 + <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8 + <idle>-0 2.N.2 2us : power_end: cpu_id=2 + <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2 + <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0 + <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000 + <idle>-0 2.N.2 5us : rcu_utilization: Start context switch + <idle>-0 2.N.2 5us : rcu_utilization: End context switch + <idle>-0 2d..3 6us : __schedule <-schedule + <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep + function -------- @@ -1039,6 +1653,7 @@ function This tracer is the function tracer. Enabling the function tracer can be done from the debug file system. Make sure the ftrace_enabled is set; otherwise this tracer is a nop. +See the "ftrace_enabled" section below. # sysctl kernel.ftrace_enabled=1 # echo function > current_tracer @@ -1048,23 +1663,23 @@ ftrace_enabled is set; otherwise this tracer is a nop. # cat trace # tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4003 [00] 123.638713: finish_task_switch <-schedule - bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch - bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq - bash-4003 [00] 123.638715: hrtick_set <-schedule - bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set - bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave - bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set - bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore - bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set - bash-4003 [00] 123.638718: sub_preempt_count <-schedule - bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule - bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run - bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion - bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common - bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq +# entries-in-buffer/entries-written: 24799/24799 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write + bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock + bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify + bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify + bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify + bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock + bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock + bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify [...] @@ -1214,79 +1829,19 @@ int main (int argc, char **argv) return 0; } +Or this simple script! -hw-branch-tracer (x86 only) ---------------------------- - -This tracer uses the x86 last branch tracing hardware feature to -collect a branch trace on all cpus with relatively low overhead. - -The tracer uses a fixed-size circular buffer per cpu and only -traces ring 0 branches. The trace file dumps that buffer in the -following format: - -# tracer: hw-branch-tracer -# -# CPU# TO <- FROM - 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 - 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a - 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf - 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf - 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a - 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf - - -The tracer may be used to dump the trace for the oops'ing cpu on -a kernel oops into the system log. To enable this, -ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one -can either use the sysctl function or set it via the proc system -interface. - - sysctl kernel.ftrace_dump_on_oops=n - -or - - echo n > /proc/sys/kernel/ftrace_dump_on_oops - -If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will -only dump the buffer of the CPU that triggered the oops. - -Here's an example of such a dump after a null pointer -dereference in a kernel module: - -[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 -[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] -[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 -[57848.106019] Oops: 0002 [#1] SMP -[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus -[57848.106019] Dumping ftrace buffer: -[57848.106019] --------------------------------- -[...] -[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 -[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 -[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 -[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 -[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 -[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] -[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 -[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b -[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 -[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 -[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 -[...] -[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 -[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 -[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 -[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 -[...] -[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 -[57848.106019] --------------------------------- -[57848.106019] CPU 0 -[57848.106019] Modules linked in: oops -[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 -[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] -[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 -[...] +------ +#!/bin/bash + +debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts` +echo nop > $debugfs/tracing/current_tracer +echo 0 > $debugfs/tracing/tracing_on +echo $$ > $debugfs/tracing/set_ftrace_pid +echo function > $debugfs/tracing/current_tracer +echo 1 > $debugfs/tracing/tracing_on +exec "$@" +------ function graph tracer @@ -1473,16 +2028,18 @@ starts of pointing to a simple return. (Enabling FTRACE will include the -pg switch in the compiling of the kernel.) At compile time every C file object is run through the -recordmcount.pl script (located in the scripts directory). This -script will process the C object using objdump to find all the -locations in the .text section that call mcount. (Note, only the -.text section is processed, since processing other sections like -.init.text may cause races due to those sections being freed). +recordmcount program (located in the scripts directory). This +program will parse the ELF headers in the C object to find all +the locations in the .text section that call mcount. (Note, only +white listed .text sections are processed, since processing other +sections like .init.text may cause races due to those sections +being freed unexpectedly). A new section called "__mcount_loc" is created that holds references to all the mcount call sites in the .text section. -This section is compiled back into the original object. The -final linker will add all these references into a single table. +The recordmcount program re-links this section back into the +original object. The final linking stage of the kernel will add all these +references into a single table. On boot up, before SMP is initialized, the dynamic ftrace code scans this table and updates all the locations into nops. It @@ -1493,13 +2050,25 @@ unloaded, it also removes its functions from the ftrace function list. This is automatic in the module unload code, and the module author does not need to worry about it. -When tracing is enabled, kstop_machine is called to prevent -races with the CPUS executing code being modified (which can -cause the CPU to do undesirable things), and the nops are +When tracing is enabled, the process of modifying the function +tracepoints is dependent on architecture. The old method is to use +kstop_machine to prevent races with the CPUs executing code being +modified (which can cause the CPU to do undesirable things, especially +if the modified code crosses cache (or page) boundaries), and the nops are patched back to calls. But this time, they do not call mcount (which is just a function stub). They now call into the ftrace infrastructure. +The new method of modifying the function tracepoints is to place +a breakpoint at the location to be modified, sync all CPUs, modify +the rest of the instruction not covered by the breakpoint. Sync +all CPUs again, and then remove the breakpoint with the finished +version to the ftrace call site. + +Some archs do not even need to monkey around with the synchronization, +and can just slap the new code on top of the old without any +problems with other CPUs executing it at the same time. + One special side-effect to the recording of the functions being traced is that we can now selectively choose which functions we wish to trace and which ones we want the mcount calls to remain @@ -1530,20 +2099,28 @@ mutex_lock If I am only interested in sys_nanosleep and hrtimer_interrupt: - # echo sys_nanosleep hrtimer_interrupt \ - > set_ftrace_filter + # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter # echo function > current_tracer # echo 1 > tracing_on # usleep 1 # echo 0 > tracing_on # cat trace -# tracer: ftrace +# tracer: function +# +# entries-in-buffer/entries-written: 5/5 #P:4 # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt - usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call - <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath + <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt + usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt + <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt + <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt To see which functions are being traced, you can cat the file: @@ -1571,20 +2148,25 @@ Note: It is better to use quotes to enclose the wild cards, Produces: -# tracer: ftrace +# tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4003 [00] 1480.611794: hrtimer_init <-copy_process - bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set - bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear - bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel - <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt - +# entries-in-buffer/entries-written: 897/897 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel + <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer + <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit + <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 + <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt + <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter + <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem Notice that we lost the sys_nanosleep. @@ -1651,19 +2233,29 @@ traced. Produces: -# tracer: ftrace +# tracer: function +# +# entries-in-buffer/entries-written: 39608/39608 #P:4 # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4043 [01] 115.281644: finish_task_switch <-schedule - bash-4043 [01] 115.281645: hrtick_set <-schedule - bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set - bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run - bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion - bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run - bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop - bash-4043 [01] 115.281648: wake_up_process <-kthread_stop - bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open + bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last + bash-1994 [000] .... 4342.324897: ima_file_check <-do_last + bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check + bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement + bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action + bash-1994 [000] .... 4342.324899: do_truncate <-do_last + bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate + bash-1994 [000] .... 4342.324899: notify_change <-do_truncate + bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change + bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time + bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time We can see that there's no more lock or preempt tracing. @@ -1729,6 +2321,28 @@ this special filter via: echo > set_graph_function +ftrace_enabled +-------------- + +Note, the proc sysctl ftrace_enable is a big on/off switch for the +function tracer. By default it is enabled (when function tracing is +enabled in the kernel). If it is disabled, all function tracing is +disabled. This includes not only the function tracers for ftrace, but +also for any other uses (perf, kprobes, stack tracing, profiling, etc). + +Please disable this with care. + +This can be disable (and enabled) with: + + sysctl kernel.ftrace_enabled=0 + sysctl kernel.ftrace_enabled=1 + + or + + echo 0 > /proc/sys/kernel/ftrace_enabled + echo 1 > /proc/sys/kernel/ftrace_enabled + + Filter commands --------------- @@ -1763,12 +2377,58 @@ The following commands are supported: echo '__schedule_bug:traceoff:5' > set_ftrace_filter + To always disable tracing when __schedule_bug is hit: + + echo '__schedule_bug:traceoff' > set_ftrace_filter + These commands are cumulative whether or not they are appended to set_ftrace_filter. To remove a command, prepend it by '!' and drop the parameter: + echo '!__schedule_bug:traceoff:0' > set_ftrace_filter + + The above removes the traceoff command for __schedule_bug + that have a counter. To remove commands without counters: + echo '!__schedule_bug:traceoff' > set_ftrace_filter +- snapshot + Will cause a snapshot to be triggered when the function is hit. + + echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter + + To only snapshot once: + + echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter + + To remove the above commands: + + echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter + echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter + +- enable_event/disable_event + These commands can enable or disable a trace event. Note, because + function tracing callbacks are very sensitive, when these commands + are registered, the trace point is activated, but disabled in + a "soft" mode. That is, the tracepoint will be called, but + just will not be traced. The event tracepoint stays in this mode + as long as there's a command that triggers it. + + echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \ + set_ftrace_filter + + The format is: + + <function>:enable_event:<system>:<event>[:count] + <function>:disable_event:<system>:<event>[:count] + + To remove the events commands: + + + echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \ + set_ftrace_filter + echo '!schedule:disable_event:sched:sched_switch' > \ + set_ftrace_filter trace_pipe ---------- @@ -1787,28 +2447,31 @@ different. The trace is live. # cat trace # tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | +# entries-in-buffer/entries-written: 0/0 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | # # cat /tmp/trace.out - bash-4043 [00] 41.267106: finish_task_switch <-schedule - bash-4043 [00] 41.267106: hrtick_set <-schedule - bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set - bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run - bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion - bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run - bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop - bash-4043 [00] 41.267110: wake_up_process <-kthread_stop - bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process - bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up + bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write + bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock + bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify + bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify + bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify + bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock + bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock + bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify + bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath Note, reading the trace_pipe file will block until more input is -added. By changing the tracer, trace_pipe will issue an EOF. We -needed to set the function tracer _before_ we "cat" the -trace_pipe file. - +added. trace entries ------------- @@ -1817,31 +2480,50 @@ Having too much or not enough data can be troublesome in diagnosing an issue in the kernel. The file buffer_size_kb is used to modify the size of the internal trace buffers. The number listed is the number of entries that can be recorded per -CPU. To know the full size, multiply the number of possible CPUS +CPU. To know the full size, multiply the number of possible CPUs with the number of entries. # cat buffer_size_kb 1408 (units kilobytes) -Note, to modify this, you must have tracing completely disabled. -To do that, echo "nop" into the current_tracer. If the -current_tracer is not set to "nop", an EINVAL error will be -returned. +Or simply read buffer_total_size_kb + + # cat buffer_total_size_kb +5632 + +To modify the buffer, simple echo in a number (in 1024 byte segments). - # echo nop > current_tracer # echo 10000 > buffer_size_kb # cat buffer_size_kb 10000 (units kilobytes) -The number of pages which will be allocated is limited to a -percentage of available memory. Allocating too much will produce -an error. +It will try to allocate as much as possible. If you allocate too +much, it can cause Out-Of-Memory to trigger. # echo 1000000000000 > buffer_size_kb -bash: echo: write error: Cannot allocate memory # cat buffer_size_kb 85 +The per_cpu buffers can be changed individually as well: + + # echo 10000 > per_cpu/cpu0/buffer_size_kb + # echo 100 > per_cpu/cpu1/buffer_size_kb + +When the per_cpu buffers are not the same, the buffer_size_kb +at the top level will just show an X + + # cat buffer_size_kb +X + +This is where the buffer_total_size_kb is useful: + + # cat buffer_total_size_kb +12916 + +Writing to the top level buffer_size_kb will reset all the buffers +to be the same again. + Snapshot -------- CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature @@ -1873,7 +2555,7 @@ feature: status\input | 0 | 1 | else | --------------+------------+------------+------------+ - not allocated |(do nothing)| alloc+swap | EINVAL | + not allocated |(do nothing)| alloc+swap |(do nothing)| --------------+------------+------------+------------+ allocated | free | swap | clear | --------------+------------+------------+------------+ @@ -1925,7 +2607,188 @@ bash: echo: write error: Device or resource busy # cat snapshot cat: snapshot: Device or resource busy + +Instances +--------- +In the debugfs tracing directory is a directory called "instances". +This directory can have new directories created inside of it using +mkdir, and removing directories with rmdir. The directory created +with mkdir in this directory will already contain files and other +directories after it is created. + + # mkdir instances/foo + # ls instances/foo +buffer_size_kb buffer_total_size_kb events free_buffer per_cpu +set_event snapshot trace trace_clock trace_marker trace_options +trace_pipe tracing_on + +As you can see, the new directory looks similar to the tracing directory +itself. In fact, it is very similar, except that the buffer and +events are agnostic from the main director, or from any other +instances that are created. + +The files in the new directory work just like the files with the +same name in the tracing directory except the buffer that is used +is a separate and new buffer. The files affect that buffer but do not +affect the main buffer with the exception of trace_options. Currently, +the trace_options affect all instances and the top level buffer +the same, but this may change in future releases. That is, options +may become specific to the instance they reside in. + +Notice that none of the function tracer files are there, nor is +current_tracer and available_tracers. This is because the buffers +can currently only have events enabled for them. + + # mkdir instances/foo + # mkdir instances/bar + # mkdir instances/zoot + # echo 100000 > buffer_size_kb + # echo 1000 > instances/foo/buffer_size_kb + # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb + # echo function > current_trace + # echo 1 > instances/foo/events/sched/sched_wakeup/enable + # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable + # echo 1 > instances/foo/events/sched/sched_switch/enable + # echo 1 > instances/bar/events/irq/enable + # echo 1 > instances/zoot/events/syscalls/enable + # cat trace_pipe +CPU:2 [LOST 11745 EVENTS] + bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist + bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave + bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist + bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist + bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock + bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype + bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist + bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist + bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics + bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics + bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process +[...] + + # cat instances/foo/trace_pipe + bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 + bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 + <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003 + <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120 + rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 + bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 + bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 + bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120 + kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001 + kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120 +[...] + + # cat instances/bar/trace_pipe + migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX] + <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX] + bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER] + bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU] + bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER] + bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER] + bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU] + bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU] + sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4 + sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled + sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0 + sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled +[...] + + # cat instances/zoot/trace +# tracer: nop +# +# entries-in-buffer/entries-written: 18996/18996 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1998 [000] d... 140.733501: sys_write -> 0x2 + bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1) + bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1 + bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0) + bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1 + bash-1998 [000] d... 140.733510: sys_close(fd: a) + bash-1998 [000] d... 140.733510: sys_close -> 0x0 + bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8) + bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0 + bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8) + bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0 + +You can see that the trace of the top most trace buffer shows only +the function tracing. The foo instance displays wakeups and task +switches. + +To remove the instances, simply delete their directories: + + # rmdir instances/foo + # rmdir instances/bar + # rmdir instances/zoot + +Note, if a process has a trace file open in one of the instance +directories, the rmdir will fail with EBUSY. + + +Stack trace ----------- +Since the kernel has a fixed sized stack, it is important not to +waste it in functions. A kernel developer must be conscience of +what they allocate on the stack. If they add too much, the system +can be in danger of a stack overflow, and corruption will occur, +usually leading to a system panic. + +There are some tools that check this, usually with interrupts +periodically checking usage. But if you can perform a check +at every function call that will become very useful. As ftrace provides +a function tracer, it makes it convenient to check the stack size +at every function call. This is enabled via the stack tracer. + +CONFIG_STACK_TRACER enables the ftrace stack tracing functionality. +To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled. + + # echo 1 > /proc/sys/kernel/stack_tracer_enabled + +You can also enable it from the kernel command line to trace +the stack size of the kernel during boot up, by adding "stacktrace" +to the kernel command line parameter. + +After running it for a few minutes, the output looks like: + + # cat stack_max_size +2928 + + # cat stack_trace + Depth Size Location (18 entries) + ----- ---- -------- + 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac + 1) 2704 160 find_busiest_group+0x31/0x1f1 + 2) 2544 256 load_balance+0xd9/0x662 + 3) 2288 80 idle_balance+0xbb/0x130 + 4) 2208 128 __schedule+0x26e/0x5b9 + 5) 2080 16 schedule+0x64/0x66 + 6) 2064 128 schedule_timeout+0x34/0xe0 + 7) 1936 112 wait_for_common+0x97/0xf1 + 8) 1824 16 wait_for_completion+0x1d/0x1f + 9) 1808 128 flush_work+0xfe/0x119 + 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20 + 11) 1664 48 input_available_p+0x1d/0x5c + 12) 1616 48 n_tty_poll+0x6d/0x134 + 13) 1568 64 tty_poll+0x64/0x7f + 14) 1504 880 do_select+0x31e/0x511 + 15) 624 400 core_sys_select+0x177/0x216 + 16) 224 96 sys_select+0x91/0xb9 + 17) 128 128 system_call_fastpath+0x16/0x1b + +Note, if -mfentry is being used by gcc, functions get traced before +they set up the stack frame. This means that leaf level functions +are not tested by the stack tracer when -mfentry is used. + +Currently, -mfentry is used by gcc 4.6.0 and above on x86 only. + +--------- More details can be found in the source code, in the kernel/trace/*.c files. diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt index 24ce6823a09e..d9c3e682312c 100644 --- a/Documentation/trace/uprobetracer.txt +++ b/Documentation/trace/uprobetracer.txt @@ -1,6 +1,8 @@ - Uprobe-tracer: Uprobe-based Event Tracing - ========================================= - Documentation written by Srikar Dronamraju + Uprobe-tracer: Uprobe-based Event Tracing + ========================================= + + Documentation written by Srikar Dronamraju + Overview -------- @@ -13,78 +15,94 @@ current_tracer. Instead of that, add probe points via /sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled. However unlike kprobe-event tracer, the uprobe event interface expects the -user to calculate the offset of the probepoint in the object +user to calculate the offset of the probepoint in the object. Synopsis of uprobe_tracer ------------------------- - p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a probe + p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a uprobe + r[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a return uprobe (uretprobe) + -:[GRP/]EVENT : Clear uprobe or uretprobe event - GRP : Group name. If omitted, use "uprobes" for it. - EVENT : Event name. If omitted, the event name is generated - based on SYMBOL+offs. - PATH : path to an executable or a library. - SYMBOL[+offs] : Symbol+offset where the probe is inserted. + GRP : Group name. If omitted, "uprobes" is the default value. + EVENT : Event name. If omitted, the event name is generated based + on SYMBOL+offs. + PATH : Path to an executable or a library. + SYMBOL[+offs] : Symbol+offset where the probe is inserted. - FETCHARGS : Arguments. Each probe can have up to 128 args. - %REG : Fetch register REG + FETCHARGS : Arguments. Each probe can have up to 128 args. + %REG : Fetch register REG Event Profiling --------------- - You can check the total number of probe hits and probe miss-hits via +You can check the total number of probe hits and probe miss-hits via /sys/kernel/debug/tracing/uprobe_profile. - The first column is event name, the second is the number of probe hits, +The first column is event name, the second is the number of probe hits, the third is the number of probe miss-hits. Usage examples -------------- -To add a probe as a new event, write a new definition to uprobe_events -as below. + * Add a probe as a new uprobe event, write a new definition to uprobe_events +as below: (sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash) + + echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + + * Add a probe as a new uretprobe event: + + echo 'r: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + + * Unset registered event: - echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + echo '-:bash_0x4245c0' >> /sys/kernel/debug/tracing/uprobe_events - This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash + * Print out the events that are registered: - echo > /sys/kernel/debug/tracing/uprobe_events + cat /sys/kernel/debug/tracing/uprobe_events - This clears all probe points. + * Clear all events: -The following example shows how to dump the instruction pointer and %ax -a register at the probed text address. Here we are trying to probe -function zfree in /bin/zsh + echo > /sys/kernel/debug/tracing/uprobe_events + +Following example shows how to dump the instruction pointer and %ax register +at the probed text address. Probe zfree function in /bin/zsh: # cd /sys/kernel/debug/tracing/ - # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp + # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh # objdump -T /bin/zsh | grep -w zfree 0000000000446420 g DF .text 0000000000000012 Base zfree -0x46420 is the offset of zfree in object /bin/zsh that is loaded at -0x00400000. Hence the command to probe would be : + 0x46420 is the offset of zfree in object /bin/zsh that is loaded at + 0x00400000. Hence the command to uprobe would be: + + # echo 'p:zfree_entry /bin/zsh:0x46420 %ip %ax' > uprobe_events + + And the same for the uretprobe would be: - # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events + # echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> uprobe_events -Please note: User has to explicitly calculate the offset of the probepoint +Please note: User has to explicitly calculate the offset of the probe-point in the object. We can see the events that are registered by looking at the uprobe_events file. # cat uprobe_events - p:uprobes/p_zsh_0x46420 /bin/zsh:0x00046420 arg1=%ip arg2=%ax + p:uprobes/zfree_entry /bin/zsh:0x00046420 arg1=%ip arg2=%ax + r:uprobes/zfree_exit /bin/zsh:0x00046420 arg1=%ip arg2=%ax -The format of events can be seen by viewing the file events/uprobes/p_zsh_0x46420/format +Format of events can be seen by viewing the file events/uprobes/zfree_entry/format - # cat events/uprobes/p_zsh_0x46420/format - name: p_zsh_0x46420 + # cat events/uprobes/zfree_entry/format + name: zfree_entry ID: 922 format: - field:unsigned short common_type; offset:0; size:2; signed:0; - field:unsigned char common_flags; offset:2; size:1; signed:0; - field:unsigned char common_preempt_count; offset:3; size:1; signed:0; - field:int common_pid; offset:4; size:4; signed:1; - field:int common_padding; offset:8; size:4; signed:1; + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; + field:int common_pid; offset:4; size:4; signed:1; + field:int common_padding; offset:8; size:4; signed:1; - field:unsigned long __probe_ip; offset:12; size:4; signed:0; - field:u32 arg1; offset:16; size:4; signed:0; - field:u32 arg2; offset:20; size:4; signed:0; + field:unsigned long __probe_ip; offset:12; size:4; signed:0; + field:u32 arg1; offset:16; size:4; signed:0; + field:u32 arg2; offset:20; size:4; signed:0; print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2 @@ -94,6 +112,7 @@ events, you need to enable it by: # echo 1 > events/uprobes/enable Lets disable the event after sleeping for some time. + # sleep 20 # echo 0 > events/uprobes/enable @@ -104,10 +123,11 @@ And you can see the traced information via /sys/kernel/debug/tracing/trace. # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | - zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - -Each line shows us probes were triggered for a pid 24842 with ip being -0x446421 and contents of ax register being 79. + zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79 + zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0 + zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79 + zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0 + +Output shows us uprobe was triggered for a pid 24842 with ip being 0x446420 +and contents of ax register being 79. And uretprobe was triggered with ip at +0x446540 with counterpart function entry at 0x446420. diff --git a/Documentation/usb/power-management.txt b/Documentation/usb/power-management.txt index 4204eb01fd38..1392b61d6ebe 100644 --- a/Documentation/usb/power-management.txt +++ b/Documentation/usb/power-management.txt @@ -33,6 +33,10 @@ built with CONFIG_USB_SUSPEND enabled (which depends on CONFIG_PM_RUNTIME). System PM support is present only if the kernel was built with CONFIG_SUSPEND or CONFIG_HIBERNATION enabled. +(Starting with the 3.10 kernel release, dynamic PM support for USB is +present whenever the kernel was built with CONFIG_PM_RUNTIME enabled. +The CONFIG_USB_SUSPEND option has been eliminated.) + What is Remote Wakeup? ---------------------- @@ -206,10 +210,8 @@ initialized to 5. (The idle-delay values for already existing devices will not be affected.) Setting the initial default idle-delay to -1 will prevent any -autosuspend of any USB device. This is a simple alternative to -disabling CONFIG_USB_SUSPEND and rebuilding the kernel, and it has the -added benefit of allowing you to enable autosuspend for selected -devices. +autosuspend of any USB device. This has the benefit of allowing you +then to enable autosuspend for selected devices. Warnings diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting index 706d7ed9d8d2..8eaa2fc4b8fa 100644 --- a/Documentation/vm/overcommit-accounting +++ b/Documentation/vm/overcommit-accounting @@ -8,7 +8,9 @@ The Linux kernel supports the following overcommit handling modes default. 1 - Always overcommit. Appropriate for some scientific - applications. + applications. Classic example is code using sparse arrays + and just relying on the virtual memory consisting almost + entirely of zero pages. 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a @@ -18,6 +20,10 @@ The Linux kernel supports the following overcommit handling modes pages but will receive errors on memory allocation as appropriate. + Useful for applications that want to guarantee their + memory allocations will be available in the future + without having to initialize every page. + The overcommit policy is set via the sysctl `vm.overcommit_memory'. The overcommit percentage is set via `vm.overcommit_ratio'. |