summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* tracing: Consolidate max_tr into main trace_array structureSteven Rostedt (Red Hat)2013-03-1514-286/+365
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the way the latency tracers and snapshot feature works is to have a separate trace_array called "max_tr" that holds the snapshot buffer. For latency tracers, this snapshot buffer is used to swap the running buffer with this buffer to save the current max latency. The only items needed for the max_tr is really just a copy of the buffer itself, the per_cpu data pointers, the time_start timestamp that states when the max latency was triggered, and the cpu that the max latency was triggered on. All other fields in trace_array are unused by the max_tr, making the max_tr mostly bloat. This change removes the max_tr completely, and adds a new structure called trace_buffer, that holds the buffer pointer, the per_cpu data pointers, the time_start timestamp, and the cpu where the latency occurred. The trace_array, now has two trace_buffers, one for the normal trace and one for the max trace or snapshot. By doing this, not only do we remove the bloat from the max_trace but the instances of traces can now use their own snapshot feature and not have just the top level global_trace have the snapshot feature and latency tracers for itself. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Enable snapshot when any latency tracer is enabledSteven Rostedt (Red Hat)2013-03-151-0/+3
| | | | | | | | | | | | | | The snapshot utility is extremely useful, and does not add any more overhead in memory when another latency tracer is enabled. They use the snapshot underneath. There's no reason to hide the snapshot file when a latency tracer has been enabled in the kernel. If any of the latency tracers (irq, preempt or wakeup) is enabled then also select the snapshot facility. Note, snapshot can be enabled without the latency tracers enabled. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Clear all trace buffers when unloaded module event was usedSteven Rostedt (Red Hat)2013-03-153-6/+16
| | | | | | | | | | | | Currently we do not know what buffer a module event was enabled in. On unload, it is safest to clear all buffer instances, not just the top level buffer. Todo: Clear only the buffer that the event was used in. The infrastructure is there to do this, but it makes the code a bit more complex. Lets get the current code vetted before we add that. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Only clear trace buffer on module unload if event was tracedSteven Rostedt (Red Hat)2013-03-152-4/+13
| | | | | | | | | | | | | | Currently, when a module with events is unloaded, the trace buffer is cleared. This is just a safety net in case the module might have some strange callback when its event is outputted. But there's no reason to reset the buffer if the module didn't have any of its events traced. Add a flag to the event "call" structure called WAS_ENABLED and gets set when the event is ever enabled, and this flag never gets cleared. When a module gets unloaded, if any of its events have this flag set, then the trace buffer will get cleared. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add comment for trace event flag IGNORE_ENABLESteven Rostedt (Red Hat)2013-03-151-0/+1
| | | | | | | | | | | | | | All the trace event flags have comments but the IGNORE_ENABLE flag which is set for ftrace internal events that should not be enabled via the debugfs "enable" file. That is, if the top level enable file is set, it will enable all events. It use to just check the ftrace event call descriptor "reg" field and skip those whithout it, but now some ftrace internal events have a reg field but still need to be skipped. The flag was created to ignore those events. Now document it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* ring-buffer: Init waitqueue for blocked readersSteven Rostedt (Red Hat)2013-03-151-0/+2
| | | | | | | | | | The move of blocked readers to the ring buffer left out the init of the wait queue that is used. Tests missed this due to running stress tests against the buffers, which didn't allow for any readers to end up waiting. Running a simple read and wait triggered a bug. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix some section mismatch warningsLi Zefan2013-03-153-4/+4
| | | | | | | | | | | | As we've added __init annotation to field-defining functions, we should add __refdata annotation to event_call variables, which reference those functions. Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix trace events build without modulesSteven Rostedt (Red Hat)2013-03-151-19/+36
| | | | | | | | | | | | | | | | | The new multi-buffers added a descriptor that kept track of module events, and the directories they use, with struct ftace_module_file_ops. This is used to add a ref count to keep modules from unloading while their files are being accessed. As the descriptor is only needed when CONFIG_MODULES is enabled, it is only declared when the config is enabled. But that struct is dereferenced in a few areas outside the #ifdef CONFIG_MODULES. By adding some helper routines and moving code around a little, events can be compiled again without modules. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add __per_cpu annotation to trace array percpu data pointerSteven Rostedt (Red Hat)2013-03-151-1/+1
| | | | | | | | | | | With the conversion of the data array to per cpu, sparse now complains about the use of per_cpu_ptr() on the variable. But The variable is allocated with alloc_percpu() and is fine to use. But since the structure that contains the data variable does not annotate it as such, sparse gives out a lot of false warnings. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing/syscalls: Annotate field-defining functions with __initLi Zefan2013-03-151-2/+2
| | | | | | | | | These two functions are called during kernel boot only. Link: http://lkml.kernel.org/r/51258796.7020704@huawei.com Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Annotate event field-defining functions with __initLi Zefan2013-03-152-2/+2
| | | | | | | | | | | | | | | | | | | | | | | Those functions are called either during kernel boot or module init. Before: $ dmesg | grep 'Freeing unused kernel memory' Freeing unused kernel memory: 1208k freed Freeing unused kernel memory: 1360k freed Freeing unused kernel memory: 1960k freed After: $ dmesg | grep 'Freeing unused kernel memory' Freeing unused kernel memory: 1236k freed Freeing unused kernel memory: 1388k freed Freeing unused kernel memory: 1960k freed Link: http://lkml.kernel.org/r/5125877D.5000201@huawei.com Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add a helper function for event print functionsLi Zefan2013-03-153-19/+38
| | | | | | | | | | | | | | | Move duplicate code in event print functions to a helper function. This shrinks the size of the kernel by ~13K. text data bss dec hex filename 6596137 1743966 10138672 18478775 119f6b7 vmlinux.o.old 6583002 1743849 10138672 18465523 119c2f3 vmlinux.o.new Link: http://lkml.kernel.org/r/51258746.2060304@huawei.com Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing/ring-buffer: Move poll wake ups into ring buffer codeSteven Rostedt (Red Hat)2013-03-153-71/+164
| | | | | | | | | | Move the logic to wake up on ring buffer data into the ring buffer code itself. This simplifies the tracing code a lot and also has the added benefit that waiters on one of the instance buffers can be woken only when data is added to that instance instead of data added to any instance. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix read blocking on trace_pipe_rawSteven Rostedt2013-03-151-3/+15
| | | | | | | | | | | If the ring buffer is empty, a read to trace_pipe_raw wont block. The tracing code has the infrastructure to wake up waiting readers, but the trace_pipe_raw doesn't take advantage of that. When a read is done to trace_pipe_raw without the O_NONBLOCK flag set, have the read block until there's data in the requested buffer. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix polling on trace_pipe_rawSteven Rostedt2013-03-151-27/+51
| | | | | | | | | | | The trace_pipe_raw never implemented polling and this was casing issues for several utilities. This is now implemented. Blocked reads still are on the TODO list. Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com> Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Do not block on splice if either file or splice NONBLOCK flag is setSteven Rostedt (Red Hat)2013-03-151-1/+1
| | | | | | | | Currently only the splice NONBLOCK flag is checked to determine if the splice read should block or not. But the file descriptor NONBLOCK flag also needs to be checked. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Use direct field, type and system namesSteven Rostedt2013-03-152-19/+5
| | | | | | | | | | | | | The names used to display the field and type in the event format files are copied, as well as the system name that is displayed. All these names are created by constant values passed in. If one of theses values were to be removed by a module, the module would also be required to remove any event it created. By using the strings directly, we can save over 100K of memory. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Use kmem_cache_alloc instead of kmalloc in trace_events.cSteven Rostedt2013-03-151-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The event structures used by the trace events are mostly persistent, but they are also allocated by kmalloc, which is not the best at allocating space for what is used. By converting these kmallocs into kmem_cache_allocs, we can save over 50K of space that is permanently allocated. After boot we have: slab name active allocated size --------- ------ --------- ---- ftrace_event_file 979 1005 56 67 1 ftrace_event_field 2301 2310 48 77 1 The ftrace_event_file has at boot up 979 active objects out of 1005 allocated in the slabs. Each object is 56 bytes. In a normal kmalloc, that would allocate 64 bytes for each object. 1005 - 979 = 26 objects not used 26 * 56 = 1456 bytes wasted But if we used kmalloc: 64 - 56 = 8 bytes unused per allocation 8 * 979 = 7832 bytes wasted 7832 - 1456 = 6376 bytes in savings Doing the same for ftrace_event_field where there's 2301 objects allocated in a slab that can hold 2310 with 48 bytes each we have: 2310 - 2301 = 9 objects not used 9 * 48 = 432 bytes wasted A kmalloc would also use 64 bytes per object: 64 - 48 = 16 bytes unused per allocation 16 * 2301 = 36816 bytes wasted! 36816 - 432 = 36384 bytes in savings This change gives us a total of 42760 bytes in savings. At least on my machine, but as there's a lot of these persistent objects for all configurations that use trace points, this is a net win. Thanks to Ezequiel Garcia for his trace_analyze presentation which pointed out the wasted space in my code. Cc: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Get trace_events kernel command line working againSteven Rostedt2013-03-151-7/+136
| | | | | | | | | | | | | | | | | | | | | | With the new descriptors used to allow multiple buffers in the tracing directory added, the kernel command line parameter trace_events=... no longer works. This is because the top level (global) trace array now has a list of descriptors associated with the events and the files in the debugfs directory. But in early bootup, when the command line is processed and the events enabled, the trace array list of events has not been set up yet. Without the list of events in the trace array, the setting of events to record will fail because it would not match any events. The solution is to set up the top level array in two stages. The first is to just add the ftrace file descriptors that just point to the events. This will allow events to be enabled and start tracing. The second stage is called after the filesystem is set up, and this stage will create the debugfs event files and directories associated with the trace array events. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add rmdir to remove multibuffer instancesSteven Rostedt2013-03-153-0/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add a method to the hijacked dentry descriptor of the "instances" directory to allow for rmdir to remove an instance of a multibuffer. Example: cd /debug/tracing/instances mkdir hello ls hello/ rmdir hello ls Like the mkdir method, the i_mutex is dropped for the instances directory. The instances directory is created at boot up and can not be renamed or removed. The trace_types_lock mutex is used to synchronize adding and removing of instances. I've run several stress tests with different threads trying to create and delete directories of the same name, and it has stood up fine. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add interface to allow multiple trace buffersSteven Rostedt2013-03-153-1/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the interface ("instances" directory) to add multiple buffers to ftrace. To create a new instance, simply do a mkdir in the instances directory: This will create a directory with the following: # cd instances # mkdir foo # ls foo buffer_size_kb free_buffer trace_clock trace_pipe buffer_total_size_kb set_event trace_marker tracing_enabled events/ trace trace_options tracing_on Currently only events are able to be set, and there isn't a way to delete a buffer when one is created (yet). Note, the i_mutex lock is dropped from the parent "instances" directory during the mkdir operation. As the "instances" directory can not be renamed or deleted (created on boot), I do not see any harm in dropping the lock. The creation of the sub directories is protected by trace_types_lock mutex, which only lets one instance get into the code path at a time. If two tasks try to create or delete directories of the same name, only one will occur and the other will fail with -EEXIST. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Make syscall events suitable for multiple buffersSteven Rostedt2013-03-152-34/+57
| | | | | | | | | | | | | | | Currently the syscall events record into the global buffer. But if multiple buffers are in place, then we need to have syscall events record in the proper buffers. By adding descriptors to pass to the syscall event functions, the syscall events can now record into the buffers that have been assigned to them (one event may be applied to mulitple buffers). This will allow tracing high volume syscalls along with seldom occurring syscalls without losing the seldom syscall events. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Replace the static global per_cpu arrays with allocated per_cpuSteven Rostedt2013-03-1510-61/+79
| | | | | | | | | The global and max-tr currently use static per_cpu arrays for the CPU data descriptors. But in order to get new allocated trace_arrays, they need to be allocated per_cpu arrays. Instead of using the static arrays, switch the global and max-tr to use allocated data. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Pass the ftrace_file to the buffer lock reserve codeSteven Rostedt2013-03-153-4/+24
| | | | | | | | | | | | | | Pass the struct ftrace_event_file *ftrace_file to the trace_event_buffer_lock_reserve() (new function that replaces the trace_current_buffer_lock_reserver()). The ftrace_file holds a pointer to the trace_array that is in use. In the case of multiple buffers with different trace_arrays, this allows different events to be recorded into different buffers. Also fixed some of the stale comments in include/trace/ftrace.h Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Encapsulate global_trace and remove dependencies on global varsSteven Rostedt2013-03-154-240/+358
| | | | | | | | | | | | The global_trace variable in kernel/trace/trace.c has been kept 'static' and local to that file so that it would not be used too much outside of that file. This has paid off, even though there were lots of changes to make the trace_array structure more generic (not depending on global_trace). Removal of a lot of direct usages of global_trace is needed to be able to create more trace_arrays such that we can add multiple buffers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Use RING_BUFFER_ALL_CPUS for TRACE_PIPE_ALL_CPUSteven Rostedt2013-03-153-18/+16
| | | | | | | | | | | | | | Both RING_BUFFER_ALL_CPUS and TRACE_PIPE_ALL_CPU are defined as -1 and used to say that all the ring buffers are to be modified or read (instead of just a single cpu, which would be >= 0). There's no reason to keep TRACE_PIPE_ALL_CPU as it is also started to be used for more than what it was created for, and now that the ring buffer code added a generic RING_BUFFER_ALL_CPUS define, we can clean up the trace code to use that instead and remove the TRACE_PIPE_ALL_CPU macro. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Separate out trace events from global variablesSteven Rostedt2013-03-156-260/+622
| | | | | | | | | | | | | | | | The trace events for ftrace are all defined via global variables. The arrays of events and event systems are linked to a global list. This prevents multiple users of the event system (what to enable and what not to). By adding descriptors to represent the event/file relation, as well as to which trace_array descriptor they are associated with, allows for more than one set of events to be defined. Once the trace events files have a link between the trace event and the trace_array they are associated with, we can create multiple trace_arrays that can record separate events in separate buffers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Prevent buffer overwrite disabled for latency tracersSteven Rostedt (Red Hat)2013-03-154-16/+65
| | | | | | | | | | | | | The latency tracers require the buffers to be in overwrite mode, otherwise they get screwed up. Force the buffers to stay in overwrite mode when latency tracers are enabled. Added a flag_changed() method to the tracer structure to allow the tracers to see what flags are being changed, and also be able to prevent the change from happing. Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Keep overwrite in sync between regular and snapshot buffersSteven Rostedt (Red Hat)2013-03-151-1/+5
| | | | | | | | | | | | | Changing the overwrite mode for the ring buffer via the trace option only sets the normal buffer. But the snapshot buffer could swap with it, and then the snapshot would be in non overwrite mode and the normal buffer would be in overwrite mode, even though the option flag states otherwise. Keep the two buffers overwrite modes in sync. Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Protect tracer flags with trace_types_lockSteven Rostedt (Red Hat)2013-03-141-4/+8
| | | | | | | | | | | | | Seems that the tracer flags have never been protected from synchronous writes. Luckily, admins don't usually modify the tracing flags via two different tasks. But if scripts were to be used to modify them, then they could get corrupted. Move the trace_types_lock that protects against tracers changing to also protect the flags being set. Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix free of probe entry by calling call_rcu_sched()Steven Rostedt (Red Hat)2013-03-131-2/+2
| | | | | | | | | | | | | | | | | Because function tracing is very invasive, and can even trace calls to rcu_read_lock(), RCU access in function tracing is done with preempt_disable_notrace(). This requires a synchronize_sched() for updates and not a synchronize_rcu(). Function probes (traceon, traceoff, etc) must be freed after a synchronize_sched() after its entry has been removed from the hash. But call_rcu() is used. Fix this by using call_rcu_sched(). Also fix the usage to use hlist_del_rcu() instead of hlist_del(). Cc: stable@vger.kernel.org Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Fix race in snapshot swappingSteven Rostedt (Red Hat)2013-03-121-1/+2
| | | | | | | | | | | | | | | | | | | | | Although the swap is wrapped with a spin_lock, the assignment of the temp buffer used to swap is not within that lock. It needs to be moved into that lock, otherwise two swaps happening on two different CPUs, can end up using the wrong temp buffer to assign in the swap. Luckily, all current callers of the swap function appear to have their own locks. But in case something is added that allows two different callers to call the swap, then there's a chance that this race can trigger and corrupt the buffers. New code is coming soon that will allow for this race to trigger. I've Cc'd stable, so this bug will not show up if someone backports one of the changes that can trigger this bug. Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: update documentation of snapshot utilityHiraku Toyooka2013-03-081-1/+1
| | | | | | | | | | | Now, "snapshot" file returns success on a reset of snapshot buffer even if the buffer wasn't allocated, instead of returning EINVAL. This patch updates snapshot desctiption according to the change. Link: http://lkml.kernel.org/r/51399409.4090207@hitachi.com Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Do not return EINVAL in snapshot when not allocatedSteven Rostedt (Red Hat)2013-03-071-2/+0
| | | | | | | | | | | | | | | | | | To use the tracing snapshot feature, writing a '1' into the snapshot file causes the snapshot buffer to be allocated if it has not already been allocated and dose a 'swap' with the main buffer, so that the snapshot now contains what was in the main buffer, and the main buffer now writes to what was the snapshot buffer. To free the snapshot buffer, a '0' is written into the snapshot file. To clear the snapshot buffer, any number but a '0' or '1' is written into the snapshot file. But if the file is not allocated it returns -EINVAL error code. This is rather pointless. It is better just to do nothing and return success. Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Add help of snapshot feature when snapshot is emptySteven Rostedt (Red Hat)2013-03-071-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When cat'ing the snapshot file, instead of showing an empty trace header like the trace file does, show how to use the snapshot feature. Also, this is a good place to show if the snapshot has been allocated or not. Users may want to "pre allocate" the snapshot to have a fast "swap" of the current buffer. Otherwise, a swap would be slow and might fail as it would need to allocate the snapshot buffer, and that might fail under tight memory constraints. Here's what it looked like before: # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | Here's what it looks like now: # tracer: nop # # # * Snapshot is freed * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* Merge branch 'tip/perf/urgent' of ↵Ingo Molnar2013-02-281-10/+14
|\ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull an ftrace Kconfig help text fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * ftrace: Update the kconfig for DYNAMIC_FTRACESteven Rostedt2013-02-281-10/+14
|/ | | | | | | | | | | | | | | | | | | | | | | The prompt to enable DYNAMIC_FTRACE (the ability to nop and enable function tracing at run time) had a confusing statement: "enable/disable ftrace tracepoints dynamically" This was written before tracepoints were added to the kernel, but now that tracepoints have been added, this is very confusing and has confused people enough to give wrong information during presentations. Not only that, I looked at the help text, and it still references that dreaded daemon that use to wake up once a second to update the nop locations and brick NICs, that hasn't been around for over five years. Time to bring the text up to the current decade. Cc: stable@vger.kernel.org Reported-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* Merge branch 'tip/perf/core' of ↵Ingo Molnar2013-02-204-20/+94
|\ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull two fixes from Steven Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * ftrace: Call ftrace cleanup module notifier after all other notifiersSteven Rostedt (Red Hat)2013-02-191-14/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit: c1bf08ac "ftrace: Be first to run code modification on modules" changed ftrace module notifier's priority to INT_MAX in order to process the ftrace nops before anything else could touch them (namely kprobes). This was the correct thing to do. Unfortunately, the ftrace module notifier also contains the ftrace clean up code. As opposed to the set up code, this code should be run *after* all the module notifiers have run in case a module is doing correct clean-up and unregisters its ftrace hooks. Basically, ftrace needs to do clean up on module removal, as it needs to know about code being removed so that it doesn't try to modify that code. But after it removes the module from its records, if a ftrace user tries to remove a probe, that removal will fail due as the record of that code segment no longer exists. Nothing really bad happens if the probe removal is called after ftrace did the clean up, but the ftrace removal function will return an error. Correct code (such as kprobes) will produce a WARN_ON() if it fails to remove the probe. As people get annoyed by frivolous warnings, it's best to do the ftrace clean up after everything else. By splitting the ftrace_module_notifier into two notifiers, one that does the module load setup that is run at high priority, and the other that is called for module clean up that is run at low priority, the problem is solved. Cc: stable@vger.kernel.org Reported-by: Frank Ch. Eigler <fche@redhat.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing/syscalls: Allow archs to ignore tracing compat syscallsSteven Rostedt2013-02-123-6/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tracing of ia32 compat system calls has been a bit of a pain as they use different system call numbers than the 64bit equivalents. I wrote a simple 'lls' program that lists files. I compiled it as a i686 ELF binary and ran it under a x86_64 box. This is the result: echo 0 > /debug/tracing/tracing_on echo 1 > /debug/tracing/events/syscalls/enable echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on grep lls /debug/tracing/trace [.. skipping calls before TS_COMPAT is set ...] lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420) lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000 lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000 lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4) lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000) lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060) lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0 lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2) lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000 lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614) lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0 lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000) lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3 lls-1127 [005] d... 936.409279: sys_close(fd: 3) lls-1127 [005] d... 936.421550: sys_close -> 0x200 lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0) lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0 lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802) lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000 lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000) lls-1127 [005] d... 936.421580: sys_capget -> 0x0 lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812) lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000 lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32) lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000 lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0) lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0 lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22) Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are obviously incorrect, and confusing. Other efforts have been made to fix this: https://lkml.org/lkml/2012/3/26/367 But the real solution is to rewrite the syscall internals and come up with a fixed solution. One that doesn't require all the kluge that the current solution has. Thus for now, instead of outputting incorrect data, simply ignore them. With this patch the changes now have: #> grep lls /debug/tracing/trace #> Compat system calls simply are not traced. If users need compat syscalls, then they should just use the raw syscall tracepoints. For an architecture to make their compat syscalls ignored, it must define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also define an arch_trace_is_compat_syscall() function that will return true if the current task should ignore tracing the syscall. I want to stress that this change does not affect actual syscalls in any way, shape or form. It is only used within the tracing system and doesn't interfere with the syscall logic at all. The changes are consolidated nicely into trace_syscalls.c and asm/ftrace.h. I had to make one small modification to asm/thread_info.h and that was to remove the include of asm/ftrace.h. As asm/ftrace.h required the current_thread_info() it was causing include hell. That include was added back in 2008 when the function graph tracer was added: commit caf4b323 "tracing, x86: add low level support for ftrace return tracing" It does not need to be included there. Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | perf/x86: Add Intel IvyBridge event scheduling constraintsStephane Eranian2013-02-201-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel IvyBridge processor has different constraints compared to SandyBridge. Therefore it needs its own contraint table. This patch adds the constraint table. Without this patch, the events listed in the patch may not be scheduled correctly and bogus counts may be collected. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1361355312-3323-1-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'for-3.9-async' of ↵Linus Torvalds2013-02-209-109/+130
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull async changes from Tejun Heo: "These are followups for the earlier deadlock issue involving async ending up waiting for itself through block requesting module[1]. The following changes are made by these commits. - Instead of requesting default elevator on each request_queue init, block now requests it once early during boot. - Kmod triggers warning if invoked from an async worker. - Async synchronization implementation has been reimplemented. It's a lot simpler now." * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: async: initialise list heads to fix crash async: replace list of active domains with global list of pending items async: keep pending tasks on async_domain and remove async_pending async: use ULLONG_MAX for infinity cookie value async: bring sanity to the use of words domain and running async, kmod: warn on synchronous request_module() from async workers block: don't request module during elevator init init, block: try to load default elevator module early during boot
| * | async: initialise list heads to fix crashJames Hogan2013-01-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 9fdb04cdc55 ("async: replace list of active domains with global list of pending items") added a struct list_head global_list in struct async_entry, which isn't initialised. This means that if !domain->registered at __async_schedule(), then list_del_init() will be called on the list head in async_run_entry_fn with both pointers NULL, causing a crash. This is fixed by initialising both the global_list and domain_list list_heads after kzalloc'ing the entry. This was noticed due to dapm_power_widgets() which uses ASYNC_DOMAIN_EXCLUSIVE, which initialises the domain->registered to 0. Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: James Hogan <james.hogan@imgtec.com> Reported-by: Stephen Warren <swarren@wwwdotorg.org>
| * | async: replace list of active domains with global list of pending itemsTejun Heo2013-01-231-34/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Global synchronization - async_synchronize_full() - is currently implemented by keeping a list of all active registered domains and syncing them one by one until no domain is active. While this isn't necessarily a complex scheme, it can easily be simplified by keeping global list of the pending items of all registered active domains instead of list of domains and simply using the globl pending list the same way as domain syncing. This patch replaces async_domains with async_global_pending and update lowest_in_progress() to use the global pending list if @domain is %NULL. async_synchronize_full_domain(NULL) is now allowed and equivalent to async_synchronize_full(). As no one is calling with NULL domain, this doesn't affect any existing users. async_register_mutex is no longer necessary and dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <djbw@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * | async: keep pending tasks on async_domain and remove async_pendingTejun Heo2013-01-232-55/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Async kept single global pending list and per-domain running lists. When an async item is queued, it's put on the global pending list. The item is moved to the per-domain running list when its execution starts. At this point, this design complicates execution and synchronization without bringing any benefit. The list only matters for synchronization which doesn't care whether a given async item is pending or executing. Also, global synchronization is done by iterating through all active registered async_domains, so the global async_pending list doesn't help anything either. Rename async_domain->running to async_domain->pending and put async items directly there and remove when execution completes. This simplifies lowest_in_progress() a lot - the first item on the pending list is the one with the lowest cookie, and async_run_entry_fn() doesn't have to mess with moving the item from pending to running. After the change, whether a domain is empty or not can be trivially determined by looking at async_domain->pending. Remove async_domain->count and use list_empty() on pending instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <djbw@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * | async: use ULLONG_MAX for infinity cookie valueTejun Heo2013-01-231-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, next_cookie is used as the infinity value. In most cases, this should work fine but it theoretically could bring subtle behavior difference between async_synchronize_full() and async_synchronize_full_domain(). async_synchronize_full() keeps waiting until there's no registered async_entry left regardless of what next_cookie was when the function was called. It guarantees that the queue is completely drained at least once before returning. However, async_synchronize_full_domain() doesn't. It synchronizes upto next_cookie and if further async jobs are queued after the next_cookie value to synchronize is decided, they won't be waited for. For unrelated async jobs, the behavior difference doesn't matter; however, if async jobs which are related (nested or otherwise) to the executing ones are queued while sychronization is in progress, the resulting behavior difference could be problematic. This can be easily fixed by using ULLONG_MAX as the infinity value instead. Define ASYNC_COOKIE_MAX as ULLONG_MAX and use it as the infinity value for synchronization. This makes async_synchronize_full_domain() fully drain the domain at least once before returning, making its behavior match async_synchronize_full(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <djbw@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * | async: bring sanity to the use of words domain and runningTejun Heo2013-01-232-37/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the beginning, running lists were literal struct list_heads. Later on, struct async_domain was added. For some reason, while the conversion substituted list_heads with async_domains, the variable names weren't fully converted. In more places, "running" was used for struct async_domain while other places adopted new "domain" name. The situation is made much worse by having async_domain's running list named "domain" and async_entry's field pointing to async_domain named "running". So, we end up with mix of "running" and "domain" for variable names for async_domain, with the field names of async_domain and async_entry swapped between "running" and "domain". It feels almost intentionally made to be as confusing as possible. Bring some sanity by * Renaming all async_domain variables "domain". * s/async_running/async_dfl_domain/ * s/async_domain->domain/async_domain->running/ * s/async_entry->running/async_entry->domain/ Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <djbw@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * | Merge branch 'master' into for-3.9-asyncTejun Heo2013-01-233468-31304/+74116
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | To receive f56c3196f251012de9b3ebaff55732a9074fdaae ("async: fix __lowest_in_progress()"). Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | async, kmod: warn on synchronous request_module() from async workersTejun Heo2013-01-231-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Synchronous requet_module() from an async worker can lead to deadlock because module init path may invoke async_synchronize_full(). The async worker waits for request_module() to complete and the module loading waits for the async task to finish. This bug happened in the block layer because of default elevator auto-loading. Block layer has been updated not to do default elevator auto-loading and it has been decided to disallow synchronous request_module() from async workers. Trigger WARN_ON_ONCE() on synchronous request_module() from async workers. For more details, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1420814 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alex Riesen <raa.lkml@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com>
| * | | block: don't request module during elevator initTejun Heo2013-01-231-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer allows selecting an elevator which is built as a module to be selected as system default via kernel param "elevator=". This is achieved by automatically invoking request_module() whenever a new block device is initialized and the elevator is not available. This led to an interesting deadlock problem involving async and module init. Block device probing running off an async job invokes request_module(). While the module is being loaded, it performs async_synchronize_full() which ends up waiting for the async job which is already waiting for request_module() to finish, leading to deadlock. Invoking request_module() from deep in block device init path is already nasty in itself. It seems best to avoid these situations from the beginning by moving on-demand module loading out of block init path. The previous patch made sure that the default elevator module is loaded early during boot if available. This patch removes on-demand loading of the default elevator from elevator init path. As the module would have been loaded during boot, userland-visible behavior difference should be minimal. For more details, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1420814 v2: The bool parameter was named @request_module which conflicted with request_module(). This built okay w/ CONFIG_MODULES because request_module() was defined as a macro. W/o CONFIG_MODULES, it causes build breakage. Rename the parameter to @try_loading. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alex Riesen <raa.lkml@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com>