summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kernel/entry_64.S (follow)
Commit message (Collapse)AuthorAgeFilesLines
* powerpc: Add code to handle soft-disabled doorbells on serverIan Munsie2013-01-101-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the logic to properly handle doorbells that come in when interrupts have been soft disabled and to replay them when interrupts are re-enabled: - masked_##_H##interrupt is modified to leave interrupts enabled when a doorbell has come in since doorbells are edge sensitive and as such won't be automatically re-raised. - __check_irq_replay now tests if a doorbell happened on book3s, and returns either 0xe80 or 0xa00 depending on whether we are the hypervisor or not. - restore_check_irq_replay now tests for the two possible server doorbell vector numbers to replay. - __replay_interrupt also adds tests for the two server doorbell vector numbers, and is modified to use a compare instruction rather than an andi. on the single bit difference between 0x500 and 0x900. The last two use a CPU feature section to avoid needlessly testing against the hypervisor vector if it is not the hypervisor, and vice versa. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge branch 'next' of ↵Linus Torvalds2012-12-181-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc update from Benjamin Herrenschmidt: "The main highlight is probably some base POWER8 support. There's more to come such as transactional memory support but that will wait for the next one. Overall it's pretty quiet, or rather I've been pretty poor at picking things up from patchwork and reviewing them this time around and Kumar no better on the FSL side it seems..." * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits) powerpc+of: Rename and fix OF reconfig notifier error inject module powerpc: mpc5200: Add a3m071 board support powerpc/512x: don't compile any platform DIU code if the DIU is not enabled powerpc/mpc52xx: use module_platform_driver macro powerpc+of: Export of_reconfig_notifier_[register,unregister] powerpc/dma/raidengine: add raidengine device powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct powerpc/mpc85xx: Change spin table to cached memory powerpc/fsl-pci: Add PCI controller ATMU PM support powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS] powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers powerpc: Disable relocation on exceptions when kexecing powerpc: Enable relocation on during exceptions at boot powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function powerpc: Add wrappers to enable/disable relocation on exceptions powerpc: Add set_mode hcall powerpc: Setup relocation on exceptions for bare metal systems powerpc: Move initial mfspr LPCR out of __init_LPCR powerpc: Add relocation on exception vector handlers ...
| * powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning !Li Zhong2012-11-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch tries to fix the following BUG report: [ 0.012313] BUG: MAX_STACK_TRACE_ENTRIES too low! [ 0.012318] turning off the locking correctness validator. [ 0.012321] Call Trace: [ 0.012330] [c00000017666f6d0] [c000000000012128] .show_stack+0x78/0x184 (unreliable) [ 0.012339] [c00000017666f780] [c0000000000b6348] .save_trace+0x12c/0x14c [ 0.012345] [c00000017666f800] [c0000000000b7448] .mark_lock+0x2bc/0x710 [ 0.012351] [c00000017666f8b0] [c0000000000bb198] .__lock_acquire+0x748/0xaec [ 0.012357] [c00000017666f9b0] [c0000000000bb684] .lock_acquire+0x148/0x194 [ 0.012365] [c00000017666fa80] [c00000000069371c] .mutex_lock_nested+0x84/0x4ec [ 0.012372] [c00000017666fb90] [c000000000096998] .smpboot_register_percpu_thread+0x3c/0x10c [ 0.012380] [c00000017666fc30] [c0000000009ba910] .spawn_ksoftirqd+0x28/0x48 [ 0.012386] [c00000017666fcb0] [c00000000000a98c] .do_one_initcall+0xd8/0x1d0 [ 0.012392] [c00000017666fd60] [c00000000000b1f8] .kernel_init+0x120/0x398 [ 0.012398] [c00000017666fe30] [c000000000009ad4] .ret_from_kernel_thread+0x5c/0x64 [ 0.012404] [c00000017666fa00] [c00000017666fb20] 0xc00000017666fb20 [ 0.012410] [c00000017666fa80] [c00000000069371c] .mutex_lock_nested+0x84/0x4ec [ 0.012416] [c00000017666fb90] [c000000000096998] .smpboot_register_percpu_thread+0x3c/0x10c [ 0.012422] [c00000017666fc30] [c0000000009ba910] .spawn_ksoftirqd+0x28/0x48 [ 0.012427] [c00000017666fcb0] [c00000000000a98c] .do_one_initcall+0xd8/0x1d0 [ 0.012433] [c00000017666fd60] [c00000000000b1f8] .kernel_init+0x120/0x398 [ 0.012439] [c00000017666fe30] [c000000000009ad4] .ret_from_kernel_thread+0x5c/0x64 ....... The reason is that the back chain of c00000017666fe30 (ret_from_kernel_thread) contains some invalid value, which might form a loop. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: take dereferencing to ret_from_kernel_thread()Al Viro2012-10-221-0/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | powerpc: don't mess with r2 in copy_thread() and friendsAl Viro2012-10-151-1/+0
| | | | | | | | | | | | | | | | kernel_thread() callbacks are *not* in modules and are not going to be there. And it's not even read in ppc32 ret_from_kernel_thread(), so no need to bother with it there either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | powerpc: switch to saner kernel_execve() semanticsAl Viro2012-10-151-6/+0
|/ | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-10-121-0/+16
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull pile 2 of execve and kernel_thread unification work from Al Viro: "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for several more architectures plus assorted signal fixes and cleanups. There'll be more (in particular, real fixes for the alpha do_notify_resume() irq mess)..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits) alpha: don't open-code trace_report_syscall_{enter,exit} Uninclude linux/freezer.h m32r: trim masks avr32: trim masks tile: don't bother with SIGTRAP in setup_frame microblaze: don't bother with SIGTRAP in setup_rt_frame() mn10300: don't bother with SIGTRAP in setup_frame() frv: no need to raise SIGTRAP in setup_frame() x86: get rid of duplicate code in case of CONFIG_VM86 unicore32: remove pointless test h8300: trim _TIF_WORK_MASK parisc: decide whether to go to slow path (tracesys) based on thread flags parisc: don't bother looping in do_signal() parisc: fix double restarts bury the rest of TIF_IRET sanitize tsk_is_polling() bury _TIF_RESTORE_SIGMASK unicore32: unobfuscate _TIF_WORK_MASK mips: NOTIFY_RESUME is not needed in TIF masks mips: merge the identical "return from syscall" per-ABI code ... Conflicts: arch/arm/include/asm/thread_info.h
| * powerpc: switch to generic sys_execve()/kernel_execve()Al Viro2012-10-011-0/+6
| | | | | | | | | | | | | | | | the only non-obvious part is that current_pt_regs() is really needed here - task_pt_regs() is NULL for kernel threads; it's OK for ptrace uses (the thing task_pt_regs() is intended for), but not for us. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * powerpc: split ret_from_forkAl Viro2012-10-011-0/+10
| | | | | | | | | | | | ... and get rid of in-kernel syscalls in kernel_thread() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | powerpc/kprobe: Complete kprobe and migrate exception frameTiejun Chen2012-09-181-0/+35
|/ | | | | | | | | | | | | We can't emulate stwu since that may corrupt current exception stack. So we will have to do real store operation in the exception return code. Firstly we'll allocate a trampoline exception frame below the kprobed function stack and copy the current exception frame to the trampoline. Then we can do this real store operation to implement 'stwu', and reroute the trampoline frame to r1 to complete this exception migration. Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Restore correct DSCR in context switchAnton Blanchard2012-09-051-6/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | During a context switch we always restore the per thread DSCR value. If we aren't doing explicit DSCR management (ie thread.dscr_inherit == 0) and the default DSCR changed while the process has been sleeping we end up with the wrong value. Check thread.dscr_inherit and select the default DSCR or per thread DSCR as required. This was found with the following test case, when running with more threads than CPUs (ie forcing context switching): http://ozlabs.org/~anton/junkcode/dscr_default_test.c With the four patches applied I can run a combination of all test cases successfully at the same time: http://ozlabs.org/~anton/junkcode/dscr_default_test.c http://ozlabs.org/~anton/junkcode/dscr_explicit_test.c http://ozlabs.org/~anton/junkcode/dscr_inherit_test.c Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> # 3.0+ Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Use CURRENT_THREAD_INFO instead of open coded assemblyStuart Yoder2012-07-111-6/+6
| | | | | Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Clear RI and EE at the same time in system call exitAnton Blanchard2012-07-031-12/+13
| | | | | | | mtmsrd is an expensive instruction, we save a few cycles by doing it once instead of twice. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* ppc64: fix missing to check all bits of _TIF_USER_WORK_MASK in preemptTiejun Chen2012-06-291-57/+40
| | | | | | | | | | | | | | | | | | | | | | | In entry_64.S version of ret_from_except_lite, you'll notice that in the !preempt case, after we've checked MSR_PR we test for any TIF flag in _TIF_USER_WORK_MASK to decide whether to go to do_work or not. However, in the preempt case, we do a convoluted trick to test SIGPENDING only if PR was set and always test NEED_RESCHED ... but we forget to test any other bit of _TIF_USER_WORK_MASK !!! So that means that with preempt, we completely fail to test for things like single step, syscall tracing, etc... This should be fixed as the following path: - Test PR. If not set, go to resume_kernel, else continue. - If go resume_kernel, to do that original do_work. - If else, then always test for _TIF_USER_WORK_MASK to decide to do that original user_work, else restore directly. Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge branch 'merge' into nextBenjamin Herrenschmidt2012-05-141-13/+31
|\ | | | | | | We want the irq fixes from the "merge" branch.
| * powerpc/irq: Fix another case of lazy IRQ state getting out of syncBenjamin Herrenschmidt2012-05-121-13/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So we have another case of paca->irq_happened getting out of sync with the HW irq state. This can happen when a perfmon interrupt occurs while soft disabled, as it will return to a soft disabled but hard enabled context while leaving a stale PACA_IRQ_HARD_DIS flag set. This patch fixes it, and also adds a test for the condition of those flags being out of sync in arch_local_irq_restore() when CONFIG_TRACE_IRQFLAGS is enabled. This helps catching those gremlins faster (and so far I can't seem see any anymore, so that's good news). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | Merge branch 'merge' into nextBenjamin Herrenschmidt2012-05-091-18/+0
|\|
| * powerpc/irq: Fix bug with new lazy IRQ handling codeBenjamin Herrenschmidt2012-05-091-18/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | We had a case where we could turn on hard interrupts while leaving the PACA_IRQ_HARD_DIS bit set in the PACA. This can in turn cause a BUG_ON() to hit in __check_irq_replay() due to interrupt state getting out of sync. The assembly code was also way too convoluted. Instead, we now leave it to the C code to do the right thing which ends up being smaller and more readable. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: Better scheduling of CR save code in system call pathAnton Blanchard2012-04-301-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment system call entry looks like: crclr so ... mfcr r9 ... std r9,_CCR(r1) commit bd19c8994a82 ([POWERPC] system call micro optimisation) put some space between the crclr and mfcr in order to avoid a stall. There is still a stall seen between the mfcr and std. We can avoid the crclr by doing it in a GPR with rlwinm which gives us more room to better schedule the sequence. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: No need to preserve count register across system callAnton Blanchard2012-04-301-2/+1
| | | | | | | | | | | | | | | | The count register is volatile so we don't need to preserve it. Store zero to the entry in the exception frame. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: No need to save XER in a system callAnton Blanchard2012-04-301-2/+1
| | | | | | | | | | | | | | | | | | | | | | The XER is a volatile register so there is no need to save and restore it over a system call - zero it out in the exception stack frame instead. This should fix a 5 cycle stall of the mfxer/std seen on POWER7. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: Hide some system call labels from profile toolsAnton Blanchard2012-04-301-4/+4
|/ | | | | | | | syscall_dotrace_cont and syscall_error_cont tend to complicate perf output so make them local. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Rework lazy-interrupt handlingBenjamin Herrenschmidt2012-03-091-34/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
* powerpc: Replace mfmsr instructions with load from PACA kernel_msr fieldBenjamin Herrenschmidt2012-03-091-9/+5
| | | | | | | | On 64-bit, the mfmsr instruction can be quite slow, slower than loading a field from the cache-hot PACA, which happens to already contain the value we want in most cases. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Improve 64-bit syscall entry/exitBenjamin Herrenschmidt2012-03-091-20/+23
| | | | | | | | | | | | | | | | We unconditionally hard enable interrupts. This is unnecessary as syscalls are expected to always be called with interrupts enabled. While at it, we add a WARN_ON if that is not the case and CONFIG_TRACE_IRQFLAGS is enabled (we don't want to add overhead to the fast path when this is not set though). Thus let's remove the enabling (and associated irq tracing) from the syscall entry path. Also on Book3S, replace a few mfmsr instructions with loads of PACAMSR from the PACA, which should be faster & schedule better. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Remove legacy iSeries bits from assembly filesBenjamin Herrenschmidt2012-03-091-41/+1
| | | | | | | | | This removes the various bits of assembly in the kernel entry, exception handling and SLB management code that were specific to running under the legacy iSeries hypervisor which is no longer supported. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Fix various issues with return to userspaceBenjamin Herrenschmidt2012-02-221-1/+5
| | | | | | | | | | | | | | | | | | | | We have a few problems when returning to userspace. This is a quick set of fixes for 3.3, I'll look into a more comprehensive rework for 3.4. This fixes: - We kept interrupts soft-disabled when schedule'ing or calling do_signal when returning to userspace as a result of a hardware interrupt. - Rename do_signal to do_notify_resume like all other archs (and do_signal_pending back to do_signal, which it was before Roland changed it). - Add the missing call to key_replace_session_keyring() to do_notify_resume(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> ---
* powerpc: Free up some CPU feature bits by moving out MMU-related featuresMatt Evans2011-04-271-4/+4
| | | | | | | | | Some of the 64bit PPC CPU features are MMU-related, so this patch moves them to MMU_FTR_ bits. All cpu_has_feature()-style tests are moved to mmu_has_feature(), and seven feature bits are freed as a result. Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Per process DSCR + some fixes (try#4)Alexey Kardashevskiy2011-04-271-0/+15
| | | | | | | | | | | | | | | | | | The DSCR (aka Data Stream Control Register) is supported on some server PowerPC chips and allow some control over the prefetch of data streams. This patch allows the value to be specified per thread by emulating the corresponding mfspr and mtspr instructions. Children of such threads inherit the value. Other threads use a default value that can be specified in sysfs - /sys/devices/system/cpu/dscr_default. If a thread starts with non default value in the sysfs entry, all children threads inherit this non default value even if the sysfs value is changed later. Signed-off-by: Alexey Kardashevskiy <aik@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: In HV mode, use HSPRG0 for PACABenjamin Herrenschmidt2011-04-201-2/+2
| | | | | | | | | When running in Hypervisor mode (arch 2.06 or later), we store the PACA in HSPRG0 instead of SPRG1. The architecture specifies that SPRGs may be lost during a "nap" power management operation (though they aren't currently on POWER7) and this enables use of SPRG1 by KVM guests. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Account time using timebase rather than PURRPaul Mackerras2010-09-021-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the PURR register for measuring the user and system time used by processes, as well as other related times such as hardirq and softirq times. This turns out to be quite confusing for users because it means that a program will often be measured as taking less time when run on a multi-threaded processor (SMT2 or SMT4 mode) than it does when run on a single-threaded processor (ST mode), even though the program takes longer to finish. The discrepancy is accounted for as stolen time, which is also confusing, particularly when there are no other partitions running. This changes the accounting to use the timebase instead, meaning that the reported user and system times are the actual number of real-time seconds that the program was executing on the processor thread, regardless of which SMT mode the processor is in. Thus a program will generally show greater user and system times when run on a multi-threaded processor than on a single-threaded processor. On pSeries systems on POWER5 or later processors, we measure the stolen time (time when this partition wasn't running) using the hypervisor dispatch trace log. We check for new entries in the log on every entry from user mode and on every transition from kernel process context to soft or hard IRQ context (i.e. when account_system_vtime() gets called). So that we can correctly distinguish time stolen from user time and time stolen from system time, without having to check the log on every exit to user mode, we store separate timestamps for exit to user mode and entry from user mode. On systems that have a SPURR (POWER6 and POWER7), we read the SPURR in account_system_vtime() (as before), and then apportion the SPURR ticks since the last time we read it between scaled user time and scaled system time according to the relative proportions of user time and system time over the same interval. This avoids having to read the SPURR on every kernel entry and exit. On systems that have PURR but not SPURR (i.e., POWER5), we do the same using the PURR rather than the SPURR. This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl for now since it conflicts with the use of the dispatch trace log by the time accounting code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Feature nop out reservation clear when stcx checks addressAnton Blanchard2010-09-021-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The POWER architecture does not require stcx to check that it is operating on the same address as the larx. This means it is possible for an an exception handler to execute a larx, get a reservation, decide not to do the stcx and then return back with an active reservation. If the interrupted code was in the middle of a larx/stcx sequence the stcx could incorrectly succeed. All recent POWER CPUs check the address before letting the stcx succeed so we can create a CPU feature and nop it out. As Ben suggested, we can only do this in our syscall path because there is a remote possibility some kernel code gets interrupted by an exception that ends up operating on the same cacheline. Thanks to Paul Mackerras and Derek Williams for the idea. To test this I used a very simple null syscall (actually getppid) testcase at http://ozlabs.org/~anton/junkcode/null_syscall.c I tested against 2.6.35-git10 with the following changes against the pseries_defconfig: CONFIG_VIRT_CPU_ACCOUNTING=n CONFIG_AUDIT=n CONFIG_PPC_4K_PAGES=n CONFIG_PPC_64K_PAGES=y CONFIG_FORCE_MAX_ZONEORDER=9 CONFIG_PPC_SUBPAGE_PROT=n CONFIG_FUNCTION_TRACER=n CONFIG_FUNCTION_GRAPH_TRACER=n CONFIG_IRQSOFF_TRACER=n CONFIG_STACK_TRACER=n to remove the overhead of virtual CPU accounting, syscall auditing and the ftrace mcount tracers. 64kB pages were enabled to minimise TLB misses. POWER6: +8.2% POWER7: +7.0% Another suggestion was to use a larx to something in the L1 instead of a stcx. This was almost as fast as removing the larx on POWER6, but only 3.5% faster on POWER7. We can use this to speed up the reservation clear in our exception exit code. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/perf_event: Fix oops due to perf_event_do_pending callPaul Mackerras2010-05-121-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Anton Blanchard found that large POWER systems would occasionally crash in the exception exit path when profiling with perf_events. The symptom was that an interrupt would occur late in the exit path when the MSR[RI] (recoverable interrupt) bit was clear. Interrupts should be hard-disabled at this point but they were enabled. Because the interrupt was not recoverable the system panicked. The reason is that the exception exit path was calling perf_event_do_pending after hard-disabling interrupts, and perf_event_do_pending will re-enable interrupts. The simplest and cleanest fix for this is to use the same mechanism that 32-bit powerpc does, namely to cause a self-IPI by setting the decrementer to 1. This means we can remove the tests in the exception exit path and raw_local_irq_restore. This also makes sure that the call to perf_event_do_pending from timer_interrupt() happens within irq_enter/irq_exit. (Note that calling perf_event_do_pending from timer_interrupt does not mean that there is a possible 1/HZ latency; setting the decrementer to 1 ensures that the timer interrupt will happen immediately, i.e. within one timebase tick, which is a few nanoseconds or 10s of nanoseconds.) Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Clear MSR_RI during RTAS callsAnton Blanchard2010-02-091-2/+1
| | | | | | | | | | RTAS should never cause an exception but if it does (for example accessing outside our RMO) then we might go a long way through the kernel before oopsing. If we unset MSR_RI we should at least stop things on exception exit. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge branches 'perf/powerpc' and 'perf/bench' into perf/coreIngo Molnar2009-11-151-2/+2
|\ | | | | | | | | | | | | Merge reason: Both 'perf bench' and the pending PowerPC changes are now ready for the next merge window. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * powerpc: perf_event: Hide iseries_check_pending_irqsAnton Blanchard2009-10-281-2/+2
| | | | | | | | | | | | | | | | | | | | If CONFIG_PPC_ISERIES isn't defined we end up with iseries_check_pending_irqs and do_work at the same address. perf ends up picking iseries_check_pending_irqs which creates confusing backtraces. Hide it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
* | powerpc/ppc64: Use preempt_schedule_irq instead of preempt_scheduleBenjamin Herrenschmidt2009-10-271-20/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on an original patch by Valentine Barshak <vbarshak@ru.mvista.com> Use preempt_schedule_irq to prevent infinite irq-entry and eventual stack overflow problems with fast-paced IRQ sources. This kind of problems has been observed on the PASemi Electra IDE controller. We have to make sure we are soft-disabled before calling preempt_schedule_irq and hard disable interrupts after that to avoid unrecoverable exceptions. This patch also moves the "clrrdi r9,r1,THREAD_SHIFT" out of the #ifdef CONFIG_PPC_BOOK3E scope, since r9 is clobbered and has to be restored in both cases. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc64/ftrace: use PACA to retrieve TOC in mod_return_to_handlerSteven Rostedt2009-10-131-2/+1
|/ | | | | | | | | | | | | The mod_return_to_handler needs to switch to the kernel TOC before jumping to a the kernel code. It currently does this by looking at the kernel function data and retrieves the TOC that way. Not only is this inefficient, it also breaks with a relocatable kernel. The PACA contains the kernel TOC and we can easily retrieve it that way. Reported-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* perf: Do the big rename: Performance Counters -> Performance EventsIngo Molnar2009-09-211-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bye-bye Performance Counters, welcome Performance Events! In the past few months the perfcounters subsystem has grown out its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility. Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate. All in one, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion) The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well. Thanks goes to Stephane Eranian who first observed this misnomer and suggested a rename. User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.) This patch has been generated via the following script: FILES=$(find * -type f | grep -vE 'oprofile|[^K]config') sed -i \ -e 's/PERF_EVENT_/PERF_RECORD_/g' \ -e 's/PERF_COUNTER/PERF_EVENT/g' \ -e 's/perf_counter/perf_event/g' \ -e 's/nb_counters/nb_events/g' \ -e 's/swcounter/swevent/g' \ -e 's/tpcounter_event/tp_event/g' \ $FILES for N in $(find . -name perf_counter.[ch]); do M=$(echo $N | sed 's/perf_counter/perf_event/g') mv $N $M done FILES=$(find . -name perf_event.*) sed -i \ -e 's/COUNTER_MASK/REG_MASK/g' \ -e 's/COUNTER/EVENT/g' \ -e 's/\<event\>/event_id/g' \ -e 's/counter/event/g' \ -e 's/Counter/Event/g' \ $FILES ... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window. Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch. ( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. ) Suggested-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* powerpc: Remaining 64-bit Book3E supportBenjamin Herrenschmidt2009-08-201-5/+55
| | | | | | | | This contains all the bits that didn't fit in previous patches :-) This includes the actual exception handlers assembly, the changes to the kernel entry, other misc bits and wiring it all up in Kconfig. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/of: Remove useless register save/restore when calling OF backBenjamin Herrenschmidt2009-08-201-32/+6
| | | | | | | | | enter_prom() used to save and restore registers such as CTR, XER etc.. which are volatile, or SRR0,1... which we don't care about. This removes a bunch of useless code and while at it turns an mtmsrd into an MTMSRD macro which will be useful to Book3E. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Use names rather than numbers for SPRGs (v2)Benjamin Herrenschmidt2009-08-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | The kernel uses SPRG registers for various purposes, typically in low level assembly code as scratch registers or to hold per-cpu global infos such as the PACA or the current thread_info pointer. We want to be able to easily shuffle the usage of those registers as some implementations have specific constraints realted to some of them, for example, some have userspace readable aliases, etc.. and the current choice isn't always the best. This patch should not change any code generation, and replaces the usage of SPRN_SPRGn everywhere in the kernel with a named replacement and adds documentation next to the definition of the names as to what those are used for on each processor family. The only parts that still use the original numbers are bits of KVM or suspend/resume code that just blindly needs to save/restore all the SPRGs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge branch 'linus' into perfcounters/core-v2Ingo Molnar2009-04-061-3/+86
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge reason: we have gathered quite a few conflicts, need to merge upstream Conflicts: arch/powerpc/kernel/Makefile arch/x86/ia32/ia32entry.S arch/x86/include/asm/hardirq.h arch/x86/include/asm/unistd_32.h arch/x86/include/asm/unistd_64.h arch/x86/kernel/cpu/common.c arch/x86/kernel/irq.c arch/x86/kernel/syscall_table_32.S arch/x86/mm/iomap_32.c include/linux/sched.h kernel/Makefile Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * powerpc64, ftrace: save toc only on modules for function graphSteven Rostedt2009-02-231-1/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TOCS used by modules are different than the one used by the core kernel code. The function graph tracer must save and restore the TOC whenever it traces a module call. But this is an added overhead to burden the majority of core kernel code being traced. Benjamin Herrenschmidt suggested in testing the entry of the call to tell if it is a core kernel function or a module. He recommended using the REGION_ID() macro to perform this test. This patch implements Benjamin's idea, and uses a different return_to_handler routine dependent on if the entry is a core kernel function or not. The module version saves the TOC, where as the core kernel version does not. Geoff Lavand tested on PS3. Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * powerpc64, tracing: add function graph tracer with dynamic tracingSteven Rostedt2009-02-231-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | This is the port of the function graph tracer to PowerPC with dynamic tracing. Geoff Lavand tested on PS3. Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * powerpc64: port of the function graph tracerSteven Rostedt2009-02-231-3/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a port of the function graph tracer that was written by Frederic Weisbecker for the x86. This only works for PPC64 at the moment and only for static tracing. PPC32 and dynamic function graph tracing support will come later. The trace produces a visual calling of functions: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) 2.224 us | } 0) ! 271.024 us | } 0) ! 320.080 us | } 0) ! 324.656 us | } 0) ! 329.136 us | } 0) | .put_prev_task_fair() { 0) | .update_curr() { 0) 2.240 us | .update_min_vruntime(); 0) 6.512 us | } 0) 2.528 us | .__enqueue_entity(); 0) + 15.536 us | } 0) | .pick_next_task_fair() { 0) 2.032 us | .__pick_next_entity(); 0) 2.064 us | .__clear_buddies(); 0) | .set_next_entity() { 0) 2.672 us | .__dequeue_entity(); 0) 6.864 us | } Geoff Lavand tested on PS3. Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: Provide a way to defer perf counter work until interrupts are enabledPaul Mackerras2009-01-091-0/+9
|/ | | | | | | | | | | | | | | | | | | | Because 64-bit powerpc uses lazy (soft) interrupt disabling, it is possible for a performance monitor exception to come in when the kernel thinks interrupts are disabled (i.e. when they are soft-disabled but hard-enabled). In such a situation the performance monitor exception handler might have some processing to do (such as process wakeups) which can't be done in what is effectively an NMI handler. This provides a way to defer that work until interrupts get enabled, either in raw_local_irq_restore() or by returning from an interrupt handler to code that had interrupts enabled. We have a per-processor flag that indicates that there is work pending to do when interrupts subsequently get re-enabled. This flag is checked in the interrupt return path and in raw_local_irq_restore(), and if it is set, perf_counter_do_pending() is called to do the pending work. Signed-off-by: Paul Mackerras <paulus@samba.org>
* Merge commit 'v2.6.28-rc7' into tracing/coreIngo Molnar2008-12-041-1/+7
|\
| * powerpc: Fix system calls on Cell entered with XER.SO=1Paul Mackerras2008-11-301-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that on Cell, on a kernel with CONFIG_VIRT_CPU_ACCOUNTING = y, if a program sets the SO (summary overflow) bit in the XER and then does a system call, the SO bit in CR0 will be set on return regardless of whether the system call detected an error. Since CR0.SO is used as the error indication from the system call, this means that all system calls appear to fail. The reason is that the workaround for the timebase bug on Cell uses a compare instruction. With CONFIG_VIRT_CPU_ACCOUNTING = y, the ACCOUNT_CPU_USER_ENTRY macro reads the timebase, so we end up doing a compare instruction, which copies XER.SO to CR0.SO. Since we were doing this in the system call entry patch after clearing CR0.SO but before saving the CR, this meant that the saved CR image had CR0.SO set if XER.SO was set on entry. This fixes it by moving the clearing of CR0.SO to after the ACCOUNT_CPU_USER_ENTRY call in the system call entry path. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | powerpc: ftrace, do nothing in mcount call for dyn ftraceSteven Rostedt2008-11-281-12/+0
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: quicken mcount calls that are not replaced by dyn ftrace Dynamic ftrace no longer does on the fly recording of mcount locations. The mcount locations are now found at compile time. The mcount function no longer needs to store registers and call a stub function. It can now just simply return. Since there are some functions that do not get converted to a nop (.init sections and other code that may disappear), this patch should help speed up that code. Also, the stub for mcount on PowerPC 32 can not be a simple branch link register like it is on PowerPC 64. According to the ABI specification: "The _mcount routine is required to restore the link register from the stack so that the profiling code can be inserted transparently, whether or not the profiled function saves the link register itself." This means that we must restore the link register that was used to make the call to mcount. The minimal mcount function for PPC32 ends up being: mcount: mflr r0 mtctr r0 lwz r0, 4(r1) mtlr r0 bctr Where we move the link register used to call mcount into the ctr register, and then restore the link register from the stack. Then we use the ctr register to jump back to the mcount caller. The r0 register is free for us to use. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>