From e6617c6ec28a17cf2f90262b835ec05b9b861400 Mon Sep 17 00:00:00 2001 From: "David S. Miller" Date: Thu, 3 Sep 2009 02:35:20 -0700 Subject: sparc64: Kill spurious NMI watchdog triggers by increasing limit to 30 seconds. This is a compromise and a temporary workaround for bootup NMI watchdog triggers some people see with qla2xxx devices present. This happens when, for example: CPU 0 is in the driver init and looping submitting mailbox commands to load the firmware, then waiting for completion. CPU 1 is receiving the device interrupts. CPU 1 is where the NMI watchdog triggers. CPU 0 is submitting mailbox commands fast enough that by the time CPU 1 returns from the device interrupt handler, a new one is pending. This sequence runs for more than 5 seconds. The problematic case is CPU 1's timer interrupt running when the barrage of device interrupts begin. Then we have: timer interrupt return for softirq checking pending, thus enable interrupts qla2xxx interrupt return qla2xxx interrupt return ... 5+ seconds pass final qla2xxx interrupt for fw load return run timer softirq return At some point in the multi-second qla2xxx interrupt storm we trigger the NMI watchdog on CPU 1 from the NMI interrupt handler. The timer softirq, once we get back to running it, is smart enough to run the timer work enough times to make up for the missed timer interrupts. However, the NMI watchdogs (both x86 and sparc) use the timer interrupt count to notice the cpu is wedged. But in the above scenerio we'll receive only one such timer interrupt even if we last all the way back to running the timer softirq. The default watchdog trigger point is only 5 seconds, which is pretty low (the softwatchdog triggers at 60 seconds). So increase it to 30 seconds for now. Signed-off-by: David S. Miller --- arch/sparc/kernel/nmi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c index 2c0cc72d295b..b75bf502cd42 100644 --- a/arch/sparc/kernel/nmi.c +++ b/arch/sparc/kernel/nmi.c @@ -103,7 +103,7 @@ notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs) } if (!touched && __get_cpu_var(last_irq_sum) == sum) { local_inc(&__get_cpu_var(alert_counter)); - if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz) + if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz) die_nmi("BUG: NMI Watchdog detected LOCKUP", regs, panic_on_timeout); } else { -- cgit v1.2.3 From bd4352cadfacb9084c97c853b025fac010266c26 Mon Sep 17 00:00:00 2001 From: "David S. Miller" Date: Fri, 4 Sep 2009 03:38:54 -0700 Subject: sparc64: Fix bootup with mcount in some configs. Functions invoked early when booting up a cpu can't use tracing because mcount requires a valid 'current_thread_info()' and TLB mappings to be setup. The code path of sun4v_register_mondo_queues --> register_one_mondo is one such case. sun4v_register_mondo_queues already has the necessary 'notrace' annotation, but register_one_mondo does not. Normally register_one_mondo is inlined so the bug doesn't trigger, but with some config/compiler combinations, it won't be so we must properly mark it notrace. While we're here, add 'notrace' annoations to prom_printf and prom_halt so that early error handling won't have the same problem. Reported-by: Alexander Beregalov Reported-by: Leif Sawyer Signed-off-by: David S. Miller --- arch/sparc/kernel/irq_64.c | 2 +- arch/sparc/prom/misc_64.c | 2 +- arch/sparc/prom/printf.c | 7 +++---- 3 files changed, 5 insertions(+), 6 deletions(-) (limited to 'arch') diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c index f0ee79055409..8daab33fc17d 100644 --- a/arch/sparc/kernel/irq_64.c +++ b/arch/sparc/kernel/irq_64.c @@ -886,7 +886,7 @@ void notrace init_irqwork_curcpu(void) * Therefore you cannot make any OBP calls, not even prom_printf, * from these two routines. */ -static void __cpuinit register_one_mondo(unsigned long paddr, unsigned long type, unsigned long qmask) +static void __cpuinit notrace register_one_mondo(unsigned long paddr, unsigned long type, unsigned long qmask) { unsigned long num_entries = (qmask + 1) / 64; unsigned long status; diff --git a/arch/sparc/prom/misc_64.c b/arch/sparc/prom/misc_64.c index eedffb4fec2d..39fc6af21b7c 100644 --- a/arch/sparc/prom/misc_64.c +++ b/arch/sparc/prom/misc_64.c @@ -88,7 +88,7 @@ void prom_cmdline(void) /* Drop into the prom, but completely terminate the program. * No chance of continuing. */ -void prom_halt(void) +void notrace prom_halt(void) { #ifdef CONFIG_SUN_LDOMS if (ldom_domaining_enabled) diff --git a/arch/sparc/prom/printf.c b/arch/sparc/prom/printf.c index 660943ee4c2a..ca869266b9f3 100644 --- a/arch/sparc/prom/printf.c +++ b/arch/sparc/prom/printf.c @@ -14,14 +14,14 @@ */ #include +#include #include #include static char ppbuf[1024]; -void -prom_write(const char *buf, unsigned int n) +void notrace prom_write(const char *buf, unsigned int n) { char ch; @@ -33,8 +33,7 @@ prom_write(const char *buf, unsigned int n) } } -void -prom_printf(const char *fmt, ...) +void notrace prom_printf(const char *fmt, ...) { va_list args; int i; -- cgit v1.2.3