diff options
author | Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> | 2018-04-23 06:59:27 +0200 |
---|---|---|
committer | Michael Ellerman <mpe@ellerman.id.au> | 2018-04-24 05:54:51 +0200 |
commit | 75ecfb49516c53da00c57b9efe48fa3f5504a791 (patch) | |
tree | 98befc2c2ea59883426c1db73c029c41dadfff84 | |
parent | powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address... (diff) | |
download | linux-75ecfb49516c53da00c57b9efe48fa3f5504a791.tar.xz linux-75ecfb49516c53da00c57b9efe48fa3f5504a791.zip |
powerpc/mce: Fix a bug where mce loops on memory UE.
The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful
extraction of physical address it wrongly sets "handled = 1" which
means this UE error has been recovered. Since MCE handler gets return
value as handled = 1, it assumes that error has been recovered and
goes back to same NIP. This causes MCE interrupt again and again in a
loop leading to hard lockup.
Also, initialize phys_addr to ULONG_MAX so that we don't end up
queuing undesired page to hwpoison.
Without this patch we see:
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
...
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
...
Watchdog CPU:38 Hard LOCKUP
After this patch we see:
Severe Machine check interrupt [Not recovered]
NIP: [00007fffaae585f4] PID: 7168 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffaafe28ac
Physical address: 00002017c0bd0000
find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-rw-r--r-- | arch/powerpc/kernel/mce_power.c | 7 |
1 files changed, 2 insertions, 5 deletions
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index fe6fc63251fe..38c5b4764bfe 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -441,7 +441,6 @@ static int mce_handle_ierror(struct pt_regs *regs, if (pfn != ULONG_MAX) { *phys_addr = (pfn << PAGE_SHIFT); - handled = 1; } } } @@ -532,9 +531,7 @@ static int mce_handle_derror(struct pt_regs *regs, * kernel/exception-64s.h */ if (get_paca()->in_mce < MAX_MCE_DEPTH) - if (!mce_find_instr_ea_and_pfn(regs, addr, - phys_addr)) - handled = 1; + mce_find_instr_ea_and_pfn(regs, addr, phys_addr); } found = 1; } @@ -572,7 +569,7 @@ static long mce_handle_error(struct pt_regs *regs, const struct mce_ierror_table itable[]) { struct mce_error_info mce_err = { 0 }; - uint64_t addr, phys_addr; + uint64_t addr, phys_addr = ULONG_MAX; uint64_t srr1 = regs->msr; long handled; |