x86/mce: Handle broadcasted MCE gracefully with kexec

When we are about to kexec a crash kernel and right then and there a
broadcasted MCE fires while we're still in the first kernel and while
the other CPUs remain in a holding pattern, the #MC handler of the
first kernel will timeout and then panic due to never completing MCE
synchronization.

Handle this in a similar way as to when the CPUs are offlined when that
broadcasted MCE happens.

[ Boris: rewrote commit message and comments. ]

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: kexec@lists.infradead.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1487857012-9059-1-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170313095019.19351-1-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
author: Xunlei Pang <xlpang@redhat.com> 2017-03-13 10:50:19 +0100
committer: Thomas Gleixner <tglx@linutronix.de> 2017-03-13 20:18:07 +0100
commit: 5bc329503e8191c91c4c40836f062ef771d8ba83 (patch)
tree: 929ecf39268564f1da21b3846fa588a27c152f91 /arch/x86/kernel/cpu/mcheck
parent: Linux 4.11-rc2 (diff)
1 files changed, 16 insertions, 2 deletions
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c607ea..177472ace838 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
 #include <asm/tlbflush.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
+#include <asm/reboot.h>
 
 #include "mce-internal.h"
 
@@ -1127,9 +1128,22 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * on Intel.
 	 */
 	int lmce = 1;
+	int cpu = smp_processor_id();
 
-	/* If this CPU is offline, just bail out. */
-	if (cpu_is_offline(smp_processor_id())) {
+	/*
+	 * Cases where we avoid rendezvous handler timeout:
+	 * 1) If this CPU is offline.
+	 *
+	 * 2) If crashing_cpu was set, e.g. we're entering kdump and we need to
+	 *  skip those CPUs which remain looping in the 1st kernel - see
+	 *  crash_nmi_callback().
+	 *
+	 * Note: there still is a small window between kexec-ing and the new,
+	 * kdump kernel establishing a new #MC handler where a broadcasted MCE
+	 * might not get handled properly.
+	 */
+	if (cpu_is_offline(cpu) ||
+	    (crashing_cpu != -1 && crashing_cpu != cpu)) {
 		u64 mcgstatus;
 
 		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
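
The early-exit test added by this hunk can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel code: `mce_skip_rendezvous()` and `NO_CRASHING_CPU` are hypothetical names introduced here for illustration, and the boolean `cpu_offline` parameter stands in for the result of `cpu_is_offline(cpu)`. It only mirrors the condition in the diff: skip the rendezvous if this CPU is offline, or if a crashing CPU has been recorded and it is not this CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sentinel for "no crash in progress"; mirrors the -1 check in the patch. */
#define NO_CRASHING_CPU (-1)

/*
 * Hypothetical model of the condition in do_machine_check().
 * In the kernel, cpu_offline would be cpu_is_offline(cpu) and
 * crashing_cpu would be the global of the same name.
 */
static bool mce_skip_rendezvous(int cpu, bool cpu_offline, int crashing_cpu)
{
	/* A valid CPU number is never negative. */
	assert(cpu >= 0);

	/* Case 1: an offline CPU never takes part in the rendezvous. */
	if (cpu_offline)
		return true;

	/*
	 * Case 2: a crash is in progress and this is not the crashing CPU,
	 * i.e. this CPU is one of those looping in the first kernel.
	 */
	return crashing_cpu != NO_CRASHING_CPU && crashing_cpu != cpu;
}
```

Note that the crashing CPU itself does not take the early exit, so in this model it still proceeds to the normal handling path; only the other, parked CPUs bail out.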