From 54fd15780526c47fa29a85b066cf69996be59a59 Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Tue, 26 May 2015 10:28:18 +0200 Subject: x86/Documentation: Move kernel-stacks doc one level up ... to Documentation/x86/ as it is going to collect more and not only 64-bit specific info. Signed-off-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Michal Marek Cc: Peter Zijlstra Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/1432628901-18044-16-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar --- Documentation/x86/kernel-stacks | 101 +++++++++++++++++++++++++++++++++ Documentation/x86/x86_64/kernel-stacks | 101 --------------------------------- 2 files changed, 101 insertions(+), 101 deletions(-) create mode 100644 Documentation/x86/kernel-stacks delete mode 100644 Documentation/x86/x86_64/kernel-stacks diff --git a/Documentation/x86/kernel-stacks b/Documentation/x86/kernel-stacks new file mode 100644 index 000000000000..e3c8a49d1a2f --- /dev/null +++ b/Documentation/x86/kernel-stacks @@ -0,0 +1,101 @@ +Most of the text from Keith Owens, hacked by AK + +x86_64 page size (PAGE_SIZE) is 4K. + +Like all other architectures, x86_64 has a kernel stack for every +active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. +These stacks contain useful data as long as a thread is alive or a +zombie. While the thread is in user space the kernel stack is empty +except for the thread_info structure at the bottom. + +In addition to the per thread stacks, there are specialized stacks +associated with each CPU. These stacks are only used while the kernel +is in control on that CPU; when a CPU returns to user space the +specialized stacks contain no useful data. The main CPU stacks are: + +* Interrupt stack. IRQSTACKSIZE + + Used for external hardware interrupts. If this is the first external + hardware interrupt (i.e. not a nested hardware interrupt) then the + kernel switches from the current task to the interrupt stack. Like + the split thread and interrupt stacks on i386, this gives more room + for kernel interrupt processing without having to increase the size + of every per thread stack. + + The interrupt stack is also used when processing a softirq. + +Switching to the kernel interrupt stack is done by software based on a +per CPU interrupt nest counter. This is needed because x86-64 "IST" +hardware stacks cannot nest without races. + +x86_64 also has a feature which is not available on i386, the ability +to automatically switch to a new stack for designated events such as +double fault or NMI, which makes it easier to handle these unusual +events on x86_64. This feature is called the Interrupt Stack Table +(IST). There can be up to 7 IST entries per CPU. The IST code is an +index into the Task State Segment (TSS). The IST entries in the TSS +point to dedicated stacks; each stack can be a different size. + +An IST is selected by a non-zero value in the IST field of an +interrupt-gate descriptor. When an interrupt occurs and the hardware +loads such a descriptor, the hardware automatically sets the new stack +pointer based on the IST value, then invokes the interrupt handler. If +the interrupt came from user mode, then the interrupt handler prologue +will switch back to the per-thread stack. If software wants to allow +nested IST interrupts then the handler must adjust the IST values on +entry to and exit from the interrupt handler. (This is occasionally +done, e.g. for debug exceptions.) + +Events with different IST codes (i.e. with different stacks) can be +nested. For example, a debug interrupt can safely be interrupted by an +NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack +pointers on entry to and exit from all IST events, in theory allowing +IST events with the same code to be nested. However in most cases, the +stack size allocated to an IST assumes no nesting for the same code. +If that assumption is ever broken then the stacks will become corrupt. + +The currently assigned IST stacks are :- + +* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 12 - Stack Fault Exception (#SS). + + This allows the CPU to recover from invalid stack segments. Rarely + happens. + +* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 8 - Double Fault Exception (#DF). + + Invoked when handling one exception causes another exception. Happens + when the kernel is very confused (e.g. kernel stack pointer corrupt). + Using a separate stack allows the kernel to recover from it well enough + in many cases to still output an oops. + +* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for non-maskable interrupts (NMI). + + NMI can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for NMI events avoids making + assumptions about the previous state of the kernel stack. + +* DEBUG_STACK. DEBUG_STKSZ + + Used for hardware debug interrupts (interrupt 1) and for software + debug interrupts (INT3). + + When debugging a kernel, debug interrupts (both hardware and + software) can occur at any time. Using IST for these interrupts + avoids making assumptions about the previous state of the kernel + stack. + +* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 18 - Machine Check Exception (#MC). + + MCE can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for MCE events avoids making + assumptions about the previous state of the kernel stack. + +For more details see the Intel IA32 or AMD AMD64 architecture manuals. diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks deleted file mode 100644 index e3c8a49d1a2f..000000000000 --- a/Documentation/x86/x86_64/kernel-stacks +++ /dev/null @@ -1,101 +0,0 @@ -Most of the text from Keith Owens, hacked by AK - -x86_64 page size (PAGE_SIZE) is 4K. - -Like all other architectures, x86_64 has a kernel stack for every -active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. -These stacks contain useful data as long as a thread is alive or a -zombie. While the thread is in user space the kernel stack is empty -except for the thread_info structure at the bottom. - -In addition to the per thread stacks, there are specialized stacks -associated with each CPU. These stacks are only used while the kernel -is in control on that CPU; when a CPU returns to user space the -specialized stacks contain no useful data. The main CPU stacks are: - -* Interrupt stack. IRQSTACKSIZE - - Used for external hardware interrupts. If this is the first external - hardware interrupt (i.e. not a nested hardware interrupt) then the - kernel switches from the current task to the interrupt stack. Like - the split thread and interrupt stacks on i386, this gives more room - for kernel interrupt processing without having to increase the size - of every per thread stack. - - The interrupt stack is also used when processing a softirq. - -Switching to the kernel interrupt stack is done by software based on a -per CPU interrupt nest counter. This is needed because x86-64 "IST" -hardware stacks cannot nest without races. - -x86_64 also has a feature which is not available on i386, the ability -to automatically switch to a new stack for designated events such as -double fault or NMI, which makes it easier to handle these unusual -events on x86_64. This feature is called the Interrupt Stack Table -(IST). There can be up to 7 IST entries per CPU. The IST code is an -index into the Task State Segment (TSS). The IST entries in the TSS -point to dedicated stacks; each stack can be a different size. - -An IST is selected by a non-zero value in the IST field of an -interrupt-gate descriptor. When an interrupt occurs and the hardware -loads such a descriptor, the hardware automatically sets the new stack -pointer based on the IST value, then invokes the interrupt handler. If -the interrupt came from user mode, then the interrupt handler prologue -will switch back to the per-thread stack. If software wants to allow -nested IST interrupts then the handler must adjust the IST values on -entry to and exit from the interrupt handler. (This is occasionally -done, e.g. for debug exceptions.) - -Events with different IST codes (i.e. with different stacks) can be -nested. For example, a debug interrupt can safely be interrupted by an -NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack -pointers on entry to and exit from all IST events, in theory allowing -IST events with the same code to be nested. However in most cases, the -stack size allocated to an IST assumes no nesting for the same code. -If that assumption is ever broken then the stacks will become corrupt. - -The currently assigned IST stacks are :- - -* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 12 - Stack Fault Exception (#SS). - - This allows the CPU to recover from invalid stack segments. Rarely - happens. - -* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 8 - Double Fault Exception (#DF). - - Invoked when handling one exception causes another exception. Happens - when the kernel is very confused (e.g. kernel stack pointer corrupt). - Using a separate stack allows the kernel to recover from it well enough - in many cases to still output an oops. - -* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for non-maskable interrupts (NMI). - - NMI can be delivered at any time, including when the kernel is in the - middle of switching stacks. Using IST for NMI events avoids making - assumptions about the previous state of the kernel stack. - -* DEBUG_STACK. DEBUG_STKSZ - - Used for hardware debug interrupts (interrupt 1) and for software - debug interrupts (INT3). - - When debugging a kernel, debug interrupts (both hardware and - software) can occur at any time. Using IST for these interrupts - avoids making assumptions about the previous state of the kernel - stack. - -* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 18 - Machine Check Exception (#MC). - - MCE can be delivered at any time, including when the kernel is in the - middle of switching stacks. Using IST for MCE events avoids making - assumptions about the previous state of the kernel stack. - -For more details see the Intel IA32 or AMD AMD64 architecture manuals. -- cgit v1.2.3 From d724a9a52b0026ac6a05440c079c9a618acfd8cf Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Tue, 26 May 2015 10:28:19 +0200 Subject: x86/Documentation: Remove STACKFAULT_STACK bulletpoint Update the documentation after 6f442be2fb22 ("x86_64, traps: Stop using IST for #SS"). Signed-off-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Michal Marek Cc: Peter Zijlstra Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/1432628901-18044-17-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar --- Documentation/x86/kernel-stacks | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/Documentation/x86/kernel-stacks b/Documentation/x86/kernel-stacks index e3c8a49d1a2f..c3c935b9d56e 100644 --- a/Documentation/x86/kernel-stacks +++ b/Documentation/x86/kernel-stacks @@ -1,3 +1,6 @@ +Kernel stacks on x86-64 bit +--------------------------- + Most of the text from Keith Owens, hacked by AK x86_64 page size (PAGE_SIZE) is 4K. @@ -56,13 +59,6 @@ If that assumption is ever broken then the stacks will become corrupt. The currently assigned IST stacks are :- -* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 12 - Stack Fault Exception (#SS). - - This allows the CPU to recover from invalid stack segments. Rarely - happens. - * DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). Used for interrupt 8 - Double Fault Exception (#DF). -- cgit v1.2.3 From 113b5e3720e79ad938374163c1b8e295521dc9cf Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Tue, 26 May 2015 10:28:20 +0200 Subject: x86/Documentation: Adapt Ingo's explanation on printing backtraces Hold it down for future reference, as the question about the question mark in stack traces keeps popping up. Signed-off-by: Borislav Petkov Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Michal Marek Cc: Peter Zijlstra Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/1432628901-18044-18-git-send-email-bp@alien8.de Link: http://lkml.kernel.org/r/20150521101614.GA10889@gmail.com Signed-off-by: Ingo Molnar --- Documentation/x86/kernel-stacks | 44 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/Documentation/x86/kernel-stacks b/Documentation/x86/kernel-stacks index c3c935b9d56e..0f3a6c201943 100644 --- a/Documentation/x86/kernel-stacks +++ b/Documentation/x86/kernel-stacks @@ -95,3 +95,47 @@ The currently assigned IST stacks are :- assumptions about the previous state of the kernel stack. For more details see the Intel IA32 or AMD AMD64 architecture manuals. + + +Printing backtraces on x86 +-------------------------- + +The question about the '?' preceding function names in an x86 stacktrace +keeps popping up, here's an indepth explanation. It helps if the reader +stares at print_context_stack() and the whole machinery in and around +arch/x86/kernel/dumpstack.c. + +Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>: + +We always scan the full kernel stack for return addresses stored on +the kernel stack(s) [*], from stack top to stack bottom, and print out +anything that 'looks like' a kernel text address. + +If it fits into the frame pointer chain, we print it without a question +mark, knowing that it's part of the real backtrace. + +If the address does not fit into our expected frame pointer chain we +still print it, but we print a '?'. It can mean two things: + + - either the address is not part of the call chain: it's just stale + values on the kernel stack, from earlier function calls. This is + the common case. + + - or it is part of the call chain, but the frame pointer was not set + up properly within the function, so we don't recognize it. + +This way we will always print out the real call chain (plus a few more +entries), regardless of whether the frame pointer was set up correctly +or not - but in most cases we'll get the call chain right as well. The +entries printed are strictly in stack order, so you can deduce more +information from that as well. + +The most important property of this method is that we _never_ lose +information: we always strive to print _all_ addresses on the stack(s) +that look like kernel text addresses, so if debug information is wrong, +we still print out the real call chain as well - just with more question +marks than ideal. + +[*] For things like IRQ and IST stacks, we also scan those stacks, in + the right order, and try to cross from one stack into another + reconstructing the call chain. This works most of the time. -- cgit v1.2.3