summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2016-04-034-13/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "This lot contains: - Some fixups for the fallout of the topology consolidation which unearthed AMD/Intel inconsistencies - Documentation for the x86 topology management - Support for AMD advanced power management bits - Two simple cleanups removing duplicated code" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add advanced power management bits x86/thread_info: Merge two !__ASSEMBLY__ sections x86/cpufreq: Remove duplicated TDP MSR macro definitions x86/Documentation: Start documenting x86 topology x86/cpu: Get rid of compute_unit_id perf/x86/amd: Cleanup Fam10h NB event constraints x86/topology: Fix AMD core count
| * x86/cpu: Add advanced power management bitsHuang Rui2016-03-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bit 11 of CPUID 8000_0007 edx is processor feedback interface. Bit 12 of CPUID 8000_0007 edx is accumulated power. Print proper names in proc/cpuinfo Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Huang Rui <ray.huang@amd.com> Cc: Tony Li <tony.li@amd.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Sherry Hurwitz <sherry.hurwitz@amd.com> Cc: Borislav Petkov <bp@suse.de> Cc: "Len Brown" <lenb@kernel.org> Link: http://lkml.kernel.org/r/1458871720-3209-1-git-send-email-ray.huang@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * x86/cpu: Get rid of compute_unit_idBorislav Petkov2016-03-293-12/+6
| | | | | | | | | | | | | | | | | | It is cpu_core_id anyway. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1458917557-8757-3-git-send-email-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * x86/topology: Fix AMD core countPeter Zijlstra2016-03-291-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out AMD gets x86_max_cores wrong when there are compute units. The issue is that Linux assumes: nr_logical_cpus = nr_cores * nr_siblings But AMD reports its CU unit as 2 cores, but then sets num_smp_siblings to 2 as well. Boris: fixup ras/mce_amd_inj.c too, to compute the Node Base Core properly, according to the new nomenclature. Fixes: 1f12e32f4cd5 ("x86/topology: Create logical package id") Reported-by: Xiong Zhou <jencce.kernel@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andreas Herrmann <aherrmann@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/20160317095220.GO6344@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | Merge branch 'acpi-processor'Rafael J. Wysocki2016-04-021-0/+3
|\ \ | |/ |/| | | | | * acpi-processor: ACPI / processor: Request native thermal interrupt handling via _OSC
| * ACPI / processor: Request native thermal interrupt handling via _OSCSrinivas Pandruvada2016-03-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several reports of freeze on enabling HWP (Hardware PStates) feature on Skylake-based systems by the Intel P-states driver. The root cause is identified as the HWP interrupts causing BIOS code to freeze. HWP interrupts use the thermal LVT which can be handled by Linux natively, but on the affected Skylake-based systems SMM will respond to it by default. This is a problem for several reasons: - On the affected systems the SMM thermal LVT handler is broken (it will crash when invoked) and a BIOS update is necessary to fix it. - With thermal interrupt handled in SMM we lose all of the reporting features of the arch/x86/kernel/cpu/mcheck/therm_throt driver. - Some thermal drivers like x86-package-temp depend on the thermal threshold interrupts signaled via the thermal LVT. - The HWP interrupts are useful for debugging and tuning performance (if the kernel can handle them). The native handling of thermal interrupts needs to be enabled because of that. This requires some way to tell SMM that the OS can handle thermal interrupts. That can be done by using _OSC/_PDC in processor scope very early during ACPI initialization. The meaning of _OSC/_PDC bit 12 in processor scope is whether or not the OS supports native handling of interrupts for Collaborative Processor Performance Control (CPPC) notifications. Since on HWP-capable systems CPPC is a firmware interface to HWP, setting this bit effectively tells the firmware that the OS will handle thermal interrupts natively going forward. For details on _OSC/_PDC refer to: http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html To implement the _OSC/_PDC handshake as described, introduce a new function, acpi_early_processor_osc(), that walks the ACPI namespace looking for ACPI processor objects and invokes _OSC for them with bit 12 in the capabilities buffer set and terminates the namespace walk on the first success. Also modify intel_thermal_interrupt() to clear HWP status bits in the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents them from firing continuously). Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Subject & changelog, function rename ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | mm, kasan: stackdepot implementation. Enable stackdepot for SLABAlexander Potapenko2016-03-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot will allow KASAN store allocation/deallocation stack traces for memory chunks. The stack traces are stored in a hash table and referenced by handles which reside in the kasan_alloc_meta and kasan_free_meta structures in the allocated memory chunks. IRQ stack traces are cut below the IRQ entry point to avoid unnecessary duplication. Right now stackdepot support is only enabled in SLAB allocator. Once KASAN features in SLAB are on par with those in SLUB we can switch SLUB to stackdepot as well, thus removing the dependency on SLUB stack bookkeeping, which wastes a lot of memory. This patch is based on the "mm: kasan: stack depots" patch originally prepared by Dmitry Chernenkov. Joonsoo has said that he plans to reuse the stackdepot code for the mm/page_owner.c debugging facility. [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t] [aryabinin@virtuozzo.com: comment style fixes] Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | arch, ftrace: for KASAN put hard/soft IRQ entries into separate sectionsAlexander Potapenko2016-03-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KASAN needs to know whether the allocation happens in an IRQ handler. This lets us strip everything below the IRQ entry point to reduce the number of unique stack traces needed to be stored. Move the definition of __irq_entry to <linux/interrupt.h> so that the users don't need to pull in <linux/ftrace.h>. Also introduce the __softirq_entry macro which is similar to __irq_entry, but puts the corresponding functions to the .softirqentry.text section. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'trace-v4.6' of ↵Linus Torvalds2016-03-241-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Nothing major this round. Mostly small clean ups and fixes. Some visible changes: - A new flag was added to distinguish traces done in NMI context. - Preempt tracer now shows functions where preemption is disabled but interrupts are still enabled. Other notes: - Updates were done to function tracing to allow better performance with perf. - Infrastructure code has been added to allow for a new histogram feature for recording live trace event histograms that can be configured by simple user commands. The feature itself was just finished, but needs a round in linux-next before being pulled. This only includes some infrastructure changes that will be needed" * tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits) tracing: Record and show NMI state tracing: Fix trace_printk() to print when not using bprintk() tracing: Remove redundant reset per-CPU buff in irqsoff tracer x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c tracing: Fix crash from reading trace_pipe with sendfile tracing: Have preempt(irqs)off trace preempt disabled functions tracing: Fix return while holding a lock in register_tracer() ftrace: Use kasprintf() in ftrace_profile_tracefs() ftrace: Update dynamic ftrace calls only if necessary ftrace: Make ftrace_hash_rec_enable return update bool tracing: Fix typoes in code comment and printk in trace_nop.c tracing, writeback: Replace cgroup path to cgroup ino tracing: Use flags instead of bool in trigger structure tracing: Add an unreg_all() callback to trigger commands tracing: Add needs_rec flag to event triggers tracing: Add a per-event-trigger 'paused' field tracing: Add get_syscall_name() tracing: Add event record param to trigger_ops.func() tracing: Make event trigger functions available tracing: Make ftrace_event_field checking functions available ...
| * | x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.cLi Bin2016-03-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Fix the misleading comment for arch/x86/kernel/ftrace.c that it had used nop instead of jmp. Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | | Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2016-03-242-3/+19
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This tree contains various perf fixes on the kernel side, plus three hw/event-enablement late additions: - Intel Memory Bandwidth Monitoring events and handling - the AMD Accumulated Power Mechanism reporting facility - more IOMMU events ... and a final round of perf tooling updates/fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) perf llvm: Use strerror_r instead of the thread unsafe strerror one perf llvm: Use realpath to canonicalize paths perf tools: Unexport some methods unused outside strbuf.c perf probe: No need to use formatting strbuf method perf help: Use asprintf instead of adhoc equivalents perf tools: Remove unused perf_pathdup, xstrdup functions perf tools: Do not include stringify.h from the kernel sources tools include: Copy linux/stringify.h from the kernel tools lib traceevent: Remove redundant CPU output perf tools: Remove needless 'extern' from function prototypes perf tools: Simplify die() mechanism perf tools: Remove unused DIE_IF macro perf script: Remove lots of unused arguments perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve perf machine: Rename perf_event__preprocess_sample to machine__resolve perf tools: Add cpumode to struct perf_sample perf tests: Forward the perf_sample in the dwarf unwind test perf tools: Remove misplaced __maybe_unused perf list: Fix documentation of :ppp perf bench numa: Fix assertion for nodes bitfield ...
| * | | x86/cpufeature, perf/x86: Add AMD Accumulated Power Mechanism feature flagHuang Rui2016-03-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AMD CPU family 15h model 0x60 introduces a mechanism for measuring accumulated power. It is used to report the processor power consumption and support for it is indicated by CPUID Fn8000_0007_EDX[12]. Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andreas Herrmann <herrmann.der.user@googlemail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Jacob Shin <jacob.w.shin@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kristen Carlson Accardi <kristen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wan Zongshun <Vincent.Wan@amd.com> Cc: spg_linux_kernel@amd.com Link: http://lkml.kernel.org/r/1452739808-11871-4-git-send-email-ray.huang@amd.com [ Resolved conflict and moved the synthetic CPUID slot to 19. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | perf/x86/amd: Move nodes_per_socket into bsp_init_amd()Huang Rui2016-03-211-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nodes_per_socket is static and it needn't be initialized many times during every CPU core init. So move its initialization into bsp_init_amd(). Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andreas Herrmann <herrmann.der.user@googlemail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Jacob Shin <jacob.w.shin@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: spg_linux_kernel@amd.com Link: http://lkml.kernel.org/r/1452739808-11871-2-git-send-email-ray.huang@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and initVikas Shivappa2016-03-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MBM init patch enumerates the Intel MBM (Memory b/w monitoring) and initializes the perf events and datastructures for monitoring the memory b/w. Its based on original patch series by Tony Luck and Kanaka Juvva. Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor bandwidth from one level of cache to another. The current patches support L3 external bandwidth monitoring. It supports both 'local bandwidth' and 'total bandwidth' monitoring for the socket. Local bandwidth measures the amount of data sent through the memory controller on the socket and total b/w measures the total system bandwidth. Extending the cache quality of service monitoring (CQM) we add two more events to the perf infrastructure: intel_cqm_llc/local_bytes - bytes sent through local socket memory controller intel_cqm_llc/total_bytes - total L3 external bytes sent The tasks are associated with a Resouce Monitoring ID (RMID) just like in CQM and OS uses a MSR write to indicate the RMID of the task during scheduling. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fenghua.yu@intel.com Cc: h.peter.anvin@intel.com Cc: ravi.v.shankar@intel.com Cc: vikas.shivappa@intel.com Link: http://lkml.kernel.org/r/1457652732-4499-4-git-send-email-vikas.shivappa@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2016-03-2419-48/+136
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: - fix hotplug bugs - fix irq live lock - fix various topology handling bugs - fix APIC ACK ordering - fix PV iopl handling - fix speling - fix/tweak memcpy_mcsafe() return value - fix fbcon bug - remove stray prototypes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/msr: Remove unused native_read_tscp() x86/apic: Remove declaration of unused hw_nmi_is_cpu_stuck x86/oprofile/nmi: Add missing hotplug FROZEN handling x86/hpet: Use proper mask to modify hotplug action x86/apic/uv: Fix the hotplug notifier x86/apb/timer: Use proper mask to modify hotplug action x86/topology: Use total_cpus not nr_cpu_ids for logical packages x86/topology: Fix Intel HT disable x86/topology: Fix logical package mapping x86/irq: Cure live lock in fixup_irqs() x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known() x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt() x86/iopl: Fix iopl capability check on Xen PV x86/iopl/64: Properly context-switch IOPL on Xen PV selftests/x86: Add an iopl test x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe() x86/video: Don't assume all FB devices are PCI devices arch/x86/irq: Purge useless handler declarations from hw_irq.h x86: Fix misspellings in comments
| * | | x86/hpet: Use proper mask to modify hotplug actionThomas Gleixner2016-03-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Magic hex constants are a guarantee for wreckage when the defines change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | x86/apic/uv: Fix the hotplug notifierThomas Gleixner2016-03-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The notifier is missing the CPU_DOWN_FAILED transition. That leaves the heartbeat disabled when CPU_DOWN_PREPARE fails. It also does not handle the FROZEN transition variants. That might not be an issue for UV, but it's inconsistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dimitri Sivanich <sivanich@sgi.com>
| * | | x86/apb/timer: Use proper mask to modify hotplug actionThomas Gleixner2016-03-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Magic hex constants are a guarantee for wreckage when the defines change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | x86/topology: Use total_cpus not nr_cpu_ids for logical packagesThomas Gleixner2016-03-191-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nr_cpu_ids can be limited on the command line via nr_cpus=. That can break the logical package management because it results in a smaller number of packages, but the cpus to online are occupying the full package space as the hyper threads are enumerated after the physical cores typically. total_cpus is the real possible cpu space not limited by nr_cpus command line and gives us the proper number of packages. Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Fixes: 1f12e32f4cd5 ("x86/topology: Create logical package id") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Xiong Zhou <jencce.kernel@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andreas Herrmann <aherrmann@suse.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603181254330.3978@nanos
| * | | x86/topology: Fix Intel HT disablePeter Zijlstra2016-03-191-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As per the comment in the code; due to BIOS it is sometimes impossible to know if there actually are smp siblings until the machine is fully enumerated. So we rather overestimate the number of possible packages. Fixes: 1f12e32f4cd5 ("x86/topology: Create logical package id") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: aherrmann@suse.com Cc: jencce.kernel@gmail.com Cc: bp@alien8.de Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20160318150538.611014173@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | x86/topology: Fix logical package mappingPeter Zijlstra2016-03-191-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | That first branch testing pkg against __max_logical_packages is wrong, because if the first pkg id is larger, then the find_first_zero will find us logical package id 0. However, if the second pkg id is indeed 0, we'll again claim it without testing if it was already taken. Also, it fails to print the mapping. Fixes: 1f12e32f4cd5 ("x86/topology: Create logical package id") Reported-by: Xiong Zhou <jencce.kernel@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: aherrmann@suse.com Cc: bp@alien8.de Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20160317095220.GO6344@twins.programming.kicks-ass.net Link: http://lkml.kernel.org/r/20160318150538.482393396@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | x86/irq: Cure live lock in fixup_irqs()Thomas Gleixner2016-03-181-18/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Harry reported, that he's able to trigger a system freeze with cpu hot unplug. The freeze turned out to be a live lock caused by recent changes in irq_force_complete_move(). When fixup_irqs() and from there irq_force_complete_move() is called on the dying cpu, then all other cpus are in stop machine an wait for the dying cpu to complete the teardown. If there is a move of an interrupt pending then irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain mask and waits for them to clear the mask. That's obviously impossible as those cpus are firmly stuck in stop machine with interrupts disabled. I should have known that, but I completely overlooked it being concentrated on the locking issues around the vectors. And the existance of the call to __irq_complete_move() in the code, which actually sends the cleanup IPI made it reasonable to wait for that cleanup to complete. That call was bogus even before the recent changes as it was just a pointless distraction. We have to look at two cases: 1) The move_in_progress flag of the interrupt is set This means the ioapic has been updated with the new vector, but it has not fired yet. In theory there is a race: set_ioapic(new_vector) <-- Interrupt is raised before update is effective, i.e. it's raised on the old vector. So if the target cpu cannot handle that interrupt before the old vector is cleaned up, we get a spurious interrupt and in the worst case the ioapic irq line becomes stale, but my experiments so far have only resulted in spurious interrupts. But in case of cpu hotplug this should be a non issue because if the affinity update happens right before all cpus rendevouz in stop machine, there is no way that the interrupt can be blocked on the target cpu because all cpus loops first with interrupts enabled in stop machine, so the old vector is not yet cleaned up when the interrupt fires. So the only way to run into this issue is if the delivery of the interrupt on the apic/system bus would be delayed beyond the point where the target cpu disables interrupts in stop machine. I doubt that it can happen, but at least there is a theroretical chance. Virtualization might be able to expose this, but AFAICT the IOAPIC emulation is not as stupid as the real hardware. I've spent quite some time over the weekend to enforce that situation, though I was not able to trigger the delayed case. 2) The move_in_progress flag is not set and the old_domain cpu mask is not empty. That means, that an interrupt was delivered after the change and the cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have responded to it yet. In both cases we can assume that the next interrupt will arrive on the new vector, so we can cleanup the old vectors on the cpus in the old_domain cpu mask. Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race" Reported-by: Harry Junior <harryjr@outlook.fr> Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Joe Lawrence <joe.lawrence@stratus.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known()Thomas Gleixner2016-03-181-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The topology_core_cpumask is used to find a neighbour cpu in calibrate_delay_is_known(). It might not be allocated at the first invocation of that function on the boot cpu, when CONFIG_CPUMASK_OFFSTACK is set. The mask is allocated later in native_smp_prepare_cpus. As a consequence the underlying find_next_bit() call dereferences a NULL pointer. Add a proper check to prevent this. Fixes: c25323c07345 "x86/tsc: Use topology functions" Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603180843270.3978@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | x86/iopl: Fix iopl capability check on Xen PVAndy Lutomirski2016-03-171-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iopl(3) is supposed to work if iopl is already 3, even if unprivileged. This didn't work right on Xen PV. Fix it. Reviewewd-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/8ce12013e6e4c0a44a97e316be4a6faff31bd5ea.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/iopl/64: Properly context-switch IOPL on Xen PVAndy Lutomirski2016-03-171-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On Xen PV, regs->flags doesn't reliably reflect IOPL and the exit-to-userspace code doesn't change IOPL. We need to context switch it manually. I'm doing this without going through paravirt because this is specific to Xen PV. After the dust settles, we can merge this with the 32-bit code, tidy up the iopl syscall implementation, and remove the set_iopl pvop entirely. Fixes XSA-171. Reviewewd-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | Merge branch 'x86/cleanups' into x86/urgentIngo Molnar2016-03-1713-16/+16
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | Pull in some merge window leftovers. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | x86: Fix misspellings in commentsAdam Buchbinder2016-02-2415-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: trivial@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-03-234-1/+15
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge third patch-bomb from Andrew Morton: - more ocfs2 changes - a few hotfixes - Andy's compat cleanups - misc fixes to fatfs, ptrace, coredump, cpumask, creds, eventfd, panic, ipmi, kgdb, profile, kfifo, ubsan, etc. - many rapidio updates: fixes, new drivers. - kcov: kernel code coverage feature. Like gcov, but not "prohibitively expensive". - extable code consolidation for various archs * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits) ia64/extable: use generic search and sort routines x86/extable: use generic search and sort routines s390/extable: use generic search and sort routines alpha/extable: use generic search and sort routines kernel/...: convert pr_warning to pr_warn drivers: dma-coherent: use memset_io for DMA_MEMORY_IO mappings drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MAP memremap: add MEMREMAP_WC flag memremap: don't modify flags kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE mm/mprotect.c: don't imply PROT_EXEC on non-exec fs ipc/sem: make semctl setting sempid consistent ubsan: fix tree-wide -Wmaybe-uninitialized false positives kfifo: fix sparse complaints scripts/gdb: account for changes in module data structure scripts/gdb: add cmdline reader command scripts/gdb: add version command kernel: add kcov code coverage profile: hide unused functions when !CONFIG_PROC_FS hpwdt: use nmi_panic() when kernel panics in NMI handler ...
| * | | | | kernel: add kcov code coverageDmitry Vyukov2016-03-223-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kcov provides code coverage collection for coverage-guided fuzzing (randomized testing). Coverage-guided fuzzing is a testing technique that uses coverage feedback to determine new interesting inputs to a system. A notable user-space example is AFL (http://lcamtuf.coredump.cx/afl/). However, this technique is not widely used for kernel testing due to missing compiler and kernel support. kcov does not aim to collect as much coverage as possible. It aims to collect more or less stable coverage that is function of syscall inputs. To achieve this goal it does not collect coverage in soft/hard interrupts and instrumentation of some inherently non-deterministic or non-interesting parts of kernel is disbled (e.g. scheduler, locking). Currently there is a single coverage collection mode (tracing), but the API anticipates additional collection modes. Initially I also implemented a second mode which exposes coverage in a fixed-size hash table of counters (what Quentin used in his original patch). I've dropped the second mode for simplicity. This patch adds the necessary support on kernel side. The complimentary compiler support was added in gcc revision 231296. We've used this support to build syzkaller system call fuzzer, which has found 90 kernel bugs in just 2 months: https://github.com/google/syzkaller/wiki/Found-Bugs We've also found 30+ bugs in our internal systems with syzkaller. Another (yet unexplored) direction where kcov coverage would greatly help is more traditional "blob mutation". For example, mounting a random blob as a filesystem, or receiving a random blob over wire. Why not gcov. Typical fuzzing loop looks as follows: (1) reset coverage, (2) execute a bit of code, (3) collect coverage, repeat. A typical coverage can be just a dozen of basic blocks (e.g. an invalid input). In such context gcov becomes prohibitively expensive as reset/collect coverage steps depend on total number of basic blocks/edges in program (in case of kernel it is about 2M). Cost of kcov depends only on number of executed basic blocks/edges. On top of that, kernel requires per-thread coverage because there are always background threads and unrelated processes that also produce coverage. With inlined gcov instrumentation per-thread coverage is not possible. kcov exposes kernel PCs and control flow to user-space which is insecure. But debugfs should not be mapped as user accessible. Based on a patch by Quentin Casasnovas. [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode'] [akpm@linux-foundation.org: unbreak allmodconfig] [akpm@linux-foundation.org: follow x86 Makefile layout standards] Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: syzkaller <syzkaller@googlegroups.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tavis Ormandy <taviso@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@google.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Drysdale <drysdale@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | x86/compat: remove is_compat_task()Andy Lutomirski2016-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | x86's is_compat_task always checked the current syscall type, not the task type. It has no non-arch users any more, so just remove it to avoid confusion. On x86, nothing should really be checking the task ABI. There are legitimate users for the syscall ABI and for the mm ABI. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | kvm, rt: change async pagefault code locking for PREEMPT_RTRik van Riel2016-03-221-18/+19
|/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The async pagefault wake code can run from the idle task in exception context, so everything here needs to be made non-preemptible. Conversion to a simple wait queue and raw spinlock does the trick. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | | | Merge branch 'mm-pkeys-for-linus' of ↵Linus Torvalds2016-03-216-6/+301
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ...
| * | | | | mm/core, x86/mm/pkeys: Add execute-only protection keys supportDave Hansen2016-02-181-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Protection keys provide new page-based protection in hardware. But, they have an interesting attribute: they only affect data accesses and never affect instruction fetches. That means that if we set up some memory which is set as "access-disabled" via protection keys, we can still execute from it. This patch uses protection keys to set up mappings to do just that. If a user calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will notice this, and set a special protection key on the memory. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. I haven't found any userspace that does this today. With this facility in place, we expect userspace to move to use it eventually. Userspace _could_ start doing this today. Any PROT_EXEC calls get converted to PROT_READ inside the kernel, and would transparently be upgraded to "true" PROT_EXEC with this code. IOW, userspace never has to do any PROT_EXEC runtime detection. This feature provides enhanced protection against leaking executable memory contents. This helps thwart attacks which are attempting to find ROP gadgets on the fly. But, the security provided by this approach is not comprehensive. The PKRU register which controls access permissions is a normal user register writable from unprivileged userspace. An attacker who can execute the 'wrpkru' instruction can easily disable the protection provided by this feature. The protection key that is used for execute-only support is permanently dedicated at compile time. This is fine for now because there is currently no API to set a protection key other than this one. Despite there being a constant PKRU value across the entire system, we do not set it unless this feature is in use in a process. That is to preserve the PKRU XSAVE 'init state', which can lead to faster context switches. PKRU *is* a user register and the kernel is modifying it. That means that code doing: pkru = rdpkru() pkru |= 0x100; mmap(..., PROT_EXEC); wrpkru(pkru); could lose the bits in PKRU that enforce execute-only permissions. To avoid this, we suggest avoiding ever calling mmap() or mprotect() when the PKRU value is expected to be unstable. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave@sr71.net> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: keescook@google.com Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/mm/pkeys: Allow kernel to modify user pkey rights registerDave Hansen2016-02-181-0/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Protection Key Rights for User memory (PKRU) is a 32-bit user-accessible register. It contains two bits for each protection key: one to write-disable (WD) access to memory covered by the key and another to access-disable (AD). Userspace can read/write the register with the RDPKRU and WRPKRU instructions. But, the register is saved and restored with the XSAVE family of instructions, which means we have to treat it like a floating point register. The kernel needs to write to the register if it wants to implement execute-only memory or if it implements a system call to change PKRU. To do this, we need to create a 'pkru_state' buffer, read the old contents in to it, modify it, and then tell the FPU code that there is modified data in there so it can (possibly) move the buffer back in to the registers. This uses the fpu__xfeature_set_state() function that we defined in the previous patch. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210236.0BE13217@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/fpu: Allow setting of XSAVE stateDave Hansen2016-02-182-2/+159
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to modify the Protection Key rights inside the kernel, so we need to change PKRU's contents. But, if we do a plain 'wrpkru', when we return to userspace we might do an XRSTOR and wipe out the kernel's 'wrpkru'. So, we need to go after PKRU in the xsave buffer. We do this by: 1. Ensuring that we have the XSAVE registers (fpregs) in the kernel FPU buffer (fpstate) 2. Looking up the location of a given state in the buffer 3. Filling in the stat 4. Ensuring that the hardware knows that state is present there (basically that the 'init optimization' is not in place). 5. Copying the newly-modified state back to the registers if necessary. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210235.5A3139BF@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/mm: Factor out LDT init from context initDave Hansen2016-02-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The arch-specific mm_context_t is a great place to put protection-key allocation state. But, we need to initialize the allocation state because pkey 0 is always "allocated". All of the runtime initialization of mm_context_t is done in *_ldt() manipulation functions. This renames the existing LDT functions like this: init_new_context() -> init_new_context_ldt() destroy_context() -> destroy_context_ldt() and makes init_new_context() and destroy_context() available for generic use. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210234.DB34FCC5@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/mm/pkeys: Actually enable Memory Protection Keys in the CPUDave Hansen2016-02-181-0/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This sets the bit in 'cr4' to actually enable the protection keys feature. We also include a boot-time disable for the feature "nopku". Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid bit to appear set. At this point in boot, identify_cpu() has already run the actual CPUID instructions and populated the "cpu features" structures. We need to go back and re-run identify_cpu() to make sure it gets updated values. We *could* simply re-populate the 11th word of the cpuid data, but this is probably quick enough. Also note that with the cpu_has() check and X86_FEATURE_PKU present in disabled-features.h, we do not need an #ifdef for setup_pku(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210229.6708027C@viggo.jf.intel.com [ Small readability edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smapsDave Hansen2016-02-181-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The protection key can now be just as important as read/write permissions on a VMA. We need some debug mechanism to help figure out if it is in play. smaps seems like a logical place to expose it. arch/x86/kernel/setup.c is a bit of a weirdo place to put this code, but it already had seq_file.h and there was not a much better existing place to put it. We also use no #ifdef. If protection keys is .config'd out we will effectively get the same function as if we used the weak generic function. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Joerg Roedel <jroedel@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/mm/pkeys: Dump PKRU with other kernel registersDave Hansen2016-02-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Protection Keys never affect kernel mappings. But, they can affect whether the kernel will fault when it touches a user mapping. The kernel doesn't touch user mappings without some careful choreography and these accesses don't generally result in oopses. But, if one does, we definitely want to have PKRU available so we can figure out if protection keys played a role. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210225.BF0D4482@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structuresDave Hansen2016-02-161-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The protection keys register (PKRU) is saved and restored using xsave. Define the data structure that we will use to access it inside the xsave buffer. Note that we also have to widen the printk of the xsave feature masks since this is feature 0x200 and we only did two characters before. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210204.56DF8F7B@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitionsDave Hansen2016-02-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two CPUID bits for protection keys. One is for whether the CPU contains the feature, and the other will appear set once the OS enables protection keys. Specifically: Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable Protection keys (and the RDPKRU/WRPKRU instructions) This is because userspace can not see CR4 contents, but it can see CPUID contents. X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation: CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3] X86_FEATURE_OSPKE is "OSPKU": CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4] These are the first CPU features which need to look at the ECX word in CPUID leaf 0x7, so this patch also includes fetching that word in to the cpuinfo->x86_capability[] array. Add it to the disabled-features mask when its config option is off. Even though we are not using it here, we also extend the REQUIRED_MASK_BIT_SET() macro to keep it mirroring the DISABLED_MASK_BIT_SET() version. This means that in almost all code, you should use: cpu_has(c, X86_FEATURE_PKU) and *not* the CONFIG option. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210201.7714C250@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/fpu: Add placeholder for 'Processor Trace' XSAVE stateDave Hansen2016-02-161-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an XSAVE state component for Intel Processor Trace (PT). But, we do not currently use it. We add a placeholder in the code for it so it is not a mystery and also so we do not need an explicit enum initialization for Protection Keys in a moment. Why don't we use it? We might end up using this at _some_ point in the future. But, this is a "system" state which requires using the currently unsupported XSAVES feature. Unlike all the other XSAVE states, PT state is also not directly tied to a thread. You might context-switch between threads, but not want to change any of the PT state. Or, you might switch between threads, and *do* want to change PT state, all depending on what is being traced. We currently just manually set some MSRs to do this PT context switching, and it is unclear whether replacing our direct MSR use with XSAVE will be a net win or loss, both in code complexity and performance. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: fenghua.yu@intel.com Cc: linux-mm@kvack.org Cc: yu-cheng.yu@intel.com Link: http://lkml.kernel.org/r/20160212210158.5E4BCAE2@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | | | |
| | \ \ \ \
| | \ \ \ \
| | \ \ \ \
| *---. \ \ \ \ Merge branches 'x86/fpu', 'x86/mm' and 'x86/asm' into x86/pkeysIngo Molnar2016-02-1625-108/+115
| |\ \ \ \ \ \ \ | | |_|_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a stable basis for the pkeys patches, which touches various x86 details. Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | | | Merge branch 'efi-core-for-linus' of ↵Linus Torvalds2016-03-211-0/+1
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The main changes are: - Use separate EFI page tables when executing EFI firmware code. This isolates the EFI context from the rest of the kernel, which has security and general robustness advantages. (Matt Fleming) - Run regular UEFI firmware with interrupts enabled. This is already the status quo under other OSs. (Ard Biesheuvel) - Various x86 EFI enhancements, such as the use of non-executable attributes for EFI memory mappings. (Sai Praneeth Prakhya) - Various arm64 UEFI enhancements. (Ard Biesheuvel) - ... various fixes and cleanups. The separate EFI page tables feature got delayed twice already, because it's an intrusive change and we didn't feel confident about it - third time's the charm we hope!" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) x86/mm/pat: Fix boot crash when 1GB pages are not supported by the CPU x86/efi: Only map kernel text for EFI mixed mode x86/efi: Map EFI_MEMORY_{XP,RO} memory region bits to EFI page tables x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd() efi/arm*: Perform hardware compatibility check efi/arm64: Check for h/w support before booting a >4 KB granular kernel efi/arm: Check for LPAE support before booting a LPAE kernel efi/arm-init: Use read-only early mappings efi/efistub: Prevent __init annotations from being used arm64/vmlinux.lds.S: Handle .init.rodata.xxx and .init.bss sections efi/arm64: Drop __init annotation from handle_kernel_image() x86/mm/pat: Use _PAGE_GLOBAL bit for EFI page table mappings efi/runtime-wrappers: Run UEFI Runtime Services with interrupts enabled efi: Reformat GUID tables to follow the format in UEFI spec efi: Add Persistent Memory type name efi: Add NV memory attribute x86/efi: Show actual ending addresses in efi_print_memmap x86/efi/bgrt: Don't ignore the BGRT if the 'valid' bit is 0 efivars: Use to_efivar_entry efi: Runtime-wrapper: Get rid of the rtc_lock spinlock ...
| * \ \ \ \ \ \ \ Merge tag 'v4.5-rc5' into efi/core, before queueing up new changesIngo Molnar2016-02-221-0/+2
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * \ \ \ \ \ \ \ \ Merge branch 'linus' into efi/core, to refresh the branch and to pick up ↵Ingo Molnar2016-02-0377-648/+1618
| |\ \ \ \ \ \ \ \ \ | | | |/ / / / / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | recent fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | x86/mm: Page align the '_end' symbol to avoid pfn conversion bugsMatt Fleming2015-11-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ingo noted that if we can guarantee _end is aligned to PAGE_SIZE we can automatically avoid bugs along the lines of, size = _end - _text >> PAGE_SHIFT which is missing a call to PFN_ALIGN(). The EFI mixed mode contains this bug, for example. _text is already aligned to PAGE_SIZE through the use of LOAD_PHYSICAL_ADDR, and the BSS and BRK sections are explicitly aligned in the linker script, so it makes sense to align _end to match. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1448658575-17029-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | | | | | Merge branch 'core-objtool-for-linus' of ↵Linus Torvalds2016-03-215-34/+49
|\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull 'objtool' stack frame validation from Ingo Molnar: "This tree adds a new kernel build-time object file validation feature (ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation. It was written by and is maintained by Josh Poimboeuf. The motivation: there's a category of hard to find kernel bugs, most of them in assembly code (but also occasionally in C code), that degrades the quality of kernel stack dumps/backtraces. These bugs are hard to detect at the source code level. Such bugs result in incorrect/incomplete backtraces most of time - but can also in some rare cases result in crashes or other undefined behavior. The build time correctness checking is done via the new 'objtool' user-space utility that was written for this purpose and which is hosted in the kernel repository in tools/objtool/. The tool's (very simple) UI and source code design is shaped after Git and perf and shares quite a bit of infrastructure with tools/perf (which tooling infrastructure sharing effort got merged via perf and is already upstream). Objtool follows the well-known kernel coding style. Objtool does not try to check .c or .S files, it instead analyzes the resulting .o generated machine code from first principles: it decodes the instruction stream and interprets it. (Right now objtool supports the x86-64 architecture.) From tools/objtool/Documentation/stack-validation.txt: "The kernel CONFIG_STACK_VALIDATION option enables a host tool named objtool which runs at compile time. It has a "check" subcommand which analyzes every .o file and ensures the validity of its stack metadata. It enforces a set of rules on asm code and C inline assembly code so that stack traces can be reliable. Currently it only checks frame pointer usage, but there are plans to add CFI validation for C files and CFI generation for asm files. For each function, it recursively follows all possible code paths and validates the correct frame pointer state at each instruction. It also follows code paths involving special sections, like .altinstructions, __jump_table, and __ex_table, which can add alternative execution paths to a given instruction (or set of instructions). Similarly, it knows how to follow switch statements, for which gcc sometimes uses jump tables." When this new kernel option is enabled (it's disabled by default), the tool, if it finds any suspicious assembly code pattern, outputs warnings in compiler warning format: warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup warning: objtool:__schedule()+0x3c0: duplicate frame pointer save warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer ... so that scripts that pick up compiler warnings will notice them. All known warnings triggered by the tool are fixed by the tree, most of the commits in fact prepare the kernel to be warning-free. Most of them are bugfixes or cleanups that stand on their own, but there are also some annotations of 'special' stack frames for justified cases such entries to JIT-ed code (BPF) or really special boot time code. There are two other long-term motivations behind this tool as well: - To improve the quality and reliability of kernel stack frames, so that they can be used for optimized live patching. - To create independent infrastructure to check the correctness of CFI stack frames at build time. CFI debuginfo is notoriously unreliable and we cannot use it in the kernel as-is without extra checking done both on the kernel side and on the build side. The quality of kernel stack frames matters to debuggability as well, so IMO we can merge this without having to consider the live patching or CFI debuginfo angle" * 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) objtool: Only print one warning per function objtool: Add several performance improvements tools: Copy hashtable.h into tools directory objtool: Fix false positive warnings for functions with multiple switch statements objtool: Rename some variables and functions objtool: Remove superflous INIT_LIST_HEAD objtool: Add helper macros for traversing instructions objtool: Fix false positive warnings related to sibling calls objtool: Compile with debugging symbols objtool: Detect infinite recursion objtool: Prevent infinite recursion in noreturn detection objtool: Detect and warn if libelf is missing and don't break the build tools: Support relative directory path for 'O=' objtool: Support CROSS_COMPILE x86/asm/decoder: Use explicitly signed chars objtool: Enable stack metadata validation on 64-bit x86 objtool: Add CONFIG_STACK_VALIDATION option objtool: Add tool to perform compile-time stack metadata validation x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard sched: Always inline context_switch() ...
| * | | | | | | | | | x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standardJosh Poimboeuf2016-02-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | objtool reports the following warning for kretprobe_trampoline(): arch/x86/kernel/kprobes/core.o: warning: objtool: kretprobe_trampoline()+0x20: call without frame pointer save/setup kretprobes are a special case where the stack is intentionally wrong. The return address isn't known at the beginning of the trampoline, so the stack frame can't be set up properly before it calls trampoline_handler(). Because kretprobe handlers don't sleep, the frame pointer doesn't *have* to be accurate in the trampoline. So it's ok to tell objtool to ignore it. This results in no actual changes to the generated code. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris J Arges <chris.j.arges@canonical.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Marek <mmarek@suse.cz> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/7eaf37de52456ff822ffc86b928edb5d48a40ef1.1456719558.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | objtool: Add STACK_FRAME_NON_STANDARD() macroJosh Poimboeuf2016-02-291-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new macro, STACK_FRAME_NON_STANDARD(), which is used to denote a function which does something unusual related to its stack frame. Use of the macro prevents objtool from emitting a false positive warning. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris J Arges <chris.j.arges@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/34487a17b23dba43c50941599d47054a9584b219.1456719558.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>