summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* x86/acpi: Set persistent cpuid <-> nodeid mapping when bootingGu Zheng2016-09-216-2/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that, when node online/offline happens, cache based on cpuid <-> nodeid mapping such as wq_numa_possible_cpumask will not cause any problem. It contains 4 steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. 4. Establish all possible cpuid <-> nodeid mapping. This patch finishes step 4. This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled processors at boot time via an additional acpi namespace walk for processors. [ tglx: Remove the unneeded exports ] Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-6-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/acpi: Enable MADT APIs to return disabled apicidsGu Zheng2016-09-212-23/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that, when node online/offline happens, cache based on cpuid <-> nodeid mapping such as wq_numa_possible_cpumask will not cause any problem. It contains 4 steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. 4. Establish all possible cpuid <-> nodeid mapping. This patch finishes step 3. There are four mappings in the kernel: 1. nodeid (logical node id) <-> pxm (persistent) 2. apicid (physical cpu id) <-> nodeid (persistent) 3. cpuid (logical cpu id) <-> apicid (not persistent, now persistent by step 2) 4. cpuid (logical cpu id) <-> nodeid (not persistent) So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs, we should: 1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in step 1, 2. 2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we should obtain all apicids from MADT. All processors' apicids can be obtained by _MAT method or from MADT in ACPI. The current code ignores disabled processors and returns -ENODEV. After this patch, a new parameter will be added to MADT APIs so that caller is able to control if disabled processors are ignored. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-5-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/acpi: Introduce persistent storage for cpuid <-> apicid mappingGu Zheng2016-09-213-9/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that, when node online/offline happens, cache based on cpuid <-> nodeid mapping such as wq_numa_possible_cpumask will not cause any problem. It contains 4 steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. 4. Establish all possible cpuid <-> nodeid mapping. This patch finishes step 2. In this patch, we introduce a new static array named cpuid_to_apicid[], which is large enough to store info for all possible cpus. And then, we modify the cpuid calculation. In generic_processor_info(), it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid mapping changes with node hotplug. After this patch, we find the next unused cpuid, map it to an apicid, and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid mapping will be persistent. And finally we will use this array to make cpuid <-> nodeid persistent. cpuid <-> apicid mapping is established at local apic registeration time. But non-present or disabled cpus are ignored. In this patch, we establish all possible cpuid <-> apicid mapping when registering local apic. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-4-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/acpi: Enable acpi to register all possible cpus at boot timeGu Zheng2016-09-211-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time. When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed, which means, cpuid <-> nodeid mapping will change if node hotplug happens. But workqueue does not update wq_numa_possible_cpumask. So here is the problem: Assume we have the following cpuid <-> nodeid in the beginning: Node | CPU ------------------------ node 0 | 0-14, 60-74 node 1 | 15-29, 75-89 node 2 | 30-44, 90-104 node 3 | 45-59, 105-119 and we hot-remove node2 and node3, it becomes: Node | CPU ------------------------ node 0 | 0-14, 60-74 node 1 | 15-29, 75-89 and we hot-add node4 and node5, it becomes: Node | CPU ------------------------ node 0 | 0-14, 60-74 node 1 | 15-29, 75-89 node 4 | 30-59 node 5 | 90-119 But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like. When a pool workqueue is initialized, if its cpumask belongs to a node, its pool->node will be mapped to that node. And memory used by this workqueue will also be allocated on that node. static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){ ... /* if cpumask is contained inside a NUMA node, we belong to that node */ if (wq_numa_enabled) { for_each_node(node) { if (cpumask_subset(pool->attrs->cpumask, wq_numa_possible_cpumask[node])) { pool->node = node; break; } } } Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node, which will lead to memory allocation failure: SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 It happens here: create_worker(struct worker_pool *pool) |--> worker = alloc_worker(pool->node); static struct worker *alloc_worker(int node) { struct worker *worker; worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, useing the wrong node. ...... return worker; } [Solution] There are four mappings in the kernel: 1. nodeid (logical node id) <-> pxm 2. apicid (physical cpu id) <-> nodeid 3. cpuid (logical cpu id) <-> apicid 4. cpuid (logical cpu id) <-> nodeid 1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm mapping is setup at boot time. This mapping is persistent, won't change. 2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also persistent. 3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is allocated, lower ids first, and released at CPU hotremove time, reused for other hotadded CPUs. So this mapping is not persistent. 4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and cleared at CPU hotremove time. As a result of 3, this mapping is not persistent. To fix this problem, we establish cpuid <-> nodeid mapping for all the possible cpus at boot time, and make it persistent. And according to init_cpu_to_node(), cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid mapping. So the key point is obtaining all cpus' apicid. apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in MADT (Multiple APIC Description Table). So we finish the job in the following steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. This is done by introducing an extra parameter to generic_processor_info to let the caller control if disabled cpus are ignored. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when registering local apic. Store the mapping in this array. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. This is also done by introducing an extra parameter to these apis to let the caller control if disabled cpus are ignored. 4. Establish all possible cpuid <-> nodeid mapping. This is done via an additional acpi namespace walk for processors. This patch finished step 1. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-3-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/numa: Online memory-less nodes at boot timeTang Chen2016-09-211-14/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For now, x86 does not support memory-less node. A node without memory will not be onlined, and the cpus on it will be mapped to the other online nodes with memory in init_cpu_to_node(). The reason of doing this is to ensure each cpu has mapped to a node with memory, so that it will be able to allocate local memory for that cpu. But we don't have to do it in this way. In this series of patches, we are going to construct cpu <-> node mapping for all possible cpus at boot time, which is a persistent mapping. It means that the cpu will be mapped to the node which it belongs to, and will never be changed. If a node has only cpus but no memory, the cpus on it will be mapped to a memory-less node. And the memory-less node should be onlined. Allocate pgdats for all memory-less nodes and online them at boot time. Then build zonelists for these nodes. As a result, when cpus on these memory-less nodes try to allocate memory from local node, it will automatically fall back to the proper zones in the zonelists. Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-2-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/apic: Get rid of apic_version[] arrayDenys Vlasenko2016-09-206-20/+17
| | | | | | | | | | | | | | | | | | | | | | | | The array has a size of MAX_LOCAL_APIC, which can be as large as 32k, so it can consume up to 128k. The array has been there forever and was never used for anything useful other than a version mismatch check which was introduced in 2009. There is no reason to store the version in an array. The kernel is not prepared to handle different APIC versions anyway, so the real important part is to detect a version mismatch and warn about it, which can be done with a single variable as well. [ tglx: Massaged changelog ] Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Andy Lutomirski <luto@amacapital.net> CC: Borislav Petkov <bp@alien8.de> CC: Brian Gerst <brgerst@gmail.com> CC: Mike Travis <travis@sgi.com> Link: http://lkml.kernel.org/r/20160913181232.30815-1-dvlasenk@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()Wanpeng Li2016-09-201-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | =============================== [ INFO: suspicious RCU usage. ] 4.8.0-rc6+ #5 Not tainted ------------------------------- ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage! other info that might help us debug this: RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0 RCU used illegally from extended quiescent state! no locks held by swapper/2/0. stack backtrace: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0-rc6+ #5 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 0000000000000000 ffff8d1bd6003f10 ffffffff94446949 ffff8d1bd4a68000 0000000000000001 ffff8d1bd6003f40 ffffffff940e9247 ffff8d1bbdfcf3d0 000000000000080b 0000000000000000 0000000000000000 ffff8d1bd6003f70 Call Trace: <IRQ> [<ffffffff94446949>] dump_stack+0x99/0xd0 [<ffffffff940e9247>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff9448e0d5>] do_trace_write_msr+0x135/0x140 [<ffffffff9406e750>] native_write_msr+0x20/0x30 [<ffffffff9406503d>] native_apic_msr_eoi_write+0x1d/0x30 [<ffffffff9405b17e>] smp_trace_call_function_interrupt+0x1e/0x270 [<ffffffff948cb1d6>] trace_call_function_interrupt+0x96/0xa0 <EOI> [<ffffffff947200f4>] ? cpuidle_enter_state+0xe4/0x360 [<ffffffff947200df>] ? cpuidle_enter_state+0xcf/0x360 [<ffffffff947203a7>] cpuidle_enter+0x17/0x20 [<ffffffff940df008>] cpu_startup_entry+0x338/0x4d0 [<ffffffff9405bfc4>] start_secondary+0x154/0x180 This can be reproduced readily by running ftrace test case of kselftest. Move the irq_enter() call before ack_APIC_irq(), because irq_enter() tells the RCU susbstems to end the extended quiescent state, so that the following trace call in ack_APIC_irq() works correctly. The same applies to exiting_ack_irq() which calls ack_APIC_irq() after irq_exit(). [ tglx: Massaged changelog ] Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1474198491-3738-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/ioapic: Ignore root bridges without a companion ACPI deviceRui Wang2016-09-101-1/+4
| | | | | | | | | | | | | | | | | | | | | Some PCI root bridges don't have a corresponding ACPI device. This can be the case on some old platforms. Don't call acpi_ioapic_add() on these bridges because they can't support ioapic hotplug. Reported-and-tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Rui Wang <rui.y.wang@intel.com> Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bhelgaas@google.com Cc: helgaas@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: rjw@rjwysocki.net Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1473522046-31329-1-git-send-email-rui.y.wang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/apic: Update comment about disabling processor focusWei Jiangang2016-08-241-1/+0
| | | | | | | | | | | | Fix references to discarded end_level_ioapic_irq(). Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Link: http://lkml.kernel.org/r/1471576957-12961-2-git-send-email-weijg.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/smpboot: Check APIC ID before setting up default routingWei Jiangang2016-08-241-2/+1
| | | | | | | | | | | | | | | | | | | | | | | This is not a bugfix, but code optimization. If the BSP's APIC ID in local APIC is unexpected, a kernel panic will occur and the system will halt. That means no need to enable APIC mode, and no reason to set up the default routing for APIC. The combination of default_setup_apic_routing() and apic_bsp_setup() are used to enable APIC mode. They two should be kept together, rather than being separated by the codes of checking APIC ID. Just like their usage in APIC_init_uniprocessor(). Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Link: http://lkml.kernel.org/r/1471576957-12961-1-git-send-email-weijg.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/ioapic: Fix IOAPIC failing to request resourceRui Wang2016-08-181-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handle_ioapic_add() uses request_resource() to request ACPI "_CRS" resources. This can fail with the following error message: [ 247.325693] ACPI: \_SB_.IIO1.AID1: failed to insert resource This happens when there are multiple IOAPICs and DSDT groups their "_CRS" resources as the children of a parent resource, as seen from /proc/iomem: fec00000-fecfffff : PNP0003:00 fec00000-fec003ff : IOAPIC 0 fec01000-fec013ff : IOAPIC 1 fec40000-fec403ff : IOAPIC 2 In this case request_resource() fails because there's a conflicting resource which is the parent (fec0000-fecfffff). Fix it by using insert_resource() which can request resources by taking the conflicting resource as the parent. Signed-off-by: Rui Wang <rui.y.wang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bhelgaas@google.com Cc: helgaas@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: rjw@rjwysocki.net Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1471420837-31003-6-git-send-email-rui.y.wang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/ioapic: Fix lost IOAPIC resource after hot-removal and hotaddRui Wang2016-08-181-16/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IOAPIC resource at 0xfecxxxxx gets lost from /proc/iomem after hot-removing and then hot-adding the IOAPIC device. After system boot, in /proc/iomem: fec00000-fecfffff : PNP0003:00 fec00000-fec003ff : IOAPIC 0 fec01000-fec013ff : IOAPIC 1 fec40000-fec403ff : IOAPIC 2 fec80000-fec803ff : IOAPIC 3 fecc0000-fecc03ff : IOAPIC 4 Then hot-remove IOAPIC 2 and hot-add it again: fec00000-fecfffff : PNP0003:00 fec00000-fec003ff : IOAPIC 0 fec01000-fec013ff : IOAPIC 1 fec80000-fec803ff : IOAPIC 3 fecc0000-fecc03ff : IOAPIC 4 The range at 0xfec40000 is lost from /proc/iomem - which is a bug. This bug happens because handle_ioapic_add() requests resources from either PCI config BAR or ACPI "_CRS", not both. But Intel platforms map the IOxAPIC registers both at the PCI config BAR (called MBAR, dynamic), and at the ACPI "_CRS" (called ABAR, static). The 0xfecX_YZ00 to 0xfecX_YZFF range appears in "_CRS" of each IOAPIC device. Both ranges should be claimed from /proc/iomem for exclusive use. Signed-off-by: Rui Wang <rui.y.wang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bhelgaas@google.com Cc: helgaas@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: rjw@rjwysocki.net Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1471420837-31003-5-git-send-email-rui.y.wang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/ioapic: Fix setup_res() failing to get resourceRui Wang2016-08-181-1/+1
| | | | | | | | | | | | | | | | | | | acpi_dev_filter_resource_type() returns 0 on success, and 1 on failure. A return value of zero means there's a matching resource, so we should continue within setup_res() to get the resource. Signed-off-by: Rui Wang <rui.y.wang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bhelgaas@google.com Cc: helgaas@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: rjw@rjwysocki.net Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1471420837-31003-4-git-send-email-rui.y.wang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/ioapic: Support hot-removal of IOAPICs present during bootRui Wang2016-08-182-1/+14
| | | | | | | | | | | | | | | | | | | IOAPICs present during system boot aren't added to ioapic_list, thus are unable to be hot-removed. Fix it by calling acpi_ioapic_add() during root bus enumeration. Signed-off-by: Rui Wang <rui.y.wang@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: helgaas@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: rjw@rjwysocki.net Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1471420837-31003-3-git-send-email-rui.y.wang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/ioapic: Change prototype of acpi_ioapic_add()Rui Wang2016-08-184-6/+10
| | | | | | | | | | | | | | | | | | | Change the argument of acpi_ioapic_add() to a generic ACPI handle, and move its prototype from drivers/acpi/internal.h to include/linux/acpi.h so that it can be called from outside the pci_root driver. Signed-off-by: Rui Wang <rui.y.wang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bhelgaas@google.com Cc: helgaas@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: rjw@rjwysocki.net Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1471420837-31003-2-git-send-email-rui.y.wang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/apic, ACPI: Fix incorrect assignment when handling apic/x2apic entriesBaoquan He2016-08-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | By pure accident the bug makes no functional difference, because the only expression where we are using these values is (!count && !x2count), in which the variables are interchangeable, but it makes sense to fix the bug nevertheless. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-acpi@vger.kernel.org Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1470986507-24191-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/apic, ACPI: Remove the repeated lapic address override entry parsingBaoquan He2016-08-152-16/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ACPI MADT has a 32-bit field providing lapic address at which each processor can access its lapic information. MADT also contains an optional entry to provide a 64-bit address to override the 32-bit one. However the current code does the lapic address override entry parsing twice. One is in early_acpi_boot_init() because AMD NUMA need get boot_cpu_id earlier. The other is in acpi_boot_init() which parses all MADT entries. So in this patch we remove the repeated code in the 2nd part. Meanwhile print lapic override entry information like other MADT entry, this will be added to boot log. This patch is not supposed to change any runtime behavior, other than improving kernel messages. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-acpi@vger.kernel.org Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1470985033-22493-2-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/mm/numa: Open code function early_get_boot_cpu_id()Baoquan He2016-08-153-19/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously early_acpi_boot_init() was called in early_get_boot_cpu_id() to get the value for boot_cpu_physical_apicid. Now early_acpi_boot_init() has been taken out and moved to setup_arch(), the name of early_get_boot_cpu_id() doesn't match its implementation anymore, and only the getting boot-time SMP configuration code was left. So in this patch we open code it. Also move the smp_found_config check into default_get_smp_config to simplify code, because both early_get_smp_config() and get_smp_config() call x86_init.mpparse.get_smp_config(). Also remove the redundent CONFIG_X86_MPPARSE #ifdef check when we call early_get_smp_config(). Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-acpi@vger.kernel.org Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1470985033-22493-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'fixes-for-linus-4.8' of ↵Linus Torvalds2016-08-142-1/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull h8300 and unicore32 architecture fixes from Guenter Roeck: "Two patches to fix h8300 and unicore32 builds. unicore32 builds have been broken since v4.6. The fix has been available in -next since March of this year. h8300 builds have been broken since the last commit window. The fix has been available in -next since June of this year" * tag 'fixes-for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: h8300: Add missing include file to asm/io.h unicore32: mm: Add missing parameter to arch_vma_access_permitted
| * h8300: Add missing include file to asm/io.hGuenter Roeck2016-08-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | h8300 builds fail with arch/h8300/include/asm/io.h:9:15: error: unknown type name ‘u8’ arch/h8300/include/asm/io.h:15:15: error: unknown type name ‘u16’ arch/h8300/include/asm/io.h:21:15: error: unknown type name ‘u32’ and many related errors. Fixes: 23c82d41bdf4 ("kexec-allow-architectures-to-override-boot-mapping-fix") Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
| * unicore32: mm: Add missing parameter to arch_vma_access_permittedGuenter Roeck2016-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | unicore32 fails to compile with the following errors. mm/memory.c: In function ‘__handle_mm_fault’: mm/memory.c:3381: error: too many arguments to function ‘arch_vma_access_permitted’ mm/gup.c: In function ‘check_vma_flags’: mm/gup.c:456: error: too many arguments to function ‘arch_vma_access_permitted’ mm/gup.c: In function ‘vma_permits_fault’: mm/gup.c:640: error: too many arguments to function ‘arch_vma_access_permitted’ Fixes: d61172b4b695b ("mm/core, x86/mm/pkeys: Differentiate instruction fetches") Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
* | Merge tag 'arm64-fixes' of ↵Linus Torvalds2016-08-148-86/+136
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: - support for nr_cpus= command line argument (maxcpus was previously changed to allow secondary CPUs to be hot-plugged) - ARM PMU interrupt handling fix - fix potential TLB conflict in the hibernate code - improved handling of EL1 instruction aborts (better error reporting) - removal of useless jprobes code for stack saving/restoring - defconfig updates * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: defconfig: enable CONFIG_LOCALVERSION_AUTO arm64: defconfig: add options for virtualization and containers arm64: hibernate: handle allocation failures arm64: hibernate: avoid potential TLB conflict arm64: Handle el1 synchronous instruction aborts cleanly arm64: Remove stack duplicating code from jprobes drivers/perf: arm-pmu: Fix handling of SPI lacking "interrupt-affinity" property drivers/perf: arm-pmu: convert arm_pmu_mutex to spinlock arm64: Support hard limit of cpu count by nr_cpus
| * | arm64: defconfig: enable CONFIG_LOCALVERSION_AUTOMasahiro Yamada2016-08-121-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_LOCALVERSION_AUTO is disabled, the version string is just a tag name (or with a '+' appended if HEAD is not a tagged commit). During the development (and especially when git-bisecting), longer version string would be helpful to identify the commit we are running. This is a default y option, so drop the unset to enable it. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | arm64: defconfig: add options for virtualization and containersRiku Voipio2016-08-121-6/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable options commonly needed by popular virtualization and container applications. Use modules when possible to avoid too much overhead for users not interested. - add namespace and cgroup options needed - add seccomp - optional, but enhances Qemu etc - bridge, nat, veth, macvtap and multicast for routing guests and containers - btfrs and overlayfs modules for container COW backends - while near it, make fuse a module instead of built-in. Generated with make saveconfig and dropping unrelated spurious change hunks while commiting. bloat-o-meter old-vmlinux vmlinux: add/remove: 905/390 grow/shrink: 767/229 up/down: 183513/-94861 (88652) .... Total: Before=10515408, After=10604060, chg +0.84% Signed-off-by: Riku Voipio <riku.voipio@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | arm64: hibernate: handle allocation failuresMark Rutland2016-08-121-27/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In create_safe_exec_page(), we create a copy of the hibernate exit text, along with some page tables to map this via TTBR0. We then install the new tables in TTBR0. In swsusp_arch_resume() we call create_safe_exec_page() before trying a number of operations which may fail (e.g. copying the linear map page tables). If these fail, we bail out of swsusp_arch_resume() and return an error code, but leave TTBR0 as-is. Subsequently, the core hibernate code will call free_basic_memory_bitmaps(), which will free all of the memory allocations we made, including the page tables installed in TTBR0. Thus, we may have TTBR0 pointing at dangling freed memory for some period of time. If the hibernate attempt was triggered by a user requesting a hibernate test via the reboot syscall, we may return to userspace with the clobbered TTBR0 value. Avoid these issues by reorganising swsusp_arch_resume() such that we have no failure paths after create_safe_exec_page(). We also add a check that the zero page allocation succeeded, matching what we have for other allocations. Fixes: 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: James Morse <james.morse@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # 4.7+ Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | arm64: hibernate: avoid potential TLB conflictMark Rutland2016-08-121-6/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In create_safe_exec_page we install a set of global mappings in TTBR0, then subsequently invalidate TLBs. While TTBR0 points at the zero page, and the TLBs should be free of stale global entries, we may have stale ASID-tagged entries (e.g. from the EFI runtime services mappings) for the same VAs. Per the ARM ARM these ASID-tagged entries may conflict with newly-allocated global entries, and we must follow a Break-Before-Make approach to avoid issues resulting from this. This patch reworks create_safe_exec_page to invalidate TLBs while the zero page is still in place, ensuring that there are no potential conflicts when the new TTBR0 value is installed. As a single CPU is online while this code executes, we do not need to perform broadcast TLB maintenance, and can call local_flush_tlb_all(), which also subsumes some barriers. The remaining assembly is converted to use write_sysreg() and isb(). Other than this, we safely manipulate TTBRs in the hibernate dance. The code we install as part of the new TTBR0 mapping (the hibernated kernel's swsusp_arch_suspend_exit) installs a zero page into TTBR1, invalidates TLBs, then installs its preferred value. Upon being restored to the middle of swsusp_arch_suspend, the new image will call __cpu_suspend_exit, which will call cpu_uninstall_idmap, installing the zero page in TTBR0 and invalidating all TLB entries. Fixes: 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # 4.7+ Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | arm64: Handle el1 synchronous instruction aborts cleanlyLaura Abbott2016-08-122-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Executing from a non-executable area gives an ugly message: lkdtm: Performing direct entry EXEC_RODATA lkdtm: attempting ok execution at ffff0000084c0e08 lkdtm: attempting bad execution at ffff000008880700 Bad mode in Synchronous Abort handler detected on CPU2, code 0x8400000e -- IABT (current EL) CPU: 2 PID: 998 Comm: sh Not tainted 4.7.0-rc2+ #13 Hardware name: linux,dummy-virt (DT) task: ffff800077e35780 ti: ffff800077970000 task.ti: ffff800077970000 PC is at lkdtm_rodata_do_nothing+0x0/0x8 LR is at execute_location+0x74/0x88 The 'IABT (current EL)' indicates the error but it's a bit cryptic without knowledge of the ARM ARM. There is also no indication of the specific address which triggered the fault. The increase in kernel page permissions makes hitting this case more likely as well. Handling the case in the vectors gives a much more familiar looking error message: lkdtm: Performing direct entry EXEC_RODATA lkdtm: attempting ok execution at ffff0000084c0840 lkdtm: attempting bad execution at ffff000008880680 Unable to handle kernel paging request at virtual address ffff000008880680 pgd = ffff8000089b2000 [ffff000008880680] *pgd=00000000489b4003, *pud=0000000048904003, *pmd=0000000000000000 Internal error: Oops: 8400000e [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 997 Comm: sh Not tainted 4.7.0-rc1+ #24 Hardware name: linux,dummy-virt (DT) task: ffff800077f9f080 ti: ffff800008a1c000 task.ti: ffff800008a1c000 PC is at lkdtm_rodata_do_nothing+0x0/0x8 LR is at execute_location+0x74/0x88 Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | arm64: Remove stack duplicating code from jprobesDavid A. Long2016-08-112-28/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because the arm64 calling standard allows stacked function arguments to be anywhere in the stack frame, do not attempt to duplicate the stack frame for jprobes handler functions. Documentation changes to describe this issue have been broken out into a separate patch in order to simultaneously address them in other architecture(s). Signed-off-by: David A. Long <dave.long@linaro.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | drivers/perf: arm-pmu: Fix handling of SPI lacking "interrupt-affinity" propertyMarc Zyngier2016-08-091-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch 19a469a58720 ("drivers/perf: arm-pmu: Handle per-interrupt affinity mask") added support for partitionned PPI setups, but inadvertently broke setups using SPIs without the "interrupt-affinity" property (which is the case for UP platforms). This patch restore the broken functionnality by testing whether the interrupt is percpu or not instead of relying on the using_spi flag that really means "SPI *and* interrupt-affinity property". Acked-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 19a469a58720 ("drivers/perf: arm-pmu: Handle per-interrupt affinity mask") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
| * | drivers/perf: arm-pmu: convert arm_pmu_mutex to spinlockSudeep Holla2016-08-091-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arm_pmu_mutex is never held long and we don't want to sleep while the lock is being held as it's executed in the context of hotplug notifiers. So it can be converted to a simple spinlock instead. Without this patch we get the following warning: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/2 no locks held by swapper/2/0. irq event stamp: 381314 hardirqs last enabled at (381313): _raw_spin_unlock_irqrestore+0x7c/0x88 hardirqs last disabled at (381314): cpu_die+0x28/0x48 softirqs last enabled at (381294): _local_bh_enable+0x28/0x50 softirqs last disabled at (381293): irq_enter+0x58/0x78 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.7.0 #12 Call trace: dump_backtrace+0x0/0x220 show_stack+0x24/0x30 dump_stack+0xb4/0xf0 ___might_sleep+0x1d8/0x1f0 __might_sleep+0x5c/0x98 mutex_lock_nested+0x54/0x400 arm_perf_starting_cpu+0x34/0xb0 cpuhp_invoke_callback+0x88/0x3d8 notify_cpu_starting+0x78/0x98 secondary_start_kernel+0x108/0x1a8 This patch converts the mutex to spinlock to eliminate the above warnings. This constraints pmu->reset to be non-blocking call which is the case with all the ARM PMU backends. Cc: Stephen Boyd <sboyd@codeaurora.org> Fixes: 37b502f121ad ("arm/perf: Fix hotplug state machine conversion") Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
| * | arm64: Support hard limit of cpu count by nr_cpusKefeng Wang2016-08-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable the hard limit of cpu count by set boot options nr_cpus=x on arm64, and make a minor change about message when total number of cpu exceeds the limit. Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reported-by: Shiyuan Hu <hushiyuan@huawei.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2016-08-138-53/+118
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Radim Krčmář: "KVM: - lock kvm_device list to prevent corruption on device creation. PPC: - split debugfs initialization from creation of the xics device to unlock the newly taken kvm lock earlier. s390: - prevent userspace from triggering two WARN_ON_ONCE. MIPS: - fix several issues in the management of TLB faults (Cc: stable)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: MIPS: KVM: Propagate kseg0/mapped tlb fault errors MIPS: KVM: Fix gfn range check in kseg0 tlb faults MIPS: KVM: Add missing gfn range check MIPS: KVM: Fix mapped fault broken commpage handling KVM: Protect device ops->create and list_add with kvm->lock KVM: PPC: Move xics_debugfs_init out of create KVM: s390: reset KVM_REQ_MMU_RELOAD if mapping the prefix failed KVM: s390: set the prefix initially properly
| * \ \ Merge tag 'kvm-s390-master-4.8-1' of ↵Radim Krčmář2016-08-121-1/+4
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux KVM: s390: Fixes for 4.8 (via kvm/master) Here are two fixes found by fuzzing of the ioctl interface. Both cases can trigger a WARN_ON_ONCE from user space.
| | * | | KVM: s390: reset KVM_REQ_MMU_RELOAD if mapping the prefix failedJulius Niedworok2016-08-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When triggering KVM_RUN without a user memory region being mapped (KVM_SET_USER_MEMORY_REGION) a validity intercept occurs. This could happen, if the user memory region was not mapped initially or if it was unmapped after the vcpu is initialized. The function kvm_s390_handle_requests checks for the KVM_REQ_MMU_RELOAD bit. The check function always clears this bit. If gmap_mprotect_notify returns an error code, the mapping failed, but the KVM_REQ_MMU_RELOAD was not set anymore. So the next time kvm_s390_handle_requests is called, the execution would fall trough the check for KVM_REQ_MMU_RELOAD. The bit needs to be resetted, if gmap_mprotect_notify returns an error code. Resetting the bit with kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu) fixes the bug. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Julius Niedworok <jniedwor@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| | * | | KVM: s390: set the prefix initially properlyJulius Niedworok2016-08-121-0/+1
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When KVM_RUN is triggered on a VCPU without an initial reset, a validity intercept occurs. Setting the prefix will set the KVM_REQ_MMU_RELOAD bit initially, thus preventing the bug. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Julius Niedworok <jniedwor@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * | | MIPS: KVM: Propagate kseg0/mapped tlb fault errorsJames Hogan2016-08-122-12/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Propagate errors from kvm_mips_handle_kseg0_tlb_fault() and kvm_mips_handle_mapped_seg_tlb_fault(), usually triggering an internal error since they normally indicate the guest accessed bad physical memory or the commpage in an unexpected way. Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Fixes: e685c689f3a8 ("KVM/MIPS32: Privileged instruction/target branch emulation.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | MIPS: KVM: Fix gfn range check in kseg0 tlb faultsJames Hogan2016-08-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two consecutive gfns are loaded into host TLB, so ensure the range check isn't off by one if guest_pmap_npages is odd. Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | MIPS: KVM: Add missing gfn range checkJames Hogan2016-08-121-8/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_mips_handle_mapped_seg_tlb_fault() calculates the guest frame number based on the guest TLB EntryLo values, however it is not range checked to ensure it lies within the guest_pmap. If the physical memory the guest refers to is out of range then dump the guest TLB and emit an internal error. Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | MIPS: KVM: Fix mapped fault broken commpage handlingJames Hogan2016-08-121-21/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_mips_handle_mapped_seg_tlb_fault() appears to map the guest page at virtual address 0 to PFN 0 if the guest has created its own mapping there. The intention is unclear, but it may have been an attempt to protect the zero page from being mapped to anything but the comm page in code paths you wouldn't expect from genuine commpage accesses (guest kernel mode cache instructions on that address, hitting trapping instructions when executing from that address with a coincidental TLB eviction during the KVM handling, and guest user mode accesses to that address). Fix this to check for mappings exactly at KVM_GUEST_COMMPAGE_ADDR (it may not be at address 0 since commit 42aa12e74e91 ("MIPS: KVM: Move commpage so 0x0 is unmapped")), and set the corresponding EntryLo to be interpreted as 0 (invalid). Fixes: 858dd5d45733 ("KVM/MIPS32: MMU/TLB operations for the Guest.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | KVM: Protect device ops->create and list_add with kvm->lockChristoffer Dall2016-08-125-17/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM devices were manipulating list data structures without any form of synchronization, and some implementations of the create operations also suffered from a lack of synchronization. Now when we've split the xics create operation into create and init, we can hold the kvm->lock mutex while calling the create operation and when manipulating the devices list. The error path in the generic code gets slightly ugly because we have to take the mutex again and delete the device from the list, but holding the mutex during anon_inode_getfd or releasing/locking the mutex in the common non-error path seemed wrong. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | KVM: PPC: Move xics_debugfs_init out of createChristoffer Dall2016-08-123-2/+17
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As we are about to hold the kvm->lock during the create operation on KVM devices, we should move the call to xics_debugfs_init into its own function, since holding a mutex over extended amounts of time might not be a good idea. Introduce an init operation on the kvm_device_ops struct which cannot fail and call this, if configured, after the device has been created. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2016-08-1310-85/+160
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - an NVMe fix from Gabriel, fixing a suspend/resume issue on some setups - addition of a few missing entries in the block queue sysfs documentation, from Joe - a fix for a sparse shadow warning for the bvec iterator, from Johannes - a writeback deadlock involving raid issuing barriers, and not flushing the plug when we wakeup the flusher threads. From Konstantin - a set of patches for the NVMe target/loop/rdma code, from Roland and Sagi * 'for-linus' of git://git.kernel.dk/linux-block: bvec: avoid variable shadowing warning doc: update block/queue-sysfs.txt entries nvme: Suspend all queues before deletion mm, writeback: flush plugged IO in wakeup_flusher_threads() nvme-rdma: Remove unused includes nvme-rdma: start async event handler after reconnecting to a controller nvmet: Fix controller serial number inconsistency nvmet-rdma: Don't use the inline buffer in order to avoid allocation for small reads nvmet-rdma: Correctly handle RDMA device hot removal nvme-rdma: Make sure to shutdown the controller if we can nvme-loop: Remove duplicate call to nvme_remove_namespaces nvme-rdma: Free the I/O tags when we delete the controller nvme-rdma: Remove duplicate call to nvme_remove_namespaces nvme-rdma: Fix device removal handling nvme-rdma: Queue ns scanning after a sucessful reconnection nvme-rdma: Don't leak uninitialized memory in connect request private data
| * | bvec: avoid variable shadowing warningJohannes Berg2016-08-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to the (indirect) nesting of min(..., min(...)), sparse will show a variable shadowing warning whenever bvec.h is included. Avoid that by assigning the inner min() to a temporary variable first. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | doc: update block/queue-sysfs.txt entriesJoe Lawrence2016-08-111-0/+18
| | | | | | | | | | | | | | | | | | | | | Add descriptions for dax, io_poll, and write_same_max_bytes files. Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme: Suspend all queues before deletionGabriel Krisman Bertazi2016-08-111-12/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When nvme_delete_queue fails in the first pass of the nvme_disable_io_queues() loop, we return early, failing to suspend all of the IO queues. Later, on the nvme_pci_disable path, this causes us to disable MSI without actually having freed all the IRQs, which triggers the BUG_ON in free_msi_irqs(), as show below. This patch refactors nvme_disable_io_queues to suspend all queues before start submitting delete queue commands. This way, we ensure that we have at least returned every IRQ before continuing with the removal path. [ 487.529200] kernel BUG at ../drivers/pci/msi.c:368! cpu 0x46: Vector: 700 (Program Check) at [c0000078c5b83650] pc: c000000000627a50: free_msi_irqs+0x90/0x200 lr: c000000000627a40: free_msi_irqs+0x80/0x200 sp: c0000078c5b838d0 msr: 9000000100029033 current = 0xc0000078c5b40000 paca = 0xc000000002bd7600 softe: 0 irq_happened: 0x01 pid = 1376, comm = kworker/70:1H kernel BUG at ../drivers/pci/msi.c:368! Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #104 SMP Fri Jul 29 09:20:17 CDT 2016 enter ? for help [c0000078c5b83920] d0000000363b0cd8 nvme_dev_disable+0x208/0x4f0 [nvme] [c0000078c5b83a10] d0000000363b12a4 nvme_timeout+0xe4/0x250 [nvme] [c0000078c5b83ad0] c0000000005690e4 blk_mq_rq_timed_out+0x64/0x110 [c0000078c5b83b40] c00000000056c930 bt_for_each+0x160/0x170 [c0000078c5b83bb0] c00000000056d928 blk_mq_queue_tag_busy_iter+0x78/0x110 [c0000078c5b83c00] c0000000005675d8 blk_mq_timeout_work+0xd8/0x1b0 [c0000078c5b83c50] c0000000000e8cf0 process_one_work+0x1e0/0x590 [c0000078c5b83ce0] c0000000000e9148 worker_thread+0xa8/0x660 [c0000078c5b83d80] c0000000000f2090 kthread+0x110/0x130 [c0000078c5b83e30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Cc: Brian King <brking@linux.vnet.ibm.com> Cc: Keith Busch <keith.busch@intel.com> Cc: linux-nvme@lists.infradead.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * | mm, writeback: flush plugged IO in wakeup_flusher_threads()Konstantin Khlebnikov2016-08-101-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've found funny live-lock between raid10 barriers during resync and memory controller hard limits. Inside mpage_readpages() task holds on to its plug bio which blocks the barrier in raid10. Its memory cgroup have no free memory thus the task goes into reclaimer but all reclaimable pages are dirty and cannot be written because raid10 is rebuilding and stuck on the barrier. Common flush of such IO in schedule() never happens, because the caller doesn't go to sleep. Lock is 'live' because changing memory limit or killing tasks which holds that stuck bio unblock whole progress. That was what happened in 3.18.x but I see no difference in upstream logic. Theoretically this might happen even without memory cgroup. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | Merge branch 'nvmf-4.8-rc' of git://git.infradead.org/nvme-fabrics into ↵Jens Axboe2016-08-086-72/+126
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for-linus Sagi writes: Mostly stability fixes for nvmet, rdma: - fix uninitialized rdma_cm private data from Roland. - rdma device removal handling (host and target). - fix controller disconnect during active mounts. - fix namespaces lost after fabric reconnects. - remove redundant calls to namespace removal (rdma, loop). - actually send controller shutdown when disconnecting. - reconnect fixes (ns rescan and aen requeue) - nvmet controller serial number inconsistency fix.
| | * nvme-rdma: Remove unused includesSagi Grimberg2016-08-041-3/+0
| | | | | | | | | | | | | | | Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Steve Wise <swise@opengridcomputing.com>
| | * nvme-rdma: start async event handler after reconnecting to a controllerSagi Grimberg2016-08-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we reset or reconnect to a controller, we are cancelling the async event handler so we can safely re-establish resources, but we need to remember to start it again when we successfully reconnect. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de>
| | * nvmet: Fix controller serial number inconsistencySagi Grimberg2016-08-043-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | The host is allowed to issue identify as many times as it wants, we need to stay consistent when reporting the serial number for a given controller. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de>