Commit message [Author, Age, Files, Lines]
* nvdimm: Replace lockdep_mutex with local lock classes [Dan Williams, 2022-04-28, 8 files, -15/+37]
| | | | | | | | | | | | | | | | | | | | | In response to an attempt to expand dev->lockdep_mutex for device_lock() validation [1], Peter points out [2] that the lockdep API already has the ability to assign a dedicated lock class per subsystem device-type. Use lockdep_set_class() to override the default device_lock() '__lockdep_no_validate__' class for each NVDIMM subsystem device-type. This enables lockdep to detect deadlocks and recursive locking within the device-driver core and the subsystem. Link: https://lore.kernel.org/r/164982968798.684294.15817853329823976469.stgit@dwillia2-desk3.amr.corp.intel.com [1] Link: https://lore.kernel.org/r/Ylf0dewci8myLvoW@hirez.programming.kicks-ass.net [2] Suggested-by: Peter Zijlstra <peterz@infradead.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/165055520896.3745911.8021255583475547548.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl: Drop cxl_device_lock() [Dan Williams, 2022-04-28, 6 files, -126/+33]
| | | | | | | | | | | | | | Now that all CXL subsystem locking is validated with custom lock classes, there is no need for the custom usage of the lockdep_mutex. Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/165055520383.3745911.53447786039115271.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/acpi: Add root device lockdep validation [Dan Williams, 2022-04-28, 3 files, -1/+61]
| The CXL "root" device, ACPI0017, is an attach point for coordinating platform level CXL resources and is the parent device for a CXL port topology tree. As such it has distinct locking rules relative to other CXL subsystem objects, but because it is an ACPI device the lock class is established well before it is given to the cxl_acpi driver. However, the lockdep API does support changing the lock class "live" for situations like this. Add a device_lock_set_class() helper that a driver can use in ->probe() to set a custom lock class, and device_lock_reset_class() to return to the default "no validate" class before the custom lock class key goes out of scope after ->remove(). Note the helpers are all macros to support dead code elimination in the CONFIG_PROVE_LOCKING=n case; however, device_lock_set_class() still needs #ifdef CONFIG_PROVE_LOCKING since lockdep_match_class() explicitly does not have a helper in the CONFIG_PROVE_LOCKING=n case (see comment in lockdep.h). The lockdep API needs 2 small tweaks to prevent "unused" warnings for the @key argument to lock_set_class(), and a new lock_set_novalidate_class() is added to supplement lockdep_set_novalidate_class() in the cases where the lock class is converted while the lock is held. Suggested-by: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/165100081305.1528964.11138612430659737238.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl: Replace lockdep_mutex with local lock classes [Dan Williams, 2022-04-28, 3 files, -4/+18]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In response to an attempt to expand dev->lockdep_mutex for device_lock() validation [1], Peter points out [2] that the lockdep API already has the ability to assign a dedicated lock class per subsystem device-type. Use lockdep_set_class() to override the default device_lock() '__lockdep_no_validate__' class for each CXL subsystem device-type. This enables lockdep to detect deadlocks and recursive locking within the device-driver core and the subsystem. The lockdep_set_class_and_subclass() API is used for port objects that recursively lock the 'cxl_port_key' class by hierarchical topology depth. Link: https://lore.kernel.org/r/164982968798.684294.15817853329823976469.stgit@dwillia2-desk3.amr.corp.intel.com [1] Link: https://lore.kernel.org/r/Ylf0dewci8myLvoW@hirez.programming.kicks-ass.net [2] Suggested-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/165055519317.3745911.7342499516839702840.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* PCI/ACPI: negotiate CXL _OSC [Vishal Verma, 2022-04-28, 3 files, -22/+188]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add full support for negotiating _OSC as defined in the CXL 2.0 spec, as applicable to CXL-enabled platforms. Advertise support for the CXL features we support - 'CXL 2.0 port/device register access', 'Protocol Error Reporting', and 'CXL Native Hot Plug'. Request control for 'CXL Memory Error Reporting'. The requests are dependent on CONFIG_* based prerequisites, and prior PCI enabling, similar to how the standard PCI _OSC bits are determined. The CXL specification does not define any additional constraints on the hotplug flow beyond PCIe native hotplug, so a kernel that supports native PCIe hotplug, supports CXL hotplug. For error handling protocol and link errors just use PCIe AER. There is nascent support for amending AER events with CXL specific status [1], but there's otherwise no additional OS responsibility for CXL errors beyond PCIe AER. CXL Memory Errors behave the same as typical memory errors so CONFIG_MEMORY_FAILURE is sufficient to indicate support to platform firmware. [1]: https://lore.kernel.org/linux-cxl/164740402242.3912056.8303625392871313860.stgit@dwillia2-desk3.amr.corp.intel.com/ Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Robert Moore <robert.moore@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220413073618.291335-4-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* PCI/ACPI: Prefer CXL _OSC instead of PCIe _OSC for CXL host bridges [Dan Williams, 2022-04-28, 3 files, -14/+63]
| In preparation for negotiating OS control of CXL _OSC features, do the minimal enabling to use CXL _OSC to handle the base PCIe feature negotiation. Recall that CXL _OSC is a super-set of PCIe _OSC and the CXL 2.0 specification mandates: "If a CXL Host Bridge device exposes CXL _OSC, CXL aware OSPM shall evaluate CXL _OSC and not evaluate PCIe _OSC." Rather than pass a boolean flag alongside @root to all the helper functions that need to consider PCIe specifics, add is_pcie() and is_cxl() helper functions to check the flavor of @root. This also allows for dynamic fallback to PCIe _OSC in cases where an attempt to use CXL _OSC fails. This can happen on CXL 1.1 platforms that publish ACPI0016 devices to indicate CXL host bridges, but do not publish the optional CXL _OSC method. CXL _OSC is mandatory for CXL 2.0 hosts. Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Robert Moore <robert.moore@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220413073618.291335-3-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* PCI/ACPI: add a helper for retrieving _OSC Control DWORDs [Vishal Verma, 2022-04-28, 3 files, -2/+15]
| During _OSC negotiation, when the 'Control' DWORD is needed from the result buffer after running _OSC, a couple of places performed manual pointer arithmetic to offset into the right spot in the raw buffer. Add an acpi_osc_ctx_get_pci_control() helper to use the #define'd DWORD offsets to fetch the DWORDs needed from @acpi_osc_context, and replace the above instances of the open-coded arithmetic. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Suggested-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220413073618.291335-2-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
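The idea of replacing open-coded pointer arithmetic with a named-offset accessor can be sketched in userspace C. This is only an illustration: the struct, the enum values, and the function names below (all prefixed `demo_`) are hypothetical and are not the kernel's definitions; the kernel helper named in the commit is acpi_osc_ctx_get_pci_control().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DWORD indices into a result buffer (assumed for this sketch). */
enum {
	DEMO_OSC_QUERY_DWORD = 0,
	DEMO_OSC_SUPPORT_DWORD = 1,
	DEMO_OSC_CONTROL_DWORD = 2,
};

/* Hypothetical stand-in for a raw result buffer descriptor. */
struct demo_osc_result {
	void *pointer;   /* start of the raw buffer */
	size_t length;   /* buffer size in bytes */
};

/* Fetch one 32-bit DWORD by index; returns 0 if the buffer is too short. */
static uint32_t demo_osc_get_dword(const struct demo_osc_result *res,
				   size_t index)
{
	uint32_t value = 0;

	if ((index + 1) * sizeof(value) > res->length)
		return 0;
	memcpy(&value,
	       (const unsigned char *)res->pointer + index * sizeof(value),
	       sizeof(value));
	return value;
}

/* Named accessor so callers never compute the offset by hand. */
static uint32_t demo_osc_get_control(const struct demo_osc_result *res)
{
	return demo_osc_get_dword(res, DEMO_OSC_CONTROL_DWORD);
}
```

Callers then write `demo_osc_get_control(&res)` instead of repeating the offset arithmetic at each site, which keeps the offset definition in one place.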
* cxl/mbox: fix logical vs bitwise typo [Dan Carpenter, 2022-04-28, 1 file, -1/+1]
| | | | | | | | | This should be bitwise & instead of &&. Fixes: 6179045ccc0c ("cxl/mbox: Block immediate mode in SET_PARTITION_INFO command") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/YmpgkbbQ1Yxu36uO@kili Signed-off-by: Dan Williams <dan.j.williams@intel.com>
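A minimal userspace sketch of why `&&` and `&` differ when testing a flag bit. The function names and the mask values are hypothetical and are not taken from the CXL mailbox code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy form: logical AND only asks "are both operands non-zero?",
 * so any non-zero flags value appears to have the bit set.
 */
static bool demo_flag_set_buggy(uint8_t flags, uint8_t mask)
{
	return flags && mask;
}

/* Intended form: bitwise AND tests whether the mask bit itself is set. */
static bool demo_flag_set(uint8_t flags, uint8_t mask)
{
	return (flags & mask) != 0;
}
```

With flags `0x2` and mask `0x1`, the buggy form reports the bit as set while the bitwise form correctly reports it as clear.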
* cxl/mbox: Replace NULL check with IS_ERR() after vmemdup_user() [Alison Schofield, 2022-04-23, 1 file, -1/+1]
| | | | | | | | | | | | vmemdup_user() returns an ERR_PTR() on failure. Use IS_ERR() to check the return value. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20220407010915.1211258-1-alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
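The difference between a NULL check and an error-pointer check can be shown with a small userspace emulation. Everything here is an assumption made for illustration: the `demo_` helpers only mimic the kernel's ERR_PTR()/IS_ERR() convention, and `demo_memdup()` is a hypothetical stand-in, not vmemdup_user().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (emulation of ERR_PTR()). */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer (emulation of PTR_ERR()). */
static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the top DEMO_MAX_ERRNO addresses. */
static inline bool demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical duplicate helper that never returns NULL on failure. */
static void *demo_memdup(const void *src, size_t len)
{
	void *copy;

	if (!src)
		return demo_err_ptr(-EFAULT);
	copy = malloc(len ? len : 1);
	if (!copy)
		return demo_err_ptr(-ENOMEM);
	memcpy(copy, src, len);
	return copy;
}
```

A caller that tests `if (!p)` after `demo_memdup()` never sees the failure, because the error pointer is non-NULL; testing `demo_is_err(p)` and propagating `demo_ptr_err(p)` is the pattern the commit switches to.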
* cxl/mbox: Use type __u32 for mailbox payload sizes [Alison Schofield, 2022-04-23, 2 files, -20/+22]
| Payload sizes for mailbox commands are expected to be positive values coming from userspace. The documentation correctly describes these as always unsigned values. The mailbox and send structures that support the mailbox commands, however, use __s32 types for the payloads. Replace __s32 with __u32 in the mailbox and send command structures and update usages. Kernel users of the interface already block all negative values and there is no known ability for userspace to have grown a dependency on submitting negative values to the kernel. The known user of the IOCTL, the CXL command line interface (cxl-cli), already enforces positive size values. A Smatch signedness warning uncovered this issue. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Link: https://lore.kernel.org/r/20220414051246.1244575-1-alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
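The class of problem a signedness checker looks for can be sketched as follows. This is a generic illustration with hypothetical function names, not the mailbox code itself (the commit notes that the kernel already blocked negative values).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With a signed size, an upper-bound check alone lets negative values
 * through: -1 <= max is true for any non-negative max.
 */
static bool demo_size_ok_signed(int32_t size, int32_t max)
{
	return size <= max;
}

/*
 * With an unsigned size, the same bit pattern is a very large number,
 * so the single upper-bound check rejects it.
 */
static bool demo_size_ok_unsigned(uint32_t size, uint32_t max)
{
	return size <= max;
}
```

Using an unsigned type makes the single bound check sufficient and matches the documented meaning of a payload size.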
* PM: CXL: Disable suspend [Dan Williams, 2022-04-23, 11 files, -6/+79]
| The CXL specification claims S3 support at a hardware level, but at a system software level there are some missing pieces. Section 9.4 (CXL 2.0) rightly claims that "CXL mem adapters may need aux power to retain memory context across S3", but there is no enumeration mechanism for the OS to determine if a given adapter has that support. Moreover the save state and resume image for the system may inadvertently end up in a CXL device that needs to be restored before the save state is recoverable. I.e. a circular dependency that is not resolvable without a third party save-area. Arrange for the cxl_mem driver to fail S3 attempts. This still nominally allows for suspend, but requires unbinding all CXL memory devices before the suspend to ensure the typical DRAM flow is taken. The cxl_mem unbind flow is intended to also tear down all CXL memory regions associated with a given cxl_memdev. It is reasonable to assume that any device participating in a System RAM range published in the EFI memory map is covered by aux power and save-area outside the device itself. So this restriction can be minimized in the future once pre-existing region enumeration support arrives, and perhaps a spec update to clarify if the EFI memory map is sufficient for determining the range of devices managed by platform-firmware for S3 support. Per Rafael, if the CXL configuration prevents suspend then it should fail early before tasks are frozen, and mem_sleep should stop showing 'mem' as an option [1]. Effectively CXL augments the platform suspend ->valid() op since, for example, the ACPI ops are not aware of the CXL / PCI dependencies. Given the split role of platform firmware vs OS provisioned CXL memory it is up to the cxl_mem driver to determine if the CXL configuration has elements that platform firmware may not be prepared to restore. Link: https://lore.kernel.org/r/CAJZ5v0hGVN_=3iU8OLpHY3Ak35T5+JcBM-qs8SbojKrpd0VXsA@mail.gmail.com [1] Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <len.brown@intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/165066828317.3907920.5690432272182042556.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mem: Replace redundant debug message with a comment [Dan Williams, 2022-04-13, 1 file, -4/+10]
| | | | | | | | | | | | | | | | cxl_mem_probe() already emits a log message when HDM operation can not be established. Delete the similar one in cxl_hdm_decode_init(). What is less obvious is why global_ctrl being enabled makes positive values of info->ranges irrelevant, and the Linux behavior with respect to the spec recommendation to mirror CXL Range registers with HDM Decoder Base + Size registers. Cc: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/164944616743.454665.7055846627973202403.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mem: Rename cxl_dvsec_decode_init() to cxl_hdm_decode_init() [Dan Williams, 2022-04-13, 2 files, -7/+7]
| | | | | | | | | | | | | cxl_dvsec_decode_init() is tasked with checking whether legacy DVSEC range based decode is in effect, or whether HDM can be enabled / already is enabled. As such it either succeeds or fails and that result is the return value. The @do_hdm_init variable is misleading in the case where HDM operation is already found to be active, so just call it @retval. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/164730736435.3806189.2537160791687837469.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/pci: Make cxl_dvsec_ranges() failure not fatal to cxl_pci [Dan Williams, 2022-04-13, 1 file, -9/+18]
| | | | | | | | | | | | | | | | | | | | | | cxl_dvsec_ranges(), the helper for enumerating the presence of an active legacy CXL.mem configuration on a CXL 2.0 Memory Expander, is not fatal for cxl_pci because there is still value to enable mailbox operations even if CXL.mem operation is disabled. Recall that the reason cxl_pci does this initialization and not cxl_mem is to preserve the useful property (for unit testing) that cxl_mem is cxl_memdev + mmio generic, and does not require access to a 'struct pci_dev' to issue config cycles. Update 'struct cxl_endpoint_dvsec_info' to carry either a positive number of non-zero size legacy CXL DVSEC ranges, or the negative error code from __cxl_dvsec_ranges() in its @ranges member. Reported-by: Krzysztof Zach <krzysztof.zach@intel.com> Fixes: 560f78559006 ("cxl/pci: Retrieve CXL DVSEC memory info") Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/164730735869.3806189.4032428192652531946.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
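The convention of one integer member carrying either a non-negative count or a negative error code can be sketched as below. The struct and helper names are hypothetical; only the idea that a negative value in the @ranges member represents an error comes from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in: count of non-zero legacy ranges, or negative errno. */
struct demo_dvsec_info {
	int ranges;
};

/* Return the stored error code if enumeration failed, otherwise 0. */
static int demo_ranges_error(const struct demo_dvsec_info *info)
{
	return info->ranges < 0 ? info->ranges : 0;
}

/* True only when enumeration succeeded and found at least one range. */
static bool demo_has_legacy_ranges(const struct demo_dvsec_info *info)
{
	return info->ranges > 0;
}
```

One consumer can then keep going after a failure while another treats the same negative value as fatal, without adding a separate status field.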
* cxl/mem: Make cxl_dvsec_range() init failure fatal [Dan Williams, 2022-04-13, 1 file, -0/+3]
| In preparation for the cxl_pci driver to continue operation after cxl_dvsec_range() failure, update cxl_mem to check for negative error codes in info->ranges. Treat that condition as fatal regardless of the state of the HDM configuration since cxl_mem needs positive confirmation that legacy ranges were not established by platform firmware or another agent. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/164730735324.3806189.4167509857771192422.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/pci: Add debug for DVSEC range init failures [Dan Williams, 2022-04-13, 1 file, -3/+10]
| | | | | | | | | | | | | | | | In preparation for not treating DVSEC range initialization failures as fatal to cxl_pci_probe() add individual dev_dbg() statements for each of the major failure reasons in cxl_dvsec_ranges(). The rationale for cxl_dvsec_ranges() failure not being fatal is that there is still value for cxl_pci to enable mailbox operations even if CXL.mem operation is disabled. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/164730734812.3806189.2726330688692684104.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mem: Drop DVSEC vs EFI Memory Map sanity check [Dan Williams, 2022-04-13, 1 file, -23/+1]
| When the driver finds legacy DVSEC ranges active on a CXL Memory Expander it indicates that platform firmware is not aware of, or is deliberately disabling, common CXL 2.0 operation. In this case Linux generally has no choice but to leave the device alone. The driver attempts to validate that the DVSEC range is in the EFI memory map. Remove that logic since there is no requirement that the BIOS publish DVSEC ranges in the EFI Memory Map. In the future the driver will want to permanently reserve this capacity out of the available CFMWS capacity and hide it from request_free_mem_region(), but it serves no purpose to warn about the range not appearing in the EFI Memory Map. Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Link: https://lore.kernel.org/r/164730734246.3806189.13995924771963139898.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Use new return_code handling [Davidlohr Bueso, 2022-04-13, 2 files, -3/+3]
| Use the global cxl_mbox_cmd_rc table to improve debug messaging in __cxl_pci_mbox_send_cmd() and allow cxl_mbox_send_cmd() to map to proper kernel style errno codes - this patch continues to use -ENXIO only so no change in semantics. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Link: https://lore.kernel.org/r/20220404021216.66841-5-dave@stgolabs.net Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Improve handling of mbox_cmd hw return codes [Davidlohr Bueso, 2022-04-13, 3 files, -3/+54]
| Upon a completed command the caller is still expected to check the actual return_code register to ensure it succeeded. This adds, per the spec, the potential command return codes. It maps the hardware return code to the kernel's errno style, and by default continues to use -ENXIO (Command completed, but device reported an error). Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Link: https://lore.kernel.org/r/20220404021216.66841-4-dave@stgolabs.net Signed-off-by: Dan Williams <dan.j.williams@intel.com>
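A table-driven mapping from device return codes to errno values, with -ENXIO as the default, might look like the sketch below. The table contents, code numbers, and all `demo_` names are assumptions made for illustration and are not copied from the kernel's cxl_mbox_cmd_rc table.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per hardware return code: errno to report plus a description. */
struct demo_mbox_rc {
	int err;
	const char *desc;
};

/* Illustrative subset; every failure maps to -ENXIO, as the commit describes. */
static const struct demo_mbox_rc demo_rc_table[] = {
	[0] = { 0,      "success" },
	[1] = { -ENXIO, "background command started" },
	[2] = { -ENXIO, "invalid input" },
	[3] = { -ENXIO, "unsupported" },
};

#define DEMO_RC_COUNT (sizeof(demo_rc_table) / sizeof(demo_rc_table[0]))

/* Translate a hardware return code; unknown codes also yield -ENXIO. */
static int demo_rc_to_errno(uint16_t rc)
{
	if (rc >= DEMO_RC_COUNT)
		return -ENXIO;
	return demo_rc_table[rc].err;
}

/* Human-readable text for debug messages. */
static const char *demo_rc_to_desc(uint16_t rc)
{
	if (rc >= DEMO_RC_COUNT)
		return "unknown return code";
	return demo_rc_table[rc].desc;
}
```

Keeping the errno in the table means individual codes can later be given more specific errno values without touching the callers.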
* cxl/pci: Use CXL_MBOX_SUCCESS to check against mbox_cmd return code [Davidlohr Bueso, 2022-04-13, 1 file, -2/+2]
| Also mention the need for the caller to check against any errors from the hardware in return_code. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Link: https://lore.kernel.org/r/20220404021216.66841-3-dave@stgolabs.net Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Drop mbox_mutex comment [Davidlohr Bueso, 2022-04-13, 1 file, -1/+1]
| ... we have lockdep for this. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Link: https://lore.kernel.org/r/20220404021216.66841-2-dave@stgolabs.net Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/pmem: Remove CXL SET_PARTITION_INFO from exclusive_cmds list [Alison Schofield, 2022-04-13, 1 file, -1/+0]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | With SET_PARTITION_INFO on the exclusive_cmds list for the CXL_PMEM driver, userspace cannot execute a set-partition command without first unbinding the pmem driver from the device. When userspace requests a partition change to take effect on the next reboot this unbind requirement is unnecessarily restrictive. The driver does not need to enforce an unbind because partitions will not change until the next reboot. Of course, userspace still needs to be aware that changing the size of persistent capacity on the next reboot will result in the loss of data stored. That can happen regardless of whether it is presently bound at the time of issuing the set-partition command. When userspace requests a partition change to take effect immediately, restrictions are needed. The CXL_MEM driver currently blocks the usage of immediate mode, making the presence of SET_PARTITION_INFO, in this exclusive commands list, redundant. In the future, when the CXL_MEM driver adds support for immediate changes to device partitions it will ensure that the partition change will not affect any active decode. That means the work will not fall right back here, onto the CXL_PMEM driver. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Link: https://lore.kernel.org/r/accc6abc878f0662093b81490a1a052f2ff6f06e.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Block immediate mode in SET_PARTITION_INFO command [Alison Schofield, 2022-04-13, 2 files, -0/+48]
| | | | | | | | | | | | | | | | | | | | | | | User space may send the SET_PARTITION_INFO mailbox command using the IOCTL interface. Inspect the input payload and fail if the immediate flag is set. This is the first instance of the driver inspecting an input payload from user space. Assume there will be more such cases and implement with an extensible helper. In order for the kernel to react to an immediate partition change it needs to assert that the change will not affect any active decode. At a minimum this requires validating that the device is using HDM decoders instead of the CXL DVSEC for decode, and that none of the active HDM decoders are affected by the partition change. For now, just fail until that support arrives. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/241821186c363833980adbc389e2c547bc5a6395.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Move cxl_mem_command param to a local variable [Alison Schofield, 2022-04-13, 1 file, -12/+8]
| cxl_validate_cmd_from_user() is now the single point of validation for mailbox commands coming from user space. Previously, it returned a cxl_mem_command, but that was not sufficient when validation of the actual mailbox command became a requirement. Now, it returns a fully validated cxl_mbox_cmd. Remove the extraneous cxl_mem_command parameter. Define and use a local version only. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/c11a437896d914daf36f5ac8ec62f999c5ec2da7.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Make handle_mailbox_cmd_from_user() use a mbox param [Alison Schofield, 2022-04-13, 1 file, -27/+17]
| Previously, handle_mailbox_cmd_from_user() constructed the mailbox command and dispatched it to the hardware. The construction work has moved to the validation path. handle_mailbox_cmd_from_user() now expects a fully validated mbox param. Make its caller, cxl_send_cmd(), deliver it. Update the comments and dereferencing of the new mbox parameter. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/77050ba512d6c30eccf7505467509e460dd325a0.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Remove dependency on cxl_mem_command for a debug msg [Alison Schofield, 2022-04-13, 1 file, -3/+14]
| | | | | | | | | | | | | In preparation for removing access to struct cxl_mem_command, change this debug message to use cxl_mbox_cmd fields instead. Retrieve the pretty command name from cxl_mbox_cmd using a new opcode to command name helper. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/57265751d336a6e95f5ca31a9c77189408b05742.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Construct a users cxl_mbox_cmd in the validation path [Alison Schofield, 2022-04-13, 1 file, -4/+17]
| This is a step in refactoring the handling of user space mailbox commands. The intent is to have all the validation work originate in cxl_validate_cmd_from_user(). Move the construction and validation of a mailbox command to the validation path. Continue to pass both the out_cmd and the mbox_cmd until handle_mailbox_cmd_from_user() learns how to use a mbox_cmd param. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/c9fbdad968a2b619f9108bb6c37cef1a853cdf5a.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Move build of user mailbox cmd to a helper functions [Alison Schofield, 2022-04-13, 1 file, -25/+45]
| In preparation for moving the construction of a mailbox command to the validation path, extract the work into helper functions. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/493d7618a846d787c3ae28778935ca35e2b85eed.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Move raw command warning to raw command validation [Alison Schofield, 2022-04-13, 1 file, -3/+2]
| | | | | | | | | | | | | This move serves two purposes: 1) Emit the warning in the raw command validation path, and 2) Remove the dependency on the struct cxl_mem_command in handle_mailbox_cmd_from_user() in preparation for a refactor of that function. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/df5f0e0ec8afa1f75299aa86b4226ab4479ef325.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* cxl/mbox: Move cxl_mem_command construction to helper funcs [Alison Schofield, 2022-04-13, 1 file, -71/+76]
| | | | | | | | | | | | | | | | Sanitizing and constructing a cxl_mem_command from a userspace command is part of the validation process prior to submitting the command to a CXL device. Move this work to helper functions: cxl_to_mem_cmd(), cxl_to_mem_cmd_raw(). This declutters cxl_validate_cmd_from_user() in preparation for adding new validation steps. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/7d9b826f29262e3a484cb4bb7b63872134d60bd7.1648687552.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Linux 5.18-rc2 (v5.18-rc2) [Linus Torvalds, 2022-04-11, 1 file, -1/+1]
|
* Merge tag 'tty-5.18-rc2' of ↵ [Linus Torvalds, 2022-04-10, 1 file, -10/+10]
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull serial driver fix from Greg KH: "This is a single serial driver fix for a build issue that showed up due to changes that came in through the tty tree in 5.18-rc1 that were missed previously. It resolves a build error with the mpc52xx_uart driver. It has been in linux-next this week with no reported problems" * tag 'tty-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: serial: mpc52xx_uart: make rx/tx hooks return unsigned, part II.
| * tty: serial: mpc52xx_uart: make rx/tx hooks return unsigned, part II. [Jiri Slaby, 2022-04-04, 1 file, -10/+10]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The below commit changed types of some hooks in struct psc_ops. It also changed the types of the functions which are referenced in the instances of the above struct. However the commit did so only for CONFIG_PPC_MPC52xx, but not for CONFIG_PPC_MPC512x. This results in build errors like: mpc52xx_uart.c:static unsigned int mpc52xx_psc_raw_tx_rdy(struct uart_port *port) mpc52xx_uart.c:static int mpc512x_psc_raw_tx_rdy(struct uart_port *port) ^^^ mpc52xx_uart.c:static int mpc5125_psc_raw_tx_rdy(struct uart_port *port) ^^^ Therefore, fix the latter case now too. Fixes: 18662a1d8f35 (tty: serial: mpc52xx_uart: make rx/tx hooks return unsigned) Cc: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Link: https://lore.kernel.org/r/20220404055122.31194-1-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'staging-5.18-rc2' of ↵ [Linus Torvalds, 2022-04-10, 1 file, -1/+1]
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fix from Greg KH: "Here is a single staging driver fix for 5.18-rc2 that resolves an endian issue for the r8188eu driver. It has been in linux-next all this week with no reported problems" * tag 'staging-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: r8188eu: Fix PPPoE tag insertion on little endian systems
| * | staging: r8188eu: Fix PPPoE tag insertion on little endian systemsGuenter Roeck2022-04-041-1/+1
| |/
| |   In __nat25_add_pppoe_tag(), the tag length is read from the tag data
| |   structure. The value is kept in network format, but read as raw
| |   value. With -Warray-bounds, this results in the following gcc
| |   error/warning when building the driver on alpha.
| |
| |   In function '__nat25_add_pppoe_tag',
| |       inlined from 'nat25_db_handle' at drivers/staging/r8188eu/core/rtw_br_ext.c:479:11:
| |   arch/alpha/include/asm/string.h:22:16: error:
| |       '__builtin_memcpy' forming offset [40, 2051] is out of the bounds
| |       [0, 40] of object 'tag_buf' with type 'unsigned char[40]'
| |
| |   Add the missing be16_to_cpu() to fix the compile error. It should be
| |   noted, however, that this fix means that the code probably did not
| |   work on any little endian systems and/or that the driver has other
| |   endianness related issues. A build with C=1 suggests that this is
| |   indeed the case. This patch does not attempt to fix any of those
| |   other issues.
| |
| |   Fixes: 15865124feed ("staging: r8188eu: introduce new core dir for RTL8188eu driver")
| |   Cc: Phillip Potter <phil@philpotter.co.uk>
| |   Signed-off-by: Guenter Roeck <linux@roeck-us.net>
| |   Link: https://lore.kernel.org/r/20220404134338.3276991-1-linux@roeck-us.net
| |   Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
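For illustration, here is a userspace sketch of what reading a network-order 16-bit length involves. The struct layout and the demo_* names are assumptions made for the example; the helper is a portable stand-in for the kernel's be16_to_cpu(), not the driver's code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a PPPoE tag header; both fields are stored
 * in network (big-endian) byte order. */
struct demo_pppoe_tag {
	uint16_t tag_type;
	uint16_t tag_len;
};

/* Portable equivalent of be16_to_cpu(): interpret the two stored bytes
 * as a big-endian value regardless of host endianness. */
static uint16_t demo_be16_to_cpu(uint16_t wire)
{
	uint8_t b[2];

	memcpy(b, &wire, sizeof(b));
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Returns the tag payload length in host byte order. */
static uint16_t demo_tag_payload_len(const struct demo_pppoe_tag *tag)
{
	return demo_be16_to_cpu(tag->tag_len);
}
```

With the wire bytes 0x00 0x08 in tag_len, the converted length is 8 on any host, whereas a raw read on a little-endian host would yield 0x0800 (2048). A value that large, used as a copy length, is the kind of thing a bounds checker objects to.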
* | Merge tag 'driver-core-5.18-rc2' of ↵Linus Torvalds2022-04-104-48/+4
|\ \
| | |   git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
| | |
| | | Pull driver core updates from Greg KH:
| | |  "Here are two small driver core changes for 5.18-rc2.
| | |
| | |   They are the final bits in the removal of the default_attrs field in
| | |   struct kobj_type. I had to wait until after 5.18-rc1 for all of the
| | |   changes to do this came in through different development trees, and
| | |   then one new user snuck in. So this series has two changes:
| | |
| | |    - removal of the default_attrs field in the powerpc/pseries/vas
| | |      code. The change has been acked by the PPC maintainers to come
| | |      through this tree
| | |
| | |    - removal of default_attrs from struct kobj_type now that all
| | |      in-kernel users are removed. This cleans up the kobject code a
| | |      little bit and removes some duplicated functionality that
| | |      confused people (now there is only one way to do default groups)
| | |
| | |   Both of these have been in linux-next for all of this week with no
| | |   reported problems"
| | |
| | | * tag 'driver-core-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
| | |   kobject: kobj_type: remove default_attrs
| | |   powerpc/pseries/vas: use default_groups in kobj_type
| * | kobject: kobj_type: remove default_attrsGreg Kroah-Hartman2022-04-053-46/+0
| | |
| | | Now that all in-kernel users of default_attrs for the kobj_type are
| | | gone and converted to properly use the default_groups pointer instead,
| | | it can be safely removed.
| | |
| | | There is one standard way to create sysfs files in a kobj_type, and
| | | not two like before, causing confusion as to which should be used.
| | |
| | | Cc: "Rafael J. Wysocki" <rafael@kernel.org>
| | | Link: https://lore.kernel.org/r/20220106133151.607703-1-gregkh@linuxfoundation.org
| | | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | powerpc/pseries/vas: use default_groups in kobj_typeGreg Kroah-Hartman2022-04-051-2/+4
| |/
| |   There are currently 2 ways to create a set of sysfs files for a
| |   kobj_type, through the default_attrs field, and the default_groups
| |   field. Move the pseries vas sysfs code to use default_groups field
| |   which has been the preferred way since aa30f47cf666 ("kobject: Add
| |   support for default attribute groups to kobj_type") so that we can
| |   soon get rid of the obsolete default_attrs field.
| |
| |   Cc: Michael Ellerman <mpe@ellerman.id.au>
| |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| |   Cc: Paul Mackerras <paulus@samba.org>
| |   Cc: Haren Myneni <haren@linux.ibm.com>
| |   Cc: Nicholas Piggin <npiggin@gmail.com>
| |   Cc: linuxppc-dev@lists.ozlabs.org
| |   Cc: linux-kernel@vger.kernel.org
| |   Link: https://lore.kernel.org/r/20220329142552.558339-1-gregkh@linuxfoundation.org
| |   Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'char-misc-5.18-rc2' of ↵Linus Torvalds2022-04-101-8/+8
|\ \
| | |   git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
| | |
| | | Pull char/misc driver fix from Greg KH:
| | |  "A single driver fix. It resolves the build warning issue on 32bit
| | |   systems in the habanalabs driver that came in during the 5.18-rc1
| | |   merge cycle.
| | |
| | |   It has been in linux-next for all this week with no reported
| | |   problems"
| | |
| | | * tag 'char-misc-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
| | |   habanalabs: Fix test build failures
| * | habanalabs: Fix test build failuresGuenter Roeck2022-04-041-8/+8
| |/
| |   allmodconfig builds on 32-bit architectures fail with the following
| |   error.
| |
| |   drivers/misc/habanalabs/common/memory.c: In function 'alloc_device_memory':
| |   drivers/misc/habanalabs/common/memory.c:153:49: error:
| |       cast from pointer to integer of different size
| |
| |   Fix the typecast. While at it, drop other unnecessary typecasts
| |   associated with the same commit.
| |
| |   Fixes: e8458e20e0a3c ("habanalabs: make sure device mem alloc is page aligned")
| |   Cc: Ohad Sharabi <osharabi@habana.ai>
| |   Signed-off-by: Guenter Roeck <linux@roeck-us.net>
| |   Link: https://lore.kernel.org/r/20220404134859.3278599-1-linux@roeck-us.net
| |   Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
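As a general illustration of this class of error (not the actual habanalabs change), casting a pointer directly to a 64-bit integer is diagnosed on targets where pointers are 32 bits wide. Going through uintptr_t, an integer type defined to hold a pointer value, avoids the size mismatch. The demo_* helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Widen a pointer to a 64-bit integer. The intermediate uintptr_t cast
 * has the same width as the pointer, so the pointer-to-integer step
 * never changes size; the widening to 64 bits is then an ordinary
 * integer conversion.
 */
static uint64_t demo_ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Inverse conversion, again going through uintptr_t. */
static void *demo_u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

The round trip through these two helpers returns the original pointer on both 32-bit and 64-bit builds.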
* | Merge tag 'powerpc-5.18-2' of ↵Linus Torvalds2022-04-1015-35/+169
|\ \
| | |   git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
| | |
| | | Pull powerpc fixes from Michael Ellerman:
| | |
| | |  - Fix KVM "lost kick" race, where an attempt to pull a vcpu out of the
| | |    guest could be lost (or delayed until the next guest exit).
| | |
| | |  - Disable SCV (system call vectored) when PR KVM guests could be run.
| | |
| | |  - Fix KVM PR guests using SCV, by disallowing AIL != 0 for KVM PR
| | |    guests.
| | |
| | |  - Add a new KVM CAP to indicate if AIL == 3 is supported.
| | |
| | |  - Fix a regression when hotplugging a CPU to a memoryless/cpuless
| | |    node.
| | |
| | |  - Make virt_addr_valid() stricter for 64-bit Book3E & 32-bit, which
| | |    fixes crashes seen due to hardened usercopy.
| | |
| | |  - Revert a change to max_mapnr which broke HIGHMEM.
| | |
| | | Thanks to Christophe Leroy, Fabiano Rosas, Kefeng Wang, Nicholas
| | | Piggin, and Srikar Dronamraju.
| | |
| | | * tag 'powerpc-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
| | |   Revert "powerpc: Set max_mapnr correctly"
| | |   powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
| | |   KVM: PPC: Move kvmhv_on_pseries() into kvm_ppc.h
| | |   powerpc/numa: Handle partially initialized numa nodes
| | |   powerpc/64: Fix build failure with allyesconfig in book3s_64_entry.S
| | |   KVM: PPC: Use KVM_CAP_PPC_AIL_MODE_3
| | |   KVM: PPC: Book3S PR: Disallow AIL != 0
| | |   KVM: PPC: Book3S PR: Disable SCV when AIL could be disabled
| | |   KVM: PPC: Book3S HV P9: Fix "lost kick" race
| * | Revert "powerpc: Set max_mapnr correctly"Kefeng Wang2022-04-071-1/+1
| | |
| | | This reverts commit 602946ec2f90d5bd965857753880db29d2d9a1e9.
| | |
| | | If CONFIG_HIGHMEM is enabled, no highmem will be added with max_mapnr
| | | set to max_low_pfn, see mem_init():
| | |
| | |   for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
| | |         ...
| | |         free_highmem_page();
| | |   }
| | |
| | | Now that virt_addr_valid() has been fixed in the previous commit, we
| | | can revert the change to max_mapnr.
| | |
| | | Fixes: 602946ec2f90 ("powerpc: Set max_mapnr correctly")
| | | Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
| | | Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
| | | Reported-by: Erhard F. <erhard_f@mailbox.org>
| | | [mpe: Update change log to reflect series reordering]
| | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | Link: https://lore.kernel.org/r/20220406145802.538416-2-mpe@ellerman.id.au
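The effect on the mem_init() loop can be seen in a tiny model. This sketch only counts iterations of a loop with the same bounds; the function name is hypothetical, and it assumes that highmem_mapnr sits at the lowmem boundary (roughly max_low_pfn), which the change log implies but does not state.

```c
/* Counts how many PFNs a loop with mem_init()'s bounds would visit. */
static unsigned long demo_highmem_pages(unsigned long highmem_mapnr,
					unsigned long max_mapnr)
{
	unsigned long pfn, n = 0;

	for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn)
		n++;	/* stands in for free_highmem_page() */
	return n;
}
```

When max_mapnr equals the starting PFN the loop body never runs, so no highmem page is handed to the allocator; with max_mapnr above it, every page in between is visited.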
| * | powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bitKefeng Wang2022-04-071-1/+5
| | |
| | | mpe: On 64-bit Book3E vmalloc space starts at 0x8000000000000000.
| | |
| | | Because of the way __pa() works we have:
| | |   __pa(0x8000000000000000) == 0, and therefore
| | |   virt_to_pfn(0x8000000000000000) == 0, and therefore
| | |   virt_addr_valid(0x8000000000000000) == true
| | |
| | | Which is wrong, virt_addr_valid() should be false for vmalloc space.
| | | In fact all vmalloc addresses that alias with a valid PFN will return
| | | true from virt_addr_valid(). That can cause bugs with hardened
| | | usercopy as described below by Kefeng Wang:
| | |
| | |   When running ethtool eth0 on 64-bit Book3E, a BUG occurred:
| | |
| | |     usercopy: Kernel memory exposure attempt detected from SLUB object not in SLUB page?! (offset 0, size 1048)!
| | |     kernel BUG at mm/usercopy.c:99
| | |     ...
| | |     usercopy_abort+0x64/0xa0 (unreliable)
| | |     __check_heap_object+0x168/0x190
| | |     __check_object_size+0x1a0/0x200
| | |     dev_ethtool+0x2494/0x2b20
| | |     dev_ioctl+0x5d0/0x770
| | |     sock_do_ioctl+0xf0/0x1d0
| | |     sock_ioctl+0x3ec/0x5a0
| | |     __se_sys_ioctl+0xf0/0x160
| | |     system_call_exception+0xfc/0x1f0
| | |     system_call_common+0xf8/0x200
| | |
| | |   The code is shown below:
| | |
| | |     data = vzalloc(array_size(gstrings.len, ETH_GSTRING_LEN));
| | |     copy_to_user(useraddr, data, gstrings.len * ETH_GSTRING_LEN))
| | |
| | |   The data is allocated by vmalloc(), and virt_addr_valid(ptr) will
| | |   return true on 64-bit Book3E, which leads to the panic.
| | |
| | |   As commit 4dd7554a6456 ("powerpc/64: Add VIRTUAL_BUG_ON checks for
| | |   __va and __pa addresses") does, make sure the virt addr is above
| | |   PAGE_OFFSET in virt_addr_valid() for 64-bit, and add an upper limit
| | |   check to make sure the virt addr is below high_memory.
| | |
| | |   Meanwhile, for 32-bit, PAGE_OFFSET is the virtual address of the
| | |   start of lowmem and high_memory is the upper low virtual address, so
| | |   the check is suitable for 32-bit too. This will also fix the issue
| | |   mentioned in commit 602946ec2f90 ("powerpc: Set max_mapnr
| | |   correctly").
| | |
| | | On 32-bit there is a similar problem with high memory, that was fixed
| | | in commit 602946ec2f90 ("powerpc: Set max_mapnr correctly"), but that
| | | commit breaks highmem and needs to be reverted.
| | |
| | | We can't easily fix __pa(), we have code that relies on its current
| | | behaviour. So for now add extra checks to virt_addr_valid().
| | |
| | | For 64-bit Book3S the extra checks are not necessary, the combination
| | | of virt_to_pfn() and pfn_valid() should yield the correct result, but
| | | they are harmless.
| | |
| | | Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
| | | Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
| | | [mpe: Add additional change log detail]
| | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | Link: https://lore.kernel.org/r/20220406145802.538416-1-mpe@ellerman.id.au
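The aliasing described in this change log can be modelled in userspace. The sketch below is a simplified model, not kernel code: it assumes __pa() behaves like a mask that drops the top four address bits (which reproduces the __pa(0x8000000000000000) == 0 result quoted above), assumes a linear map starting at 0xc000000000000000, and assumes 1 GiB of RAM at physical address 0. All demo_* and DEMO_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12
#define DEMO_PAGE_OFFSET  0xc000000000000000ULL /* assumed linear map base */
#define DEMO_VMALLOC_BASE 0x8000000000000000ULL /* from the change log */
#define DEMO_RAM_BYTES    (1ULL << 30)          /* assumed 1 GiB at phys 0 */
#define DEMO_HIGH_MEMORY  (DEMO_PAGE_OFFSET + DEMO_RAM_BYTES)

/* Assumed model of __pa(): drop the top four address bits. */
static uint64_t demo_pa(uint64_t vaddr)
{
	return vaddr & 0x0fffffffffffffffULL;
}

static uint64_t demo_virt_to_pfn(uint64_t vaddr)
{
	return demo_pa(vaddr) >> DEMO_PAGE_SHIFT;
}

/* A PFN is valid in this model if it falls inside the assumed RAM. */
static bool demo_pfn_valid(uint64_t pfn)
{
	return pfn < (DEMO_RAM_BYTES >> DEMO_PAGE_SHIFT);
}

/* Old behaviour: only the PFN is checked, so vmalloc addresses alias. */
static bool demo_virt_addr_valid_old(uint64_t vaddr)
{
	return demo_pfn_valid(demo_virt_to_pfn(vaddr));
}

/* New behaviour: also require PAGE_OFFSET <= vaddr < high_memory. */
static bool demo_virt_addr_valid_new(uint64_t vaddr)
{
	return vaddr >= DEMO_PAGE_OFFSET && vaddr < DEMO_HIGH_MEMORY &&
	       demo_pfn_valid(demo_virt_to_pfn(vaddr));
}
```

In this model the old check accepts the first vmalloc address because it aliases PFN 0, while the range-checked version rejects it and still accepts addresses inside the linear map.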
| * | KVM: PPC: Move kvmhv_on_pseries() into kvm_ppc.hMichael Ellerman2022-04-032-12/+12
| | |
| | | We recently introduced a usage of kvmhv_on_pseries() in powerpc.c,
| | | which causes a build error for ppc64_book3e_allmodconfig:
| | |
| | |   arch/powerpc/kvm/powerpc.c:716:8: error: implicit declaration of function ‘kvmhv_on_pseries’
| | |     716 |    if (kvmhv_on_pseries()) {
| | |         |        ^~~~~~~~~~~~~~~~
| | |
| | | Fix it by moving kvmhv_on_pseries() into kvm_ppc.h so that the stub
| | | version is available for book3e builds.
| | |
| | | Fixes: f771b55731fc ("KVM: PPC: Use KVM_CAP_PPC_AIL_MODE_3")
| | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | powerpc/numa: Handle partially initialized numa nodesSrikar Dronamraju2022-03-311-1/+1
| | |
| | | With commit 09f49dca570a ("mm: handle uninitialized numa nodes
| | | gracefully") NODE_DATA for even a memoryless/cpuless node is partially
| | | initialized at boot time.
| | |
| | | Before onlining the node, current Powerpc code checks for NODE_DATA to
| | | be NULL. However since NODE_DATA is partially initialized, this check
| | | will end up always being false.
| | |
| | | This causes hotplugging a CPU to a memoryless/cpuless node to fail.
| | |
| | | Before adding CPUs:
| | |
| | |   $ numactl -H
| | |   available: 1 nodes (4)
| | |   node 4 cpus: 0 1 2 3 4 5 6 7
| | |   node 4 size: 97372 MB
| | |   node 4 free: 95545 MB
| | |   node distances:
| | |   node   4
| | |     4:  10
| | |
| | |   $ lparstat
| | |   System Configuration
| | |   type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00
| | |
| | |   %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
| | |   ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
| | |    2.66  2.67  0.16    94.51     0.00  0.00  5.33  0.00 67749     0
| | |
| | | After hotplugging 32 cores:
| | |
| | |   $ numactl -H
| | |   node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129
| | |   130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
| | |   147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
| | |   164 165 166 167 168 169 170 171 172 173 174 175
| | |   node 4 size: 97372 MB
| | |   node 4 free: 93636 MB
| | |   node distances:
| | |   node   4
| | |     4:  10
| | |
| | |   $ lparstat
| | |   System Configuration
| | |   type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00
| | |
| | |   %user  %sys %wait    %idle    physc %entc lbusy   app    vcsw phint
| | |   ----- ----- -----    -----    ----- ----- ----- -----   ----- -----
| | |    0.04  0.02  0.00    99.94     0.00  0.00  0.06  0.00 1128751     3
| | |
| | | As we can see numactl is listing only 8 cores while lparstat is
| | | showing 33 cores.
| | |
| | | Also dmesg is showing messages like:
| | |
| | |   [ 2261.318350 ] BUG: arch topology borken
| | |   [ 2261.318357 ]      the DIE domain not a subset of the NODE domain
| | |
| | | Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
| | | Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
| | | Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | | Acked-by: Michal Hocko <mhocko@suse.com>
| | | Acked-by: Oscar Salvador <osalvador@suse.de>
| | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | Link: https://lore.kernel.org/r/20220330135123.1868197-1-srikar@linux.vnet.ibm.com
| * | powerpc/64: Fix build failure with allyesconfig in book3s_64_entry.SChristophe Leroy2022-03-281-2/+8
| | |
| | | Using conditional branches between two files is hazardous, they may
| | | get linked too far from each other.
| | |
| | |   arch/powerpc/kvm/book3s_64_entry.o:(.text+0x3ec): relocation truncated
| | |   to fit: R_PPC64_REL14 (stub) against symbol `system_reset_common'
| | |   defined in .text section in arch/powerpc/kernel/head_64.o
| | |
| | | Reorganise the code to use unconditional branches.
| | |
| | | Fixes: 89d35b239101 ("KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C")
| | | Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
| | | [mpe: Avoid odd-looking bne ., use named local labels]
| | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | Link: https://lore.kernel.org/r/89cf27bf43ee07a0b2879b9e8e2f5cd6386a3645.1648366338.git.christophe.leroy@csgroup.eu
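The "relocation truncated to fit" error follows from the different reach of the two branch encodings. As a rough sketch (the helper names are hypothetical, and this models only the displacement range, not the linker's stub handling): a conditional branch carries a 14-bit word displacement, and an unconditional branch carries a 24-bit one.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Conditional (B-form) branch: 14-bit word displacement, i.e. a signed
 * 16-bit byte offset that is a multiple of 4 (about +/- 32 KiB).
 */
static bool demo_fits_rel14(int64_t byte_offset)
{
	return (byte_offset & 3) == 0 &&
	       byte_offset >= -32768 && byte_offset <= 32764;
}

/*
 * Unconditional (I-form) branch: 24-bit word displacement, i.e. a signed
 * 26-bit byte offset that is a multiple of 4 (about +/- 32 MiB).
 */
static bool demo_fits_rel24(int64_t byte_offset)
{
	return (byte_offset & 3) == 0 &&
	       byte_offset >= -33554432 && byte_offset <= 33554428;
}
```

A target 32 KiB away is already out of reach for the conditional form but comfortably within range of the unconditional one. That is why branching conditionally to a nearby local label and then unconditionally to the far symbol avoids the error.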
| * | Merge branch 'topic/ppc-kvm' into nextMichael Ellerman2022-03-289-18/+142
| |\ \
| | | |   Merge some more commits from our KVM topic branch.
| | | |
| | | |   In particular this brings in some commits that depend on a new
| | | |   capability that was merged via the KVM tree for v5.18.
| | * | KVM: PPC: Use KVM_CAP_PPC_AIL_MODE_3Nicholas Piggin2022-03-083-1/+31
| | | |
| | | | Use KVM_CAP_PPC_AIL_MODE_3 to advertise the capability to set the AIL
| | | | resource mode to 3 with the H_SET_MODE hypercall. This capability
| | | | differs between processor types and KVM types (PR, HV, Nested HV),
| | | | and affects guest-visible behaviour.
| | | |
| | | | QEMU will implement a cap-ail-mode-3 to control this behaviour[1],
| | | | and use the KVM CAP if available to determine KVM support[2].
| | | |
| | | | [1] https://lists.nongnu.org/archive/html/qemu-ppc/2022-02/msg00437.html
| | | | [2] https://lists.nongnu.org/archive/html/qemu-ppc/2022-02/msg00439.html
| | | |
| | | | Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
| | | | Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
| | | | [mpe: Rebase onto 93b71801a827 from kvm-ppc-cap-210 branch, add EXPORT_SYMBOL]
| | | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | | Link: https://lore.kernel.org/r/20220222064727.2314380-4-npiggin@gmail.com
| | * | KVM: PPC: Book3S PR: Disallow AIL != 0Nicholas Piggin2022-03-081-0/+20
| | | |
| | | | KVM PR does not implement address translation modes on interrupt, so
| | | | it must not allow H_SET_MODE to succeed. The behaviour change caused
| | | | by this mode is architected and not advisory (interrupts *must*
| | | | behave differently).
| | | |
| | | | QEMU does not deal with differences in AIL support in the host. The
| | | | solution to that is a spapr capability and corresponding KVM CAP, but
| | | | this patch does not break things more than before (the host
| | | | behaviour already differs, this change just disallows some modes
| | | | that are not implemented properly).
| | | |
| | | | By happy coincidence, this allows PR Linux guests that are using the
| | | | SCV facility to boot and run, because Linux disables the use of SCV
| | | | if AIL can not be set to 3. This does not fix the underlying problem
| | | | of missing SCV support (an OS could implement real-mode SCV vectors
| | | | and try to enable the facility). The true fix for that is for KVM PR
| | | | to emulate scv interrupts from the facility unavailable interrupt.
| | | |
| | | | Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
| | | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | | Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
| | | | Link: https://lore.kernel.org/r/20220222064727.2314380-3-npiggin@gmail.com
| | * | KVM: PPC: Book3S PR: Disable SCV when AIL could be disabledNicholas Piggin2022-03-084-9/+58
| | | |
| | | | PR KVM does not support running with AIL enabled, and SCV is not
| | | | supported with AIL disabled. Fix this by ensuring the SCV facility is
| | | | disabled with FSCR while a CPU could be running with AIL=0.
| | | |
| | | | The PowerNV host supports disabling AIL on a per-CPU basis, so SCV
| | | | just needs to be disabled when a vCPU is being run.
| | | |
| | | | The pSeries machine can only switch AIL on a system-wide basis, so it
| | | | must disable SCV support at boot if the configuration can potentially
| | | | run a PR KVM guest.
| | | |
| | | | Also ensure the FSCR[SCV] bit can not be enabled when emulating
| | | | mtFSCR for the guest.
| | | |
| | | | SCV is not emulated for the PR guest at the moment, this just fixes
| | | | the host crashes.
| | | |
| | | | Alternatives considered and rejected:
| | | | - SCV support can not be disabled by PR KVM after boot, because it is
| | | |   advertised to userspace with HWCAP.
| | | | - AIL can not be disabled on a per-CPU basis. At least when running
| | | |   on pseries it is a per-LPAR setting.
| | | | - Support for real-mode SCV vectors will not be added because they
| | | |   are at 0x17000 so making such a large fixed head space causes
| | | |   immediate value limits to be exceeded, requiring a lot of rework
| | | |   and more code.
| | | | - Disabling SCV for any PR KVM possible kernel will cause a slowdown
| | | |   when not using PR KVM.
| | | | - A boot time option to disable SCV to use PR KVM is user-hostile.
| | | | - System call instruction emulation for SCV facility unavailable
| | | |   instructions is too complex and old emulation code was subtly
| | | |   broken and removed.
| | | |
| | | | Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
| | | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | | | Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
| | | | Link: https://lore.kernel.org/r/20220222064727.2314380-2-npiggin@gmail.com