summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* xfs: use iomap_valid method to detect stale cached iomapsDave Chinner2022-11-285-27/+87
| | | | | | | | | | | | | | | | | | | | | | | Now that iomap supports a mechanism to validate cached iomaps for buffered write operations, hook it up to the XFS buffered write ops so that we can avoid data corruptions that result from stale cached iomaps. See: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ or the ->iomap_valid() introduction commit for exact details of the corruption vector. The validity cookie we store in the iomap is based on the type of iomap we return. It is expected that the iomap->flags we set in xfs_bmbt_to_iomap() is not perturbed by the iomap core and are returned to us in the iomap passed via the .iomap_valid() callback. This ensures that the validity cookie is always checking the correct inode fork sequence numbers to detect potential changes that affect the extent cached by the iomap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* iomap: write iomap validity checksDave Chinner2022-11-283-10/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent multithreaded write data corruption has been uncovered in the iomap write code. The core of the problem is partial folio writes can be flushed to disk while a new racing write can map it and fill the rest of the page: writeback new write allocate blocks blocks are unwritten submit IO ..... map blocks iomap indicates UNWRITTEN range loop { lock folio copyin data ..... IO completes runs unwritten extent conv blocks are marked written <iomap now stale> get next folio } Now add memory pressure such that memory reclaim evicts the partially written folio that has already been written to disk. When the new write finally gets to the last partial page of the new write, it does not find it in cache, so it instantiates a new page, sees the iomap is unwritten, and zeros the part of the page that it does not have data from. This overwrites the data on disk that was originally written. The full description of the corruption mechanism can be found here: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ To solve this problem, we need to check whether the iomap is still valid after we lock each folio during the write. We have to do it after we lock the page so that we don't end up with state changes occurring while we wait for the folio to be locked. Hence we need a mechanism to be able to check that the cached iomap is still valid (similar to what we already do in buffered writeback), and we need a way for ->begin_write to back out and tell the high level iomap iterator that we need to remap the remaining write range. The iomap needs to grow some storage for the validity cookie that the filesystem provides to travel with the iomap. XFS, in particular, also needs to know some more information about what the iomap maps (attribute extents rather than file data extents) to for the validity cookie to cover all the types of iomaps we might need to validate. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* xfs: xfs_bmap_punch_delalloc_range() should take a byte rangeDave Chinner2022-11-284-21/+15
| | | | | | | | | | | All the callers of xfs_bmap_punch_delalloc_range() jump through hoops to convert a byte range to filesystem blocks before calling xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to xfs_bmap_punch_delalloc_range() and have it do the conversion to filesystem blocks internally. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* iomap: buffered write failure should not truncate the page cacheDave Chinner2022-11-281-15/+180
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iomap_file_buffered_write_punch_delalloc() currently invalidates the page cache over the unused range of the delalloc extent that was allocated. While the write allocated the delalloc extent, it does not own it exclusively as the write does not hold any locks that prevent either writeback or mmap page faults from changing the state of either the page cache or the extent state backing this range. Whilst xfs_bmap_punch_delalloc_range() already handles races in extent conversion - it will only punch out delalloc extents and it ignores any other type of extent - the page cache truncate does not discriminate between data written by this write or some other task. As a result, truncating the page cache can result in data corruption if the write races with mmap modifications to the file over the same range. generic/346 exercises this workload, and if we randomly fail writes (as will happen when iomap gets stale iomap detection later in the patchset), it will randomly corrupt the file data because it removes data written by mmap() in the same page as the write() that failed. Hence we do not want to punch out the page cache over the range of the extent we failed to write to - what we actually need to do is detect the ranges that have dirty data in cache over them and *not punch them out*. To do this, we have to walk the page cache over the range of the delalloc extent we want to remove. This is made complex by the fact we have to handle partially up-to-date folios correctly and this can happen even when the FSB size == PAGE_SIZE because we now support multi-page folios in the page cache. Because we are only interested in discovering the edges of data ranges in the page cache (i.e. hole-data boundaries) we can make use of mapping_seek_hole_data() to find those transitions in the page cache. As we hold the invalidate_lock, we know that the boundaries are not going to change while we walk the range. This interface is also byte-based and is sub-page block aware, so we can find the data ranges in the cache based on byte offsets rather than page, folio or fs block sized chunks. This greatly simplifies the logic of finding dirty cached ranges in the page cache. Once we've identified a range that contains cached data, we can then iterate the range folio by folio. This allows us to determine if the data is dirty and hence perform the correct delalloc extent punching operations. The seek interface we use to iterate data ranges will give us sub-folio start/end granularity, so we may end up looking up the same folio multiple times as the seek interface iterates across each discontiguous data region in the folio. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* xfs,iomap: move delalloc punching to iomapDave Chinner2022-11-233-39/+72
| | | | | | | | | | | | | | | Because that's what Christoph wants for this error handling path only XFS uses. It requires a new iomap export for handling errors over delalloc ranges. This is basically the XFS code as is stands, but even though Christoph wants this as iomap funcitonality, we still have to call it from the filesystem specific ->iomap_end callback, and call into the iomap code with yet another filesystem specific callback to punch the delalloc extent within the defined ranges. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* xfs: use byte ranges for write cleanup rangesDave Chinner2022-11-231-15/+25
| | | | | | | | | | | | | | | | | | | | | xfs_buffered_write_iomap_end() currently converts the byte ranges passed to it to filesystem blocks to pass them to the bmap code to punch out delalloc blocks, but then has to convert filesytem blocks back to byte ranges for page cache truncate. We're about to make the page cache truncate go away and replace it with a page cache walk, so having to convert everything to/from/to filesystem blocks is messy and error-prone. It is much easier to pass around byte ranges and convert to page indexes and/or filesystem blocks only where those units are needed. In preparation for the page cache walk being added, add a helper that converts byte ranges to filesystem blocks and calls xfs_bmap_punch_delalloc_range() and convert xfs_buffered_write_iomap_end() to calculate limits in byte ranges. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* xfs: punching delalloc extents on write failure is racyDave Chinner2022-11-231-18/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_buffered_write_iomap_end() has a comment about the safety of punching delalloc extents based holding the IOLOCK_EXCL. This comment is wrong, and punching delalloc extents is not race free. When we punch out a delalloc extent after a write failure in xfs_buffered_write_iomap_end(), we punch out the page cache with truncate_pagecache_range() before we punch out the delalloc extents. At this point, we only hold the IOLOCK_EXCL, so there is nothing stopping mmap() write faults racing with this cleanup operation, reinstantiating a folio over the range we are about to punch and hence requiring the delalloc extent to be kept. If this race condition is hit, we can end up with a dirty page in the page cache that has no delalloc extent or space reservation backing it. This leads to bad things happening at writeback time. To avoid this race condition, we need the page cache truncation to be atomic w.r.t. the extent manipulation. We can do this by holding the mapping->invalidate_lock exclusively across this operation - this will prevent new pages from being inserted into the page cache whilst we are removing the pages and the backing extent and space reservation. Taking the mapping->invalidate_lock exclusively in the buffered write IO path is safe - it naturally nests inside the IOLOCK (see truncate and fallocate paths). iomap_zero_range() can be called from under the mapping->invalidate_lock (from the truncate path via either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter() will not instantiate new delalloc pages (because it skips holes) and hence will not ever need to punch out delalloc extents on failure. Fix the locking issue, and clean up the code logic a little to avoid unnecessary work if we didn't allocate the delalloc extent or wrote the entire region we allocated. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* xfs: write page faults in iomap are not buffered writesDave Chinner2022-11-073-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we reserve a delalloc region in xfs_buffered_write_iomap_begin, we mark the iomap as IOMAP_F_NEW so that the the write context understands that it allocated the delalloc region. If we then fail that buffered write, xfs_buffered_write_iomap_end() checks for the IOMAP_F_NEW flag and if it is set, it punches out the unused delalloc region that was allocated for the write. The assumption this code makes is that all buffered write operations that can allocate space are run under an exclusive lock (i_rwsem). This is an invalid assumption: page faults in mmap()d regions call through this same function pair to map the file range being faulted and this runs only holding the inode->i_mapping->invalidate_lock in shared mode. IOWs, we can have races between page faults and write() calls that fail the nested page cache write operation that result in data loss. That is, the failing iomap_end call will punch out the data that the other racing iomap iteration brought into the page cache. This can be reproduced with generic/34[46] if we arbitrarily fail page cache copy-in operations from write() syscalls. Code analysis tells us that the iomap_page_mkwrite() function holds the already instantiated and uptodate folio locked across the iomap mapping iterations. Hence the folio cannot be removed from memory whilst we are mapping the range it covers, and as such we do not care if the mapping changes state underneath the iomap iteration loop: 1. if the folio is not already dirty, there is no writeback races possible. 2. if we allocated the mapping (delalloc or unwritten), the folio cannot already be dirty. See #1. 3. If the folio is already dirty, it must be up to date. As we hold it locked, it cannot be reclaimed from memory. Hence we always have valid data in the page cache while iterating the mapping. 4. Valid data in the page cache can exist when the underlying mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not change the data in the page - it only affects actions if we are initialising a new page. Hence #3 applies and we don't care about these extent map transitions racing with iomap_page_mkwrite(). 5. iomap_page_mkwrite() checks for page invalidation races (truncate, hole punch, etc) after it locks the folio. We also hold the mapping->invalidation_lock here, and hence the mapping cannot change due to extent removal operations while we are iterating the folio. As such, filesystems that don't use bufferheads will never fail the iomap_folio_mkwrite_iter() operation on the current mapping, regardless of whether the iomap should be considered stale. Further, the range we are asked to iterate is limited to the range inside EOF that the folio spans. Hence, for XFS, we will only map the exact range we are asked for, and we will only do speculative preallocation with delalloc if we are mapping a hole at the EOF page. The iterator will consume the entire range of the folio that is within EOF, and anything beyond the EOF block cannot be accessed. We never need to truncate this post-EOF speculative prealloc away in the context of the iomap_page_mkwrite() iterator because if it remains unused we'll remove it when the last reference to the inode goes away. Hence we don't actually need an .iomap_end() cleanup/error handling path at all for iomap_page_mkwrite() for XFS. This means we can separate the page fault processing from the complexity of the .iomap_end() processing in the buffered write path. This also means that the buffered write path will also be able to take the mapping->invalidate_lock as necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* Linux 6.1-rc4v6.1-rc4Linus Torvalds2022-11-071-1/+1
|
* Merge tag 'cxl-fixes-for-6.1-rc4' of ↵Linus Torvalds2022-11-068-91/+448
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fixes from Dan Williams: "Several fixes for CXL region creation crashes, leaks and failures. This is mainly fallout from the original implementation of dynamic CXL region creation (instantiate new physical memory pools) that arrived in v6.0-rc1. Given the theme of "failures in the presence of pass-through decoders" this also includes new regression test infrastructure for that case. Summary: - Fix region creation crash with pass-through decoders - Fix region creation crash when no decoder allocation fails - Fix region creation crash when scanning regions to enforce the increasing physical address order constraint that CXL mandates - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of 1:1 memory-device-to-region associations. - Fix a memory leak for cxl_region objects when regions with active targets are deleted - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window) emulated proximity domains. - Fix region creation failure for switch attached devices downstream of a single-port host-bridge - Fix false positive memory leak of cxl_region objects by recycling recently used region ids rather than freeing them - Add regression test infrastructure for a pass-through decoder configuration - Fix some mailbox payload handling corner cases" * tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/region: Recycle region ids cxl/region: Fix 'distance' calculation with passthrough ports tools/testing/cxl: Add a single-port host-bridge regression config tools/testing/cxl: Fix some error exits cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak cxl/region: Fix cxl_region leak, cleanup targets at region delete cxl/region: Fix region HPA ordering validation cxl/pmem: Use size_add() against integer overflow cxl/region: Fix decoder allocation crash ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA. cxl/region: Fix null pointer dereference due to pass through decoder commit cxl/mbox: Add a check on input payload size
| * cxl/region: Recycle region idsDan Williams2022-11-051-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At region creation time the next region-id is atomically cached so that there is predictability of region device names. If that region is destroyed and then a new one is created the region id increments. That ends up looking like a memory leak, or is otherwise surprising that identifiers roll forward even after destroying all previously created regions. Try to reuse rather than free old region ids at region release time. While this fixes a cosmetic issue, the needlessly advancing memory region-id gives the appearance of a memory leak, hence the "Fixes" tag, but no "Cc: stable" tag. Cc: Ben Widawsky <bwidawsk@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Fixes: 779dd20cfb56 ("cxl/region: Add region creation support") Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166752186062.947915.13200195701224993317.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/region: Fix 'distance' calculation with passthrough portsDan Williams2022-11-053-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When programming port decode targets, the algorithm wants to ensure that two devices are compatible to be programmed as peers beneath a given port. A compatible peer is a target that shares the same dport, and where that target's interleave position also routes it to the same dport. Compatibility is determined by the device's interleave position being >= to distance. For example, if a given dport can only map every Nth position then positions less than N away from the last target programmed are incompatible. The @distance for the host-bridge's cxl_port in a simple dual-ported host-bridge configuration with 2 direct-attached devices is 1, i.e. An x2 region divided by 2 dports to reach 2 region targets. An x4 region under an x2 host-bridge would need 2 intervening switches where the @distance at the host bridge level is 2 (x4 region divided by 2 switches to reach 4 devices). However, the distance between peers underneath a single ported host-bridge is always zero because there is no limit to the number of devices that can be mapped. In other words, there are no decoders to program in a passthrough, all descendants are mapped and distance only starts matters for the intervening descendant ports of the passthrough port. Add tracking for the number of dports mapped to a port, and use that to detect the passthrough case for calculating @distance. Cc: <stable@vger.kernel.org> Reported-by: Bobo WL <lmw.bobo@gmail.com> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: http://lore.kernel.org/r/20221010172057.00001559@huawei.com Fixes: 27b3f8d13830 ("cxl/region: Program target lists") Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166752185440.947915.6617495912508299445.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * tools/testing/cxl: Add a single-port host-bridge regression configDan Williams2022-11-051-19/+278
| | | | | | | | | | | | | | | | | | | | | | | | | | Jonathan reports that region creation fails when a single-port host-bridge connects to a multi-port switch. Mock up that configuration so a fix can be tested and regression tested going forward. Reported-by: Bobo WL <lmw.bobo@gmail.com> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: http://lore.kernel.org/r/20221010172057.00001559@huawei.com Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166752184838.947915.2167957540894293891.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * tools/testing/cxl: Fix some error exitsDan Williams2022-11-051-2/+2
| | | | | | | | | | | | | | | | | | | | Fix a few typos where 'goto err_port' was used rather than the object specific cleanup. Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166752184255.947915.16163477849330181425.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/pmem: Fix cxl_pmem_region and cxl_memdev leakDan Williams2022-11-043-37/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a cxl_nvdimm object goes through a ->remove() event (device physically removed, nvdimm-bridge disabled, or nvdimm device disabled), then any associated regions must also be disabled. As highlighted by the cxl-create-region.sh test [1], a single device may host multiple regions, but the driver was only tracking one region at a time. This leads to a situation where only the last enabled region per nvdimm device is cleaned up properly. Other regions are leaked, and this also causes cxl_memdev reference leaks. Fix the tracking by allowing cxl_nvdimm objects to track multiple region associations. Cc: <stable@vger.kernel.org> Link: https://github.com/pmem/ndctl/blob/main/test/cxl-create-region.sh [1] Reported-by: Vishal Verma <vishal.l.verma@intel.com> Fixes: 04ad63f086d1 ("cxl/region: Introduce cxl_pmem_region objects") Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166752183647.947915.2045230911503793901.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/region: Fix cxl_region leak, cleanup targets at region deleteDan Williams2022-11-041-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a region is deleted any targets that have been previously assigned to that region hold references to it. Trigger those references to drop by detaching all targets at unregister_region() time. Otherwise that region object will leak as userspace has lost the ability to detach targets once region sysfs is torn down. Cc: <stable@vger.kernel.org> Fixes: b9686e8c8e39 ("cxl/region: Enable the assignment of endpoint decoders to regions") Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166752183055.947915.17681995648556534844.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/region: Fix region HPA ordering validationDan Williams2022-11-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some regions may not have any address space allocated. Skip them when validating HPA order otherwise a crash like the following may result: devm_cxl_add_region: cxl_acpi cxl_acpi.0: decoder3.4: created region9 BUG: kernel NULL pointer dereference, address: 0000000000000000 [..] RIP: 0010:store_targetN+0x655/0x1740 [cxl_core] [..] Call Trace: <TASK> kernfs_fop_write_iter+0x144/0x200 vfs_write+0x24a/0x4d0 ksys_write+0x69/0xf0 do_syscall_64+0x3a/0x90 store_targetN+0x655/0x1740: alloc_region_ref at drivers/cxl/core/region.c:676 (inlined by) cxl_port_attach_region at drivers/cxl/core/region.c:850 (inlined by) cxl_region_attach at drivers/cxl/core/region.c:1290 (inlined by) attach_target at drivers/cxl/core/region.c:1410 (inlined by) store_targetN at drivers/cxl/core/region.c:1453 Cc: <stable@vger.kernel.org> Fixes: 384e624bb211 ("cxl/region: Attach endpoint decoders") Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/166752182461.947915.497032805239915067.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/pmem: Use size_add() against integer overflowYu Zhe2022-11-031-1/+1
| | | | | | | | | | | | | | | | | | "struct_size() + n" may cause a integer overflow, use size_add() to handle it. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Link: https://lore.kernel.org/r/20220927070247.23148-1-yuzhe@nfschina.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/region: Fix decoder allocation crashVishal Verma2022-11-011-26/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an intermediate port's decoders have been exhausted by existing regions, and creating a new region with the port in question in it's hierarchical path is attempted, cxl_port_attach_region() fails to find a port decoder (as would be expected), and drops into the failure / cleanup path. However, during cleanup of the region reference, a sanity check attempts to dereference the decoder, which in the above case didn't exist. This causes a NULL pointer dereference BUG. To fix this, refactor the decoder allocation and de-allocation into helper routines, and in this 'free' routine, check that the decoder, @cxld, is valid before attempting any operations on it. Cc: <stable@vger.kernel.org> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Fixes: 384e624bb211 ("cxl/region: Attach endpoint decoders") Link: https://lore.kernel.org/r/20221101074100.1732003-1-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes setDan Williams2022-10-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ACPI CEDT.CFMWS indicates a range of possible address where new CXL regions can appear. Each range is associated with a QTG id (QoS Throttling Group id). For each range + QTG pair that is not covered by a proximity domain in the SRAT, Linux creates a new NUMA node. However, the commit that added the new ranges missed updating the node_possible mask which causes memory_group_register() to fail. Add the new nodes to the nodes_possible mask. Cc: <stable@vger.kernel.org> Fixes: fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT") Cc: Alison Schofield <alison.schofield@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/166631003537.1167078.9373680312035292395.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.Jonathan Cameron2022-10-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Writes to the device must include an offset and size as defined in CXL 2.0 8.2.9.5.2.4 Set LSA (Opcode 4103h) Fixes tag is non obvious as this code has been through several reworks and variable names + wasn't in use until the addition of the region code. Due to a bug in QEMU CXL emulation this overrun resulted in QEMU crashing. Reported-by: Bobo WL <lmw.bobo@gmail.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Fixes: 60b8f17215de ("cxl/pmem: Translate NVDIMM label commands to CXL label commands") Link: https://lore.kernel.org/r/20220815154044.24733-3-Jonathan.Cameron@huawei.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/region: Fix null pointer dereference due to pass through decoder commitJonathan Cameron2022-10-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not all decoders have a commit callback. The CXL specification allows a host bridge with a single root port to have no explicit HDM decoders. Currently the region driver assumes there are none. As such the CXL core creates a special pass through decoder instance without a commit callback. Prior to this patch, the ->commit() callback was called unconditionally. Thus a configuration with 1 Host Bridge, 1 Root Port, 1 switch with multiple downstream ports below which there are multiple CXL type 3 devices results in a situation where committing the region causes a null pointer dereference. Reported-by: Bobo WL <lmw.bobo@gmail.com> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220818164210.2084-1-Jonathan.Cameron@huawei.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * cxl/mbox: Add a check on input payload sizeJonathan Cameron2022-10-211-1/+1
| | | | | | | | | | | | | | | | | | | | A bug in the LSA code resulted in transfers slightly larger than the mailbox size. Let us make it easier to catch similar issues in future by adding a low level check. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20220815154044.24733-2-Jonathan.Cameron@huawei.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | Merge tag 'hwmon-for-v6.1-rc4' of ↵Linus Torvalds2022-11-062-14/+103
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fixes from Guenter Roeck: "Fix two regressions: - Commit 54cc3dbfc10d ("hwmon: (pmbus) Add regulator supply into macro") resulted in regulator undercount when disabling regulators. Revert it. - The thermal subsystem rework caused the scmi driver to no longer register with the thermal subsystem because index values no longer match. To fix the problem, the scmi driver now directly registers with the thermal subsystem, no longer through the hwmon core" * tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: Revert "hwmon: (pmbus) Add regulator supply into macro" hwmon: (scmi) Register explicitly with Thermal Framework
| * | Revert "hwmon: (pmbus) Add regulator supply into macro"Guenter Roeck2022-11-051-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 54cc3dbfc10dc3db7cb1cf49aee4477a8398fbde. Zev Weiss reports that the reverted patch may cause a regulator undercount. Here is his report: ... having regulator-dummy set as a supply on my PMBus regulators (instead of having them as their own top-level regulators without an upstream supply) leads to enable-count underflow errors when disabling them: # echo 0 > /sys/bus/platform/devices/efuse01/state [ 906.094477] regulator-dummy: Underflow of regulator enable count [ 906.100563] Failed to disable vout: -EINVAL [ 136.992676] reg-userspace-consumer efuse01: Failed to configure state: -22 Zev reports that reverting the patch fixes the problem. So let's do that for now. Fixes: 54cc3dbfc10d ("hwmon: (pmbus) Add regulator supply into macro") Cc: Marcello Sylvester Bauer <sylv@sylv.io> Reported-by: Zev Weiss <zev@bewilderbeest.net> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
| * | hwmon: (scmi) Register explicitly with Thermal FrameworkCristian Marussi2022-11-051-13/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Available sensors are enumerated and reported by the SCMI platform server using a 16bit identification number; not all such sensors are of a type supported by hwmon subsystem and, among the supported ones, only a subset could be temperature sensors that have to be registered with the Thermal Framework. Potential clashes between hwmon channels indexes and the underlying real sensors IDs do not play well with the hwmon<-->thermal bridge automatic registration routines and could need a sensible number of fake dummy sensors to be made up in order to keep indexes and IDs in sync. Avoid to use the hwmon<-->thermal bridge dropping the HWMON_C_REGISTER_TZ attribute and instead explicit register temperature sensors directly with the Thermal Framework. Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: linux-hwmon@vger.kernel.org Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20221031114018.59048-1-cristian.marussi@arm.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
* | | Merge tag 'perf_urgent_for_v6.1_rc4' of ↵Linus Torvalds2022-11-064-11/+18
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Add Cooper Lake's stepping to the PEBS guest/host events isolation fixed microcode revisions checking quirk - Update Icelake and Sapphire Rapids events constraints - Use the standard energy unit for Sapphire Rapids in RAPL - Fix the hw_breakpoint test to fail more graciously on !SMP configs * tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[] perf/x86/intel: Fix pebs event constraints for SPR perf/x86/intel: Fix pebs event constraints for ICL perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain perf/hw_breakpoint: test: Skip the test if dependencies unmet
| * | | perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]Kan Liang2022-11-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The intel_pebs_isolation quirk checks both model number and stepping. Cooper Lake has a different stepping (11) than the other Skylake Xeon. It cannot benefit from the optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary work in guest filtering"). Add the stepping of Cooper Lake into the isolation_ucodes[] table. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221031154550.571663-1-kan.liang@linux.intel.com
| * | | perf/x86/intel: Fix pebs event constraints for SPRKan Liang2022-11-021-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the latest event list, update the MEM_INST_RETIRED events which support the DataLA facility for SPR. Fixes: 61b985e3e775 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221031154119.571386-2-kan.liang@linux.intel.com
| * | | perf/x86/intel: Fix pebs event constraints for ICLKan Liang2022-11-021-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the latest event list, update the MEM_INST_RETIRED events which support the DataLA facility. Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support") Reported-by: Jannis Klinkenberg <jannis.klinkenberg@rwth-aachen.de> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221031154119.571386-1-kan.liang@linux.intel.com
| * | | perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domainZhang Rui2022-11-021-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel Xeon servers used to use a fixed energy resolution (15.3uj) for Dram RAPL domain. But on SPR, Dram RAPL domain follows the standard energy resolution as described in MSR_RAPL_POWER_UNIT. Remove the SPR Dram energy unit quirk. Fixes: bcfd218b6679 ("perf/x86/rapl: Add support for Intel SPR platform") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Wang Wendy <wendy.wang@intel.com> Link: https://lkml.kernel.org/r/20220924054738.12076-3-rui.zhang@intel.com
| * | | perf/hw_breakpoint: test: Skip the test if dependencies unmetDavid Gow2022-11-021-2/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running the test currently fails on non-SMP systems, despite being enabled by default. This means that running the test with: ./tools/testing/kunit/kunit.py run --arch x86_64 hw_breakpoint results in every hw_breakpoint test failing with: # test_one_cpu: failed to initialize: -22 not ok 1 - test_one_cpu Instead, use kunit_skip(), which will mark the test as skipped, and give a more comprehensible message: ok 1 - test_one_cpu # SKIP not enough cpus This makes it more obvious that the test is not suited to the test environment, and so wasn't run, rather than having run and failed. Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Latypov <dlatypov@google.com> Acked-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20221026141040.1609203-1-davidgow@google.com
* | | Merge tag 'x86_urgent_for_v6.1_rc4' of ↵Linus Torvalds2022-11-063-9/+29
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add new Intel CPU models - Enforce that TDX guests are successfully loaded only on TDX hardware where virtualization exception (#VE) delivery on kernel memory is disabled because handling those in all possible cases is "essentially impossible" - Add the proper include to the syscall wrappers so that BTF can see the real pt_regs definition and not only the forward declaration * tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add several Intel server CPU model numbers x86/tdx: Panic on bad configs that #VE on "private" memory access x86/tdx: Prepare for using "INFO" call for a second purpose x86/syscall: Include asm/ptrace.h in syscall_wrapper header
| * | | x86/cpu: Add several Intel server CPU model numbersTony Luck2022-11-041-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These servers are all on the public versions of the roadmap. The model numbers for Grand Ridge, Granite Rapids, and Sierra Forest were included in the September 2022 edition of the Instruction Set Extensions document. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20221103203310.5058-1-tony.luck@intel.com
| * | | x86/tdx: Panic on bad configs that #VE on "private" memory accessKirill A. Shutemov2022-11-021-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
| * | | x86/tdx: Prepare for using "INFO" call for a second purposeDave Hansen2022-11-011-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TDG.VP.INFO TDCALL provides the guest with various details about the TDX system that the guest needs to run. Only one field is currently used: 'gpa_width' which tells the guest which PTE bits mark pages shared or private. A second field is now needed: the guest "TD attributes" to tell if virtualization exceptions are configured in a way that can harm the guest. Make the naming and calling convention more generic and discrete from the mask-centric one. Thanks to Sathya for the inspiration here, but there's no code, comments or changelogs left from where he started. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org
| * | | x86/syscall: Include asm/ptrace.h in syscall_wrapper headerJiri Olsa2022-10-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With just the forward declaration of the 'struct pt_regs' in syscall_wrapper.h, the syscall stub functions: __[x64|ia32]_sys_*(struct pt_regs *regs) will have different definition of 'regs' argument in BTF data based on which object file they are defined in. If the syscall's object includes 'struct pt_regs' definition, the BTF argument data will point to a 'struct pt_regs' record, like: [226] STRUCT 'pt_regs' size=168 vlen=21 'r15' type_id=1 bits_offset=0 'r14' type_id=1 bits_offset=64 'r13' type_id=1 bits_offset=128 ... If not, it will point to a fwd declaration record: [15439] FWD 'pt_regs' fwd_kind=struct and make bpf tracing program hooking on those functions unable to access fields from 'struct pt_regs'. Include asm/ptrace.h directly in syscall_wrapper.h to make sure all syscalls see 'struct pt_regs' definition. This then results in BTF for '__*_sys_*(struct pt_regs *regs)' functions to point to the actual struct, not just the forward declaration. [ bp: No Fixes tag as this is not really a bug fix but "adjustment" so that BTF is happy. ] Reported-by: Akihiro HARAI <jharai0815@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: <stable@vger.kernel.org> # this is needed only for BTF so kernels >= 5.15 Link: https://lore.kernel.org/r/20221018122708.823792-1-jolsa@kernel.org
* | | | Merge tag 'kbuild-fixes-v6.1-2' of ↵Linus Torvalds2022-11-064-21/+16
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Use POSIX-compatible grep options - Document git-related tips for reproducible builds - Fix a typo in the modpost rule - Suppress SIGPIPE error message from gcc-ar and llvm-ar - Fix segmentation fault in the menuconfig search * tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kconfig: fix segmentation fault in menuconfig search kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-ar kbuild: fix typo in modpost Documentation: kbuild: Add description of git for reproducible builds kbuild: use POSIX-compatible grep option
| * | | | kconfig: fix segmentation fault in menuconfig searchMasahiro Yamada2022-11-021-19/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit d05377e184fc ("kconfig: Create links to main menu items in search"), menuconfig shows a jump key next to "Main menu" if the nearest visible parent is the rootmenu. If you press that jump key, menuconfig crashes with a segmentation fault. For example, do this: $ make ARCH=arm64 allnoconfig menuconfig Press '/' to search for the string "ACPI". Press '1' to choose "(1) Main menu". Then, menuconfig crashed with a segmentation fault. The following code in search_conf() conf(targets[i]->parent, targets[i]); results in NULL pointer dereference because targets[i] is the rootmenu, which does not have a parent. Commit d05377e184fc tried to fix the issue of top-level items not having a jump key, but adding the "Main menu" was not the right fix. The correct fix is to show the searched item itself. This fixes another weird behavior described in the comment block. Fixes: d05377e184fc ("kconfig: Create links to main menu items in search") Reported-by: Johannes Zink <j.zink@pengutronix.de> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Tested-by: Johannes Zink <j.zink@pengutronix.de>
| * | | | kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-arMasahiro Yamada2022-10-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jiri Slaby reported that building the kernel with AR=gcc-ar shows: /usr/bin/ar terminated with signal 13 [Broken pipe] Nathan Chancellor reported the latest AR=llvm-ar shows: error: write on a pipe with no reader The latter occurs since LLVM commit 51b557adc131 ("Add an error message to the default SIGPIPE handler"). The resulting vmlinux is correct, but it is better to silence it. 'head -n1' exits after reading the first line, so the pipe is closed. Use 'sed -n 1p' to eat the stream till the end. Fixes: 321648455061 ("kbuild: use obj-y instead extra-y for objects placed at the head") Link: https://github.com/ClangBuiltLinux/linux/issues/1651 Reported-by: Jiri Slaby <jirislaby@kernel.org> Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org>
| * | | | kbuild: fix typo in modpostWill McVicker2022-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f73edc8951b2 ("kbuild: unify two modpost invocations") introduced a typo (moudle.symvers-if-present) which results in the kernel's Module.symvers to not be included as a prerequisite for $(KBUILD_EXTMOD)/Module.symvers. Fix the typo to restore the intended functionality. Fixes: f73edc8951b2 ("kbuild: unify two modpost invocations") Signed-off-by: Will McVicker <willmcvicker@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
| * | | | Documentation: kbuild: Add description of git for reproducible buildsDan Li2022-10-271-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The status of git will affect the final compilation result, add it to the documentation of reproducible builds. Signed-off-by: Dan Li <ashimida@linux.alibaba.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
| * | | | kbuild: use POSIX-compatible grep optionStefan Hansson2022-10-271-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | --file is a GNU extension to grep which is not available in all implementations (such as BusyBox). Use the -f option instead which is eqvuialent according to the GNU grep manpage[1] and is present in POSIX[2]. [1] https://www.gnu.org/software/grep/manual/grep.html [2] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html Signed-off-by: Stefan Hansson <newbie13xd@gmail.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2022-11-0611-60/+52
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "ARM: - Fix the pKVM stage-1 walker erronously using the stage-2 accessor - Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration - Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them - Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE - Document the boot requirements for FGT when entering the kernel at EL1 x86: - Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit() - Make argument order consistent for kvcalloc() - Userspace API fixes for DEBUGCTL and LBRs" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Fix a typo about the usage of kvcalloc() KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit() KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTL KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl() KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRs arm64: booting: Document our requirements for fine grained traps with SME KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them KVM: arm64: Fix bad dereference on MTE-enabled systems KVM: arm64: Use correct accessor to parse stage-1 PTEs
| * \ \ \ Merge tag 'kvmarm-fixes-6.1-3' of ↵Paolo Bonzini2022-11-067-36/+34
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD * Fix the pKVM stage-1 walker erronously using the stage-2 accessor * Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration * Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them * Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE * Document the boot requirements for FGT when entering the kernel at EL1
| | * | | | arm64: booting: Document our requirements for fine grained traps with SMEMark Brown2022-11-011-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With SME we require that fine grained traps on access to TPIDR2_EL0 and SMPRI_EL1 are disabled but did not document that fact. Add the relevant register bits. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221101112716.52035-2-broonie@kernel.org
| | * | | | KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHEMarc Zyngier2022-11-013-34/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The trapping of SMPRI_EL1 and TPIDR2_EL0 currently only really work on nVHE, as only this mode uses the fine-grained trapping that controls these two registers. Move the trapping enable/disable code into __{de,}activate_traps_common(), allowing it to be called when it actually matters on VHE, and remove the flipping of EL2 control for TPIDR2_EL0, which only affects the host access of this register. Fixes: 861262ab8627 ("KVM: arm64: Handle SME host state when running guests") Reported-by: Mark Brown <broonie@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/86bkpqer4z.wl-maz@kernel.org
| | * | | | KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling themGavin Shan2022-10-311-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two capabilities related to ring-based dirty page tracking: KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are supported by x86. However, arm64 supports KVM_CAP_DIRTY_LOG_RING_ACQ_REL only when the feature is supported on arm64. The userspace doesn't have to enable the advertised capability, meaning KVM_CAP_DIRTY_LOG_RING can be enabled on arm64 by userspace and it's wrong. Fix it by double checking if the capability has been advertised prior to enabling it. It's rejected to enable the capability if it hasn't been advertised. Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option") Reported-by: Sean Christopherson <seanjc@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com
| | * | | | KVM: arm64: Fix bad dereference on MTE-enabled systemsRyan Roberts2022-10-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | enter_exception64() performs an MTE check, which involves dereferencing vcpu->kvm. While vcpu has already been fixed up to be a HYP VA pointer, kvm is still a pointer in the kernel VA space. This only affects nVHE configurations with MTE enabled, as in other cases, the pointer is either valid (VHE) or not dereferenced (!MTE). Fix this by first converting kvm to a HYP VA pointer. Fixes: ea7fc1bb1cd1 ("KVM: arm64: Introduce MTE VM feature") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221027120945.29679-1-ryan.roberts@arm.com
| | * | | | KVM: arm64: Use correct accessor to parse stage-1 PTEsQuentin Perret2022-10-251-1/+1
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hyp_get_page_state() is used with pKVM to retrieve metadata about a page by parsing a hypervisor stage-1 PTE. However, it incorrectly uses a helper which parses *stage-2* mappings. Ouch. Luckily, pkvm_getstate() only looks at the software bits, which happen to be in the same place for stage-1 and stage-2 PTEs, and this all ends up working correctly by accident. But clearly, we should do better. Fix hyp_get_page_state() to use the correct helper. Fixes: e82edcc75c4e ("KVM: arm64: Implement do_share() helper for sharing memory") Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221025145156.855308-1-qperret@google.com