linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge tag 'gfs2-for-6.13' of ↵	Linus Torvalds	2024-11-26	9	-114/+142
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Fix the code that cleans up left-over unlinked files. Various fixes and minor improvements in deleting files cached or held open remotely. - Simplify the use of dlm's DLM_LKF_QUECVT flag. - A few other minor cleanups. * tag 'gfs2-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (21 commits) gfs2: Prevent inode creation race gfs2: Only defer deletes when we have an iopen glock gfs2: Simplify DLM_LKF_QUECVT use gfs2: gfs2_evict_inode clarification gfs2: Make gfs2_inode_refresh static gfs2: Use get_random_u32 in gfs2_orlov_skip gfs2: Randomize GLF_VERIFY_DELETE work delay gfs2: Use mod_delayed_work in gfs2_queue_try_to_evict gfs2: Update to the evict / remote delete documentation gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode gfs2: Clean up delete work processing gfs2: Minor delete_work_func cleanup gfs2: Return enum evict_behavior from gfs2_upgrade_iopen_glock gfs2: Rename dinode_demise to evict_behavior gfs2: Rename GIF_{DEFERRED -> DEFER}_DELETE gfs2: Faster gfs2_upgrade_iopen_glock wakeups KMSAN: uninit-value in inode_go_dump (5) gfs2: Fix unlinked inode cleanup gfs2: Allow immediate GLF_VERIFY_DELETE work gfs2: Initialize gl_no_formal_ino earlier ...
\| *	gfs2: Prevent inode creation race	Andreas Gruenbacher	2024-11-19	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a request to evict an inode comes in over the network, we are trying to grab an inode reference via the iopen glock's gl_object pointer. There is a very small probability that by the time such a request comes in, inode creation hasn't completed and the I_NEW flag is still set. To deal with that, wait for the inode and then check if inode creation was successful. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Only defer deletes when we have an iopen glock	Andreas Gruenbacher	2024-11-19	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The mechanism to defer deleting unlinked inodes is tied to delete_work_func(), which is tied to iopen glocks. When we don't have an iopen glock, we must carry out deletes immediately instead. Fixes a NULL pointer dereference in gfs2_evict_inode(). Fixes: 8c21c2c71e66 ("gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Simplify DLM_LKF_QUECVT use	Andreas Gruenbacher	2024-11-05	1	-4/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The DLM_LKF_QUECVT flag needs to be set for "upward" lock conversions to ensure fairness, but setting it for "downward" lock conversions will lead to a failure. The flag is currently set based on the GLF_BLOCKING flag and it's not immediately obvious why this is correct. Simplify things by figuring out if a lock conversion is "upward" by looking at the before and after locking modes instead of relying on the GLF_BLOCKING flag. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: gfs2_evict_inode clarification	Andreas Gruenbacher	2024-11-05	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When function evict_should_delete() returns SHOULD_DEFER_EVICTION, gh is never initialized, but that isn't obvious; if it did initialize gh and then return SHOULD_DEFER_EVICTION, gfs2_evict_inode() would fail to release it. To clarify the code, change gfs2_evict_inode() to always check if gh needs to be released, no matter what evict_should_delete() returns. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Make gfs2_inode_refresh static	Andreas Gruenbacher	2024-11-05	2	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Function gfs2_inode_refresh() is only used in fs/gfs2/glops.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Use get_random_u32 in gfs2_orlov_skip	Andreas Gruenbacher	2024-11-05	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use get_random_u32() instead of get_random_bytes() to remove the last remaining call to get_random_bytes(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Randomize GLF_VERIFY_DELETE work delay	Andreas Gruenbacher	2024-11-05	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Randomize the delay of GLF_VERIFY_DELETE work. This avoids thundering herd problems when multiple nodes schedule that kind of work in response to an inode being unlinked remotely. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Use mod_delayed_work in gfs2_queue_try_to_evict	Andreas Gruenbacher	2024-11-05	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the unlikely case that we're trying to queue GLF_TRY_TO_EVICT work for an inode that already has GLF_VERIFY_DELETE work queued, we want to make sure that the GLF_TRY_TO_EVICT work gets scheduled immediately instead of waiting for the delayed work timer to expire. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Update to the evict / remote delete documentation	Andreas Gruenbacher	2024-11-05	2	-26/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Try to be a bit more clear and remove some duplications. We cannot actually get rid of the verification step eventually, so remove the comment saying so. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode	Andreas Gruenbacher	2024-11-05	2	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Move calls to gfs2_queue_verify_delete() into gfs2_evict_inode(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Clean up delete work processing	Andreas Gruenbacher	2024-11-05	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Function delete_work_func() was previously assuming that the GLF_TRY_TO_EVICT and GLF_VERIFY_DELETE flags won't both be set at the same time, but there probably are races in which that can happen, so handle that case correctly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Minor delete_work_func cleanup	Andreas Gruenbacher	2024-11-05	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Move those definitions into the the scope in which they are used. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Return enum evict_behavior from gfs2_upgrade_iopen_glock	Andreas Gruenbacher	2024-11-05	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In case an iopen glock cannot be upgraded, function gfs2_upgrade_iopen_glock() needs to communicate to gfs2_evict_inode() whether deleting the inode should be deferred or skipped altogether. Change the function to return the appropriate enum evict_behavior value to indicate that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Rename dinode_demise to evict_behavior	Andreas Gruenbacher	2024-11-05	1	-18/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rename enum dinode_demise to evict_behavior and its items SHOULD_DELETE_DINODE to EVICT_SHOULD_DELETE, SHOULD_NOT_DELETE_DINODE to EVICT_SHOULD_SKIP_DELETE, and SHOULD_DEFER_EVICTION to EVICT_SHOULD_DEFER_DELETE. In gfs2_evict_inode(), add a separate variable of type enum evict_behavior instead of implicitly casting to int. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Rename GIF_{DEFERRED -> DEFER}_DELETE	Andreas Gruenbacher	2024-11-05	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The GIF_DEFERRED_DELETE flag indicates an action that gfs2_evict_inode() should take, so rename the flag to GIF_DEFER_DELETE to clarify. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Faster gfs2_upgrade_iopen_glock wakeups	Andreas Gruenbacher	2024-11-05	3	-18/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move function needs_demote() to glock.h and rename it to glock_needs_demote(). In handle_callback(), wake up the glock when setting the GLF_PENDING_DEMOTE flag as well. (Setting the GLF_DEMOTE flag already triggered a wake-up.) With that, check for glock_needs_demote() in gfs2_upgrade_iopen_glock() to wake up when either of those flags is set for the inode glock: the faster we can react to contention, the better. The GLF_PENDING_DEMOTE flag is only used for inode glocks (see gfs2_glock_cb()) so it's okay to only check for the GLF_DEMOTE flag in gfs2_drop_inode(). Still, using glock_needs_demote() there as well makes the code a little easier to read. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	KMSAN: uninit-value in inode_go_dump (5)	Qianqiang Liu	2024-10-22	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When mounting of a corrupted disk image fails, the error message printed can reference uninitialized inode fields. To prevent that from happening, always initialize those fields. Reported-by: syzbot+aa0730b0a42646eb1359@syzkaller.appspotmail.com Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Fix unlinked inode cleanup	Andreas Gruenbacher	2024-09-25	4	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before commit f0e56edc2ec7 ("gfs2: Split the two kinds of glock "delete" work"), function delete_work_func() was used to trigger the eviction of in-memory inodes from remote as well as deleting unlinked inodes at a later point. These two kinds of work were then split into two kinds of work, and the two places in the code were deferred deletion of inodes is required accidentally ended up queuing the wrong kind of work. This caused unlinked inodes to be left behind, which could in the worst case fill up filesystems and require a filesystem check to recover. Fix that by queuing the right kind of work in try_rgrp_unlink() and gfs2_drop_inode(). Fixes: f0e56edc2ec7 ("gfs2: Split the two kinds of glock "delete" work") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Allow immediate GLF_VERIFY_DELETE work	Andreas Gruenbacher	2024-09-25	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add an argument to gfs2_queue_verify_delete() that allows it to queue GLF_VERIFY_DELETE work for immediate execution. This is used in the next patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Initialize gl_no_formal_ino earlier	Andreas Gruenbacher	2024-09-24	3	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Set gl_no_formal_ino of the iopen glock to the generation of the associated inode (ip->i_no_formal_ino) as soon as that value is known. This saves us from setting it later, possibly repeatedly, when queuing GLF_VERIFY_DELETE work. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| *	gfs2: Rename GLF_VERIFY_EVICT to GLF_VERIFY_DELETE	Andreas Gruenbacher	2024-09-24	2	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rename the GLF_VERIFY_EVICT flag to GLF_VERIFY_DELETE: that flag indicates that we want to delete an inode / verify that it has been deleted. To match, rename gfs2_queue_verify_evict() to gfs2_queue_verify_delete(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
* \|	Merge tag 'mm-stable-2024-11-18-19-27' of ↵	Linus Torvalds	2024-11-23	1	-1/+1
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro\|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
\| * \|	mm/list_lru: simplify the list_lru walk callback function	Kairui Song	2024-11-12	1	-1/+1
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* \|	Merge tag 'vfs-6.13.file' of ↵	Linus Torvalds	2024-11-18	1	-10/+2
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This contains changes the changes for files for this cycle: - Introduce a new reference counting mechanism for files. As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has O(N^2) behaviour under contention with N concurrent operations and it is in a hot path in __fget_files_rcu(). The rcuref infrastructures remedies this problem by using an unconditional increment relying on safe- and dead zones to make this work and requiring rcu protection for the data structure in question. This not just scales better it also introduces overflow protection. However, in contrast to generic rcuref, files require a memory barrier and thus cannot rely on _relaxed() atomic operations and also require to be built on atomic_long_t as having massive amounts of reference isn't unheard of even if it is just an attack. This adds a file specific variant instead of making this a generic library. This has been tested by various people and it gives consistent improvement up to 3-5% on workloads with loads of threads. - Add a fastpath for find_next_zero_bit(). Skip 2-levels searching via find_next_zero_bit() when there is a free slot in the word that contains the next fd. This improves pts/blogbench-1.1.0 read by 8% and write by 4% on Intel ICX 160. - Conditionally clear full_fds_bits since it's very likely that a bit in full_fds_bits has been cleared during __clear_open_fds(). This improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on Intel ICX 160. - Get rid of all lookup__fdget_rcu() variants. They were used to lookup files without taking a reference count. That became invalid once files were switched to SLAB_TYPESAFE_BY_RCU and now we're always taking a reference count. Switch to an already existing helper and remove the legacy variants. - Remove pointless includes of <linux/fdtable.h>. - Avoid cmpxchg() in close_files() as nobody else has a reference to the files_struct at that point. - Move close_range() into fs/file.c and fold __close_range() into it. - Cleanup calling conventions of alloc_fdtable() and expand_files(). - Merge __{set,clear}_close_on_exec() into one. - Make __set_open_fd() set cloexec as well instead of doing it in two separate steps" * tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor fs: port files to file_ref fs: add file_ref expand_files(): simplify calling conventions make __set_open_fd() set cloexec state as well fs: protect backing files with rcu file.c: merge __{set,clear}_close_on_exec() alloc_fdtable(): change calling conventions. fs/file.c: add fast path in find_next_fd() fs/file.c: conditionally clear full_fds fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() move close_range(2) into fs/file.c, fold __close_range() into it close_files(): don't bother with xchg() remove pointless includes of <linux/fdtable.h> get rid of ...lookup...fdget_rcu() family
\| * \|	get rid of ...lookup...fdget_rcu() family	Al Viro	2024-10-07	1	-10/+2
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Once upon a time, predecessors of those used to do file lookup without bumping a refcount, provided that caller held rcu_read_lock() across the lookup and whatever it wanted to read from the struct file found. When struct file allocation switched to SLAB_TYPESAFE_BY_RCU, that stopped being feasible and these primitives started to bump the file refcount for lookup result, requiring the caller to call fput() afterwards. But that turned them pointless - e.g. rcu_read_lock(); file = lookup_fdget_rcu(fd); rcu_read_unlock(); is equivalent to file = fget_raw(fd); and all callers of lookup_fdget_rcu() are of that form. Similarly, task_lookup_fdget_rcu() calls can be replaced with calling fget_task(). task_lookup_next_fdget_rcu() doesn't have direct counterparts, but its callers would be happier if we replaced it with an analogue that deals with RCU internally. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	Merge patch series "Fixup NLM and kNFSD file lock callbacks"	Christian Brauner	2024-10-02	2	-1/+2
\|\ \ \| \|/ \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Benjamin Coddington <bcodding@redhat.com> says: Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients. This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock, but now that I've noticed it I can't help but try to fix it because there are big advantages for setups that might depend on timely lock notifications, and we've supported that as a feature for a long time. Its important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation. However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS. I tried to fix this by simply including the old check for lock(), but the resulting include mess and layering violations was more than I could accept. There's a much cleaner way presented here using an fop_flag, which while potentially flag-greedy, greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike. * patches from https://lore.kernel.org/r/cover.1726083391.git.bcodding@redhat.com: exportfs: Remove EXPORT_OP_ASYNC_LOCK NLM/NFSD: Fix lock notifications for async-capable filesystems gfs2/ocfs2: set FOP_ASYNC_LOCK fs: Introduce FOP_ASYNC_LOCK NFS: trace: show TIMEDOUT instead of 0x6e nfsd: use system_unbound_wq for nfsd_file_gc_worker() nfsd: count nfsd_file allocations nfsd: fix refcount leak when file is unhashed after being found nfsd: remove unneeded EEXIST error check in nfsd_do_file_acquire nfsd: add list_head nf_gc to struct nfsd_file Link: https://lore.kernel.org/r/cover.1726083391.git.bcodding@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
\| *	exportfs: Remove EXPORT_OP_ASYNC_LOCK	Benjamin Coddington	2024-10-01	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that GFS2 and OCFS2 are signalling async ->lock() support with FOP_ASYNC_LOCK and checks for support are converted, we can remove EXPORT_OP_ASYNC_LOCK. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/0a114db814fec3086f937ae3d44a086f13b8de26.1726083391.git.bcodding@redhat.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
\| *	gfs2/ocfs2: set FOP_ASYNC_LOCK	Benjamin Coddington	2024-09-12	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both GFS2 and OCFS2 use DLM locking, which will allow async lock requests. Signal this support by setting FOP_ASYNC_LOCK. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/fc4163dbbf33c58e5a8b8ee8cb8c57e555f53ce5.1726083391.git.bcodding@redhat.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* \|	Merge tag 'gfs2-v6.10-fixes' of ↵	Linus Torvalds	2024-09-23	5	-51/+27
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 update from Andreas Gruenbacher: - Convert the writepage address space operation to writepages (Matthew Wilcox) - A syzkaller fix (by Julian Sun) and a minor cleanup (Andreas Gruenbacher) * tag 'gfs2-v6.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Remove gfs2_aspace_writepage() gfs2: Remove gfs2_jdata_writepage() gfs2: Remove __gfs2_writepage() gfs2: Add gfs2_aspace_writepages() gfs2: fix double destroy_workqueue error gfs2: Minor gfs2_glock_cb cleanup
\| * \|	gfs2: Remove gfs2_aspace_writepage()	Matthew Wilcox (Oracle)	2024-09-02	1	-8/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are no remaining callers of gfs2_aspace_writepage() other than vmscan, which is known to do more harm than good. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| * \|	gfs2: Remove gfs2_jdata_writepage()	Matthew Wilcox (Oracle)	2024-09-02	1	-30/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are no remaining callers of gfs2_jdata_writepage() other than vmscan, which is known to do more harm than good. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| * \|	gfs2: Remove __gfs2_writepage()	Matthew Wilcox (Oracle)	2024-09-02	1	-10/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Call aops->writepages() instead of using write_cache_pages() to call aops->writepage. Change the handling of -ENODATA to not set the persistent error on the block device. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| * \|	gfs2: Add gfs2_aspace_writepages()	Matthew Wilcox (Oracle)	2024-09-02	1	-5/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This saves one indirect function call per folio and gets us closer to removing aops->writepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| * \|	gfs2: fix double destroy_workqueue error	Julian Sun	2024-08-20	2	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When gfs2_fill_super() fails, destroy_workqueue() is called within gfs2_gl_hash_clear(), and the subsequent code path calls destroy_workqueue() on the same work queue again. This issue can be fixed by setting the work queue pointer to NULL after the first destroy_workqueue() call and checking for a NULL pointer before attempting to destroy the work queue again. Reported-by: syzbot+d34c2a269ed512c531b0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d34c2a269ed512c531b0 Fixes: 30e388d57367 ("gfs2: Switch to a per-filesystem glock workqueue") Cc: stable@vger.kernel.org Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
\| * \|	gfs2: Minor gfs2_glock_cb cleanup	Andreas Gruenbacher	2024-08-20	1	-3/+5
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \|	In gfs2_glock_cb(), we only need to calculate the glock hold time for inode glocks; the value is unused otherwise. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
* /	iomap: add a private argument for iomap_file_buffered_write	Josef Bacik	2024-09-03	1	-1/+1
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In order to switch fuse over to using iomap for buffered writes we need to be able to have the struct file for the original write, in case we have to read in the page to make it uptodate. Handle this by using the existing private field in the iomap_iter, and add the argument to iomap_file_buffered_write. This will allow us to pass the file in through the iomap buffered write path, and is flexible for any other file systems needs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
*	gfs2: Clean up glock demote logic	Andreas Gruenbacher	2024-07-09	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \|	The logic for determining when to demote a glock in glock_work_func(), introduced in commit 7cf8dcd3b68a ("GFS2: Automatically adjust glock min hold time"), doesn't make sense: inode glocks have a minimum hold time that delays demotion, while all other glocks are expected to be demoted immediately. Instead of demoting non-inode glocks immediately, glock_work_func() schedules glock work for them to be demoted, however. Get rid of that unnecessary indirection. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Revert "check for no eligible quota changes"	Andreas Gruenbacher	2024-06-20	1	-20/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	Since the previous commit, function gfs2_quota_sync() will not cause the sync generation to creep forward by one every time the function is called; this helps keep things a but more tidy. We also don't care that this function allocates a page of memory every time it is called, so no good reason for keeping qd_changed() anymore, which just duplicates qd_grab_sync(). This reverts commit 06aa6fd31a5f402b055e12ea53bb7b086359d3c8. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Be more careful with the quota sync generation	Andreas Gruenbacher	2024-06-20	1	-8/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The quota sync generation is only ever updated under sd_quota_sync_mutex by gfs2_quota_sync(), but its current value is fetched ouside of that mutex, so use WRITE_ONCE() and READ_ONCE() when accessing it without holding that mutex. Pass the current sync generation to do_sync() from its callers to ensure that we're not recording the wrong generation when the syncing is done. Also, make sure that qd->qd_sync_gen only ever moves forward. In gfs2_quota_sync(), only write the new sync generation when we know that there are changes. This eliminates the need for function sd_changed(), which we will remove in the next commit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Get rid of some unnecessary quota locking	Andreas Gruenbacher	2024-06-20	3	-28/+27
\| \| \| \| \| \| \| \| \|	With the locking the previous patch has introduced for each struct gfs2_quota_data object, sd_quota_mutex has become largely irrelevant. By waiting on the buffer head instead of waiting on the mutex in get_bh(), it becomes completely irrelevant and can be removed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Add some missing quota locking	Andreas Gruenbacher	2024-06-12	1	-29/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The quota code is missing some locking between local quota changes and syncing those quota changes to the global quota file (gfs2_quota_sync); in particular, qd->qd_change needs to be kept in sync with the QDF_CHANGE change flag and the number of references held. Use the qd->qd_lockref.lock spinlock for that. With the qd->qd_lockref.lock spinlock held, we can no longer call lockref_get(), so turn qd_hold() into a variant that assumes that the lock is held. This function is really supposed to take an additional reference when one or more references are already held, so check for that instead of checking if the lockref is dead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Fold qd_fish into gfs2_quota_sync	Andreas Gruenbacher	2024-06-08	1	-48/+29
\| \| \| \| \| \| \| \| \| \| \|	The split between qd_fish() and gfs2_quota_sync() is rather unfortunate as qd_fish() is repeatedly called to scan sdp->sd_quota_list only to find the next object to that needs syncing; if there are multiple objects on the list that need syncing, it makes more sense to grab them all in one go. This is relatively easy to do when qd_fish() is folded into gfs2_quota_sync(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: quota need_sync cleanup	Andreas Gruenbacher	2024-06-08	1	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rename variable 'value' to 'change' as it stores a change in value. Add new 'value' and 'limit' variables for the current value and limit. Only fetch the tuning parameters when we need them. Get rid of unnecessary nesting. No change in functionality. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Fix and clean up function do_qc	Andreas Gruenbacher	2024-06-08	1	-13/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Function do_qc() is supposed to be conceptually simple: it alters the current in-memory and on-disk quota change values for a given uid/gid by a given delta. If the on-disk record isn't defined yet, a new record is created. If the on-disk record exists and the resulting change value is zero, there no longer is a need for that record and so the record is deleted. On top of that, some reference counting is involved when creating and deleting records. Currently, instead of doing the above, do_qc() alters the on-disk value and then it sets the in-memory value to the on-disk value. This is incorrect when the on-disk value differs from the in-memory value. The two values are allowed to differ when quota changes are synced to the global quota file. Fix by changing both values by the same amount. In addition, do_qc() currently gets confused when the delta value is 0. It isn't supposed to be called that way, but that assumption isn't mentioned and it makes the code harder to read. Make the code more explicit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Revert "Add quota_change type"	Andreas Gruenbacher	2024-06-08	2	-15/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 432928c93779 ("gfs2: Add quota_change type") makes the incorrect assertion that function do_qc() should behave differently in the two contexts it is used in, but that isn't actually true. In all cases, do_qc() grabs a "reference" when it starts using a slot in the per-node quota changes file, and it releases that "reference" when no more residual changes remain. Revert that broken commit. There are some remaining issues with function do_qc() which are addressed in the next commit. This reverts commit 432928c9377959684c748a9bc6553ed2d3c2ea4f. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Revert "ignore negated quota changes"	Andreas Gruenbacher	2024-06-08	1	-11/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 4c6a08125f22 ("gfs2: ignore negated quota changes") skips quota changes with qd_change == 0 instead of writing them back, which leaves behind non-zero qd_change values in the affected slots. The kernel then assumes that those slots are unused, while the qd_change values on disk indicate that they are indeed still in use. The next time the filesystem is mounted, those invalid slots are read in from disk, which will cause inconsistencies. Revert that commit to avoid filesystem corruption. This reverts commit 4c6a08125f2249531ec01783a5f4317d7342add5. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: qd_check_sync cleanups	Andreas Gruenbacher	2024-06-08	1	-18/+22
\| \| \| \| \| \| \| \| \| \| \|	Rename qd_check_sync() to qd_grab_sync() and make it return a bool. Turn the sync_gen pointer into a regular u64 and pass in U64_MAX instead of a NULL pointer when sync generation checking isn't needed. Introduce a new qd_ungrab_sync() helper for undoing the effects of qd_grab_sync() if the subsequent bh_get() on the qd object fails. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Revert "introduce qd_bh_get_or_undo"	Andreas Gruenbacher	2024-06-08	1	-19/+17
\| \| \| \| \| \| \| \| \| \|	The qd_bh_get_or_undo() helper introduced by that commit doesn't improve the code much, so revert it and clean things up in a more useful way in the next commit. This reverts commit 7dbc6ae60dd7089d8ed42892b6a66c138f0aa7a0. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
*	gfs2: Check quota consistency on mount	Andreas Gruenbacher	2024-06-07	1	-6/+31
\| \| \| \| \| \| \| \| \| \| \| \| \|	In gfs2_quota_init(), make sure that the per-node "quota_change%u" file doesn't contain duplicate uids/gids. Those duplicates would cause us to acquire the glock corresponding to those ids repeatedly, which the glock code doesn't allow. When finding inconsistencies, we wipe them out and ignore them. The resulting quotas will likely be inconsistent, and running quotacheck(1) is advised. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>