summaryrefslogtreecommitdiffstats
path: root/fs/nfsd (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'nfsd-6.2-2' of ↵Linus Torvalds2023-01-042-1/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a filecache UAF during NFSD shutdown - Avoid exposing automounted mounts on NFS re-exports * tag 'nfsd-6.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix handling of readdir in v4root vs. mount upcall timeout nfsd: shut down the NFSv4 state objects before the filecache
| * nfsd: fix handling of readdir in v4root vs. mount upcall timeoutJeff Layton2023-01-021-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If v4 READDIR operation hits a mountpoint and gets back an error, then it will include that entry in the reply and set RDATTR_ERROR for it to the error. That's fine for "normal" exported filesystems, but on the v4root, we need to be more careful to only expose the existence of dentries that lead to exports. If the mountd upcall times out while checking to see whether a mountpoint on the v4root is exported, then we have no recourse other than to fail the whole operation. Cc: Steve Dickson <steved@redhat.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216777 Reported-by: JianHong Yin <yin-jianhong@163.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org>
| * nfsd: shut down the NFSv4 state objects before the filecacheJeff Layton2022-12-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we shut down the filecache before trying to clean up the stateids that depend on it. This leads to the kernel trying to free an nfsd_file twice, and a refcount overput on the nf_mark. Change the shutdown procedure to tear down all of the stateids prior to shutting down the filecache. Reported-and-tested-by: Wang Yugui <wangyugui@e16-tech.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Fixes: 5e113224c17e ("nfsd: nfsd_file cache entries should be per net namespace") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | Merge tag 'nfsd-6.2-1' of ↵Linus Torvalds2022-12-194-196/+197
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull more nfsd updates from Chuck Lever: "This contains a number of crasher fixes that were not ready for the initial pull request last week. In particular, Jeff's patch attempts to address reference count underflows in NFSD's filecache, which have been very difficult to track down because there is no reliable reproducer. Common failure modes: https://bugzilla.kernel.org/show_bug.cgi?id=216691#c11 https://bugzilla.kernel.org/show_bug.cgi?id=216674#c6 https://bugzilla.redhat.com/show_bug.cgi?id=2138605 The race windows were found by inspection and the clean-ups appear sensible and pass regression testing, so we include them here in the hope that they address the problem. However we remain vigilant because we don't have 100% certainty yet that the problem is fully addressed. Summary: - Address numerous reports of refcount underflows in NFSD's filecache - Address a UAF in callback setup error handling - Address a UAF during server-to-server copy" * tag 'nfsd-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: fix use-after-free in __nfs42_ssc_open() nfsd: under NFSv4.1, fix double svc_xprt_put on rpc_create failure nfsd: rework refcounting in filecache
| * NFSD: fix use-after-free in __nfs42_ssc_open()Dai Ngo2022-12-141-15/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem caused by source's vfsmount being unmounted but remains on the delayed unmount list. This happens when nfs42_ssc_open() return errors. Fixed by removing nfsd4_interssc_connect(), leave the vfsmount for the laundromat to unmount when idle time expires. We don't need to call nfs_do_sb_deactive when nfs42_ssc_open return errors since the file was not opened so nfs_server->active was not incremented. Same as in nfsd4_copy, if we fail to launch nfsd4_do_async_copy thread then there's no need to call nfs_do_sb_deactive Reported-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Tested-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * nfsd: under NFSv4.1, fix double svc_xprt_put on rpc_create failureDan Aloni2022-12-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On error situation `clp->cl_cb_conn.cb_xprt` should not be given a reference to the xprt otherwise both client cleanup and the error handling path of the caller call to put it. Better to delay handing over the reference to a later branch. [ 72.530665] refcount_t: underflow; use-after-free. [ 72.531933] WARNING: CPU: 0 PID: 173 at lib/refcount.c:28 refcount_warn_saturate+0xcf/0x120 [ 72.533075] Modules linked in: nfsd(OE) nfsv4(OE) nfsv3(OE) nfs(OE) lockd(OE) compat_nfs_ssc(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) rpcrdma(OE) dns_resolver fscache netfs grace rdma_cm iw_cm ib_cm sunrpc(OE) mlx5_ib mlx5_core mlxfw pci_hyperv_intf ib_uverbs ib_core xt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype nft_compat br_netfilter bridge stp llc nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set overlay nf_tables nfnetlink crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xfs serio_raw virtio_net virtio_blk net_failover failover fuse [last unloaded: sunrpc] [ 72.540389] CPU: 0 PID: 173 Comm: kworker/u16:5 Tainted: G OE 5.15.82-dan #1 [ 72.541511] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-3.module+el8.7.0+1084+97b81f61 04/01/2014 [ 72.542717] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd] [ 72.543575] RIP: 0010:refcount_warn_saturate+0xcf/0x120 [ 72.544299] Code: 55 00 0f 0b 5d e9 01 50 98 00 80 3d 75 9e 39 08 00 0f 85 74 ff ff ff 48 c7 c7 e8 d1 60 8e c6 05 61 9e 39 08 01 e8 f6 51 55 00 <0f> 0b 5d e9 d9 4f 98 00 80 3d 4b 9e 39 08 00 0f 85 4c ff ff ff 48 [ 72.546666] RSP: 0018:ffffb3f841157cf0 EFLAGS: 00010286 [ 72.547393] RAX: 0000000000000026 RBX: ffff89ac6231d478 RCX: 0000000000000000 [ 72.548324] RDX: ffff89adb7c2c2c0 RSI: ffff89adb7c205c0 RDI: ffff89adb7c205c0 [ 72.549271] RBP: ffffb3f841157cf0 R08: 0000000000000000 R09: c0000000ffefffff [ 72.550209] R10: 0000000000000001 R11: ffffb3f841157ad0 R12: ffff89ac6231d180 [ 72.551142] R13: ffff89ac6231d478 R14: ffff89ac40c06180 R15: ffff89ac6231d4b0 [ 72.552089] FS: 0000000000000000(0000) GS:ffff89adb7c00000(0000) knlGS:0000000000000000 [ 72.553175] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 72.553934] CR2: 0000563a310506a8 CR3: 0000000109a66000 CR4: 0000000000350ef0 [ 72.554874] Call Trace: [ 72.555278] <TASK> [ 72.555614] svc_xprt_put+0xaf/0xe0 [sunrpc] [ 72.556276] nfsd4_process_cb_update.isra.11+0xb7/0x410 [nfsd] [ 72.557087] ? update_load_avg+0x82/0x610 [ 72.557652] ? cpuacct_charge+0x60/0x70 [ 72.558212] ? dequeue_entity+0xdb/0x3e0 [ 72.558765] ? queued_spin_unlock+0x9/0x20 [ 72.559358] nfsd4_run_cb_work+0xfc/0x270 [nfsd] [ 72.560031] process_one_work+0x1df/0x390 [ 72.560600] worker_thread+0x37/0x3b0 [ 72.561644] ? process_one_work+0x390/0x390 [ 72.562247] kthread+0x12f/0x150 [ 72.562710] ? set_kthread_struct+0x50/0x50 [ 72.563309] ret_from_fork+0x22/0x30 [ 72.563818] </TASK> [ 72.564189] ---[ end trace 031117b1c72ec616 ]--- [ 72.566019] list_add corruption. next->prev should be prev (ffff89ac4977e538), but was ffff89ac4763e018. (next=ffff89ac4763e018). [ 72.567647] ------------[ cut here ]------------ Fixes: a4abc6b12eb1 ("nfsd: Fix svc_xprt refcnt leak when setup callback client failed") Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * nfsd: rework refcounting in filecacheJeff Layton2022-12-112-180/+189
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The filecache refcounting is a bit non-standard for something searchable by RCU, in that we maintain a sentinel reference while it's hashed. This in turn requires that we have to do things differently in the "put" depending on whether its hashed, which we believe to have led to races. There are other problems in here too. nfsd_file_close_inode_sync can end up freeing an nfsd_file while there are still outstanding references to it, and there are a number of subtle ToC/ToU races. Rework the code so that the refcount is what drives the lifecycle. When the refcount goes to zero, then unhash and rcu free the object. A task searching for a nfsd_file is allowed to bump its refcount, but only if it's not already 0. Ensure that we don't make any other changes to it until a reference is held. With this change, the LRU carries a reference. Take special care to deal with it when removing an entry from the list, and ensure that we only repurpose the nf_lru list_head when the refcount is 0 to ensure exclusive access to it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | Merge tag 'hardening-v6.2-rc1' of ↵Linus Torvalds2022-12-141-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook) - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook) - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking - Improve code generation for strscpy() and update str*() kern-doc - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests - Always use a non-NULL argument for prepare_kernel_cred() - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell) - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li) - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu) - Fix um vs FORTIFY warnings for always-NULL arguments * tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits) ksmbd: replace one-element arrays with flexible-array members hpet: Replace one-element array with flexible-array member um: virt-pci: Avoid GCC non-NULL warning signal: Initialize the info in ksignal lib: fortify_kunit: build without structleak plugin panic: Expose "warn_count" to sysfs panic: Introduce warn_limit panic: Consolidate open-coded panic_on_warn checks exit: Allow oops_limit to be disabled exit: Expose "oops_count" to sysfs exit: Put an upper limit on how often we can oops panic: Separate sysctl logic from CONFIG_SMP mm/pgtable: Fix multiple -Wstringop-overflow warnings mm: Make ksize() a reporting-only function kunit/fortify: Validate __alloc_size attribute results drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid() drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid() driver core: Add __alloc_size hint to devm allocators overflow: Introduce overflows_type() and castable_to_type() coredump: Proactively round up to kmalloc bucket size ...
| * | cred: Do not default to init_cred in prepare_kernel_cred()Kees Cook2022-11-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A common exploit pattern for ROP attacks is to abuse prepare_kernel_cred() in order to construct escalated privileges[1]. Instead of providing a short-hand argument (NULL) to the "daemon" argument to indicate using init_cred as the base cred, require that "daemon" is always set to an actual task. Replace all existing callers that were passing NULL with &init_task. Future attacks will need to have sufficiently powerful read/write primitives to have found an appropriately privileged task and written it to the ROP stack as an argument to succeed, which is similarly difficult to the prior effort needed to escalate privileges before struct cred existed: locate the current cred and overwrite the uid member. This has the added benefit of meaning that prepare_kernel_cred() can no longer exceed the privileges of the init task, which may have changed from the original init_cred (e.g. dropping capabilities from the bounding set). [1] https://google.com/search?q=commit_creds(prepare_kernel_cred(0)) Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Howells <dhowells@redhat.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Steve French <sfrench@samba.org> Cc: Ronnie Sahlberg <lsahlber@redhat.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: linux-nfs@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Russ Weight <russell.h.weight@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Link: https://lore.kernel.org/r/20221026232943.never.775-kees@kernel.org
* | | Merge tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2022-12-1327-730/+1173
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Chuck Lever: "This release introduces support for the CB_RECALL_ANY operation. NFSD can send this operation to request that clients return any delegations they choose. The server uses this operation to handle low memory scenarios or indicate to a client when that client has reached the maximum number of delegations the server supports. The NFSv4.2 READ_PLUS operation has been simplified temporarily whilst support for sparse files in local filesystems and the VFS is improved. Two major data structure fixes appear in this release: - The nfs4_file hash table is replaced with a resizable hash table to reduce the latency of NFSv4 OPEN operations. - Reference counting in the NFSD filecache has been hardened against races. In furtherance of removing support for NFSv2 in a subsequent kernel release, a new Kconfig option enables server-side support for NFSv2 to be left out of a kernel build. MAINTAINERS has been updated to indicate that changes to fs/exportfs should go through the NFSD tree" * tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (49 commits) NFSD: Avoid clashing function prototypes SUNRPC: Fix crasher in unwrap_integ_data() SUNRPC: Make the svc_authenticate tracepoint conditional NFSD: Use only RQ_DROPME to signal the need to drop a reply SUNRPC: Clean up xdr_write_pages() SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails NFSD: add CB_RECALL_ANY tracepoints NFSD: add delegation reaper to react to low memory condition NFSD: add support for sending CB_RECALL_ANY NFSD: refactoring courtesy_client_reaper to a generic low memory shrinker trace: Relocate event helper files NFSD: pass range end to vfs_fsync_range() instead of count lockd: fix file selection in nlmsvc_cancel_blocked lockd: ensure we use the correct file descriptor when unlocking lockd: set missing fl_flags field when retrieving args NFSD: Use struct_size() helper in alloc_session() nfsd: return error if nfs4_setacl fails lockd: set other missing fields when unlocking files NFSD: Add an nfsd_file_fsync tracepoint sunrpc: svc: Remove an unused static function svc_ungetu32() ...
| * | NFSD: Avoid clashing function prototypesKees Cook2022-12-101-255/+377
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When built with Control Flow Integrity, function prototypes between caller and function declaration must match. These mismatches are visible at compile time with the new -Wcast-function-type-strict in Clang[1]. There were 97 warnings produced by NFS. For example: fs/nfsd/nfs4xdr.c:2228:17: warning: cast from '__be32 (*)(struct nfsd4_compoundargs *, struct nfsd4_access *)' (aka 'unsigned int (*)(struct nfsd4_compoundargs *, struct nfsd4_access *)') to 'nfsd4_dec' (aka 'unsigned int (*)(struct nfsd4_compoundargs *, void *)') converts to incompatible function type [-Wcast-function-type-strict] [OP_ACCESS] = (nfsd4_dec)nfsd4_decode_access, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The enc/dec callbacks were defined as passing "void *" as the second argument, but were being implicitly cast to a new type. Replace the argument with union nfsd4_op_u, and perform explicit member selection in the function body. There are no resulting binary differences. Changes were made mechanically using the following Coccinelle script, with minor by-hand fixes for members that didn't already match their existing argument name: @find@ identifier func; type T, opsT; identifier ops, N; @@ opsT ops[] = { [N] = (T) func, }; @already_void@ identifier find.func; identifier name; @@ func(..., -void +union nfsd4_op_u *name) { ... } @proto depends on !already_void@ identifier find.func; type T; identifier name; position p; @@ func@p(..., T name ) { ... } @script:python get_member@ type_name << proto.T; member; @@ coccinelle.member = cocci.make_ident(type_name.split("_", 1)[1].split(' ',1)[0]) @convert@ identifier find.func; type proto.T; identifier proto.name; position proto.p; identifier get_member.member; @@ func@p(..., - T name + union nfsd4_op_u *u ) { + T name = &u->member; ... } @cast@ identifier find.func; type T, opsT; identifier ops, N; @@ opsT ops[] = { [N] = - (T) func, }; Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: linux-nfs@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Use only RQ_DROPME to signal the need to drop a replyChuck Lever2022-12-102-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: NFSv2 has the only two usages of rpc_drop_reply in the NFSD code base. Since NFSv2 is going away at some point, replace these in order to simplify the "drop this reply?" check in nfsd_dispatch(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: add CB_RECALL_ANY tracepointsDai Ngo2022-12-102-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | Add tracepoints to trace start and end of CB_RECALL_ANY operation. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> [ cel: added show_rca_mask() macro ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: add delegation reaper to react to low memory conditionDai Ngo2022-12-102-4/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The delegation reaper is called by nfsd memory shrinker's on the 'count' callback. It scans the client list and sends the courtesy CB_RECALL_ANY to the clients that hold delegations. To avoid flooding the clients with CB_RECALL_ANY requests, the delegation reaper sends only one CB_RECALL_ANY request to each client per 5 seconds. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> [ cel: moved definition of RCA4_TYPE_MASK_RDATA_DLG ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: add support for sending CB_RECALL_ANYDai Ngo2022-12-104-0/+84
| | | | | | | | | | | | | | | | | | | | | Add XDR encode and decode function for CB_RECALL_ANY. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: refactoring courtesy_client_reaper to a generic low memory shrinkerDai Ngo2022-12-101-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | Refactoring courtesy_client_reaper to generic low memory shrinker so it can be used for other purposes. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: pass range end to vfs_fsync_range() instead of countBrian Foster2022-12-101-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | _nfsd_copy_file_range() calls vfs_fsync_range() with an offset and count (bytes written), but the former wants the start and end bytes of the range to sync. Fix it up. Fixes: eac0b17a77fb ("NFSD add vfs_fsync after async copy is done") Signed-off-by: Brian Foster <bfoster@redhat.com> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Use struct_size() helper in alloc_session()Xiu Jianfeng2022-12-101-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | Use struct_size() helper to simplify the code, no functional changes. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: return error if nfs4_setacl failsJeff Layton2022-12-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the addition of POSIX ACLs to struct nfsd_attrs, we no longer return an error if setting the ACL fails. Ensure we return the na_aclerr error on SETATTR if there is one. Fixes: c0cbe70742f4 ("NFSD: add posix ACLs to struct nfsd_attrs") Cc: Neil Brown <neilb@suse.de> Reported-by: Yongcheng Yang <yoyang@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Add an nfsd_file_fsync tracepointChuck Lever2022-12-102-1/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tracepoint to capture the number of filecache-triggered fsync calls and which files needed it. Also, record when an fsync triggers a write verifier reset. Examples: <...>-97 [007] 262.505611: nfsd_file_free: inode=0xffff888171e08140 ref=0 flags=GC may=WRITE nf_file=0xffff8881373d2400 <...>-97 [007] 262.505612: nfsd_file_fsync: inode=0xffff888171e08140 ref=0 flags=GC may=WRITE nf_file=0xffff8881373d2400 ret=0 <...>-97 [007] 262.505623: nfsd_file_free: inode=0xffff888171e08dc0 ref=0 flags=GC may=WRITE nf_file=0xffff8881373d1e00 <...>-97 [007] 262.505624: nfsd_file_fsync: inode=0xffff888171e08dc0 ref=0 flags=GC may=WRITE nf_file=0xffff8881373d1e00 ret=0 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | nfsd: fix up the filecache laundrette schedulingJeff Layton2022-12-101-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't really care whether there are hashed entries when it comes to scheduling the laundrette. They might all be non-gc entries, after all. We only want to schedule it if there are entries on the LRU. Switch to using list_lru_count, and move the check into nfsd_file_gc_worker. The other callsite in nfsd_file_put doesn't need to count entries, since it only schedules the laundrette after adding an entry to the LRU. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: reorganize filecache.cJeff Layton2022-11-282-57/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a coming patch, we're going to rework how the filecache refcounting works. Move some code around in the function to reduce the churn in the later patches, and rename some of the functions with (hopefully) clearer names: nfsd_file_flush becomes nfsd_file_fsync, and nfsd_file_unhash_and_dispose is renamed to nfsd_file_unhash_and_queue. Also, the nfsd_file_put_final tracepoint is renamed to nfsd_file_free, to better match the name of the function from which it's called. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: remove the pages_flushed statistic from filecacheJeff Layton2022-11-281-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | We're counting mapping->nrpages, but not all of those are necessarily dirty. We don't really have a simple way to count just the dirty pages, so just remove this stat since it's not accurate. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Fix licensing header in filecache.cChuck Lever2022-11-281-1/+2
| | | | | | | | | | | | | | | | | | | | | Add a missing SPDX header. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Use rhashtable for managing nfs4_file objectsChuck Lever2022-11-282-39/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fh_match() is costly, especially when filehandles are large (as is the case for NFSv4). It needs to be used sparingly when searching data structures. Unfortunately, with common workloads, I see multiple thousands of objects stored in file_hashtbl[], which has just 256 buckets, making its bucket hash chains quite lengthy. Walking long hash chains with the state_lock held blocks other activity that needs that lock. Sizable hash chains are a common occurrance once the server has handed out some delegations, for example -- IIUC, each delegated file is held open on the server by an nfs4_file object. To help mitigate the cost of searching with fh_match(), replace the nfs4_file hash table with an rhashtable, which can dynamically resize its bucket array to minimize hash chain length. The result of this modification is an improvement in the latency of NFSv4 operations, and the reduction of nfsd CPU utilization due to eliminating the cost of multiple calls to fh_match() and reducing the CPU cache misses incurred while walking long hash chains in the nfs4_file hash table. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Refactor find_file()Chuck Lever2022-11-281-21/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | find_file() is now the only caller of find_file_locked(), so just fold these two together. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Clean up find_or_add_file()Chuck Lever2022-11-281-36/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the call to find_file_locked() in insert_nfs4_file(). Tracing shows that over 99% of these calls return NULL. Thus it is not worth the expense of the extra bucket list traversal. insert_file() already deals correctly with the case where the item is already in the hash bucket. Since nfsd4_file_hash_insert() is now just a wrapper around insert_file(), move the meat of insert_file() into nfsd4_file_hash_insert() and get rid of it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de>
| * | NFSD: Add a nfsd4_file_hash_remove() helperChuck Lever2022-11-281-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor to relocate hash deletion operation to a helper function that is close to most other nfs4_file data structure operations. The "noinline" annotation will become useful in a moment when the hlist_del_rcu() is replaced with a more complex rhash remove operation. It also guarantees that hash remove operations can be traced with "-p function -l remove_nfs4_file_locked". This also simplifies the organization of forward declarations: the to-be-added rhashtable and its param structure will be defined /after/ put_nfs4_file(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Clean up nfsd4_init_file()Chuck Lever2022-11-281-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Name this function more consistently. I'm going to use nfsd4_file_ and nfsd4_file_hash_ for these helpers. Change the @fh parameter to be const pointer for better type safety. Finally, move the hash insertion operation to the caller. This is typical for most other "init_object" type helpers, and it is where most of the other nfs4_file hash table operations are located. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Update file_hashtbl() helpersChuck Lever2022-11-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Enable callers to use const pointers for type safety. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Use const pointers as parameters to fh_ helpersChuck Lever2022-11-281-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Enable callers to use const pointers where they are able to. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de>
| * | NFSD: Trace delegation revocationsChuck Lever2022-11-282-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Delegation revocation is an exceptional event that is not otherwise visible externally (eg, no network traffic is emitted). Generate a trace record when it occurs so that revocation can be observed or other activity can be triggered. Example: nfsd-1104 [005] 1912.002544: nfsd_stid_revoke: client 633c9343:4e82788d stateid 00000003:00000001 ref=2 type=DELEG Trace infrastructure is provided for subsequent additional tracing related to nfs4_stid activity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Trace stateids returned via DELEGRETURNChuck Lever2022-11-282-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Handing out a delegation stateid is recorded with the nfsd_deleg_read tracepoint, but there isn't a matching tracepoint for recording when the stateid is returned. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Clean up nfs4_preprocess_stateid_op() call sitesChuck Lever2022-11-281-24/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the lame-duck dprintk()s around nfs4_preprocess_stateid_op() call sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de>
| * | NFSD: Flesh out a documenting comment for filecache.cChuck Lever2022-11-281-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | Record what we've learned recently about the NFSD filecache in a documenting comment so our future selves don't forget what all this is for. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collectionChuck Lever2022-11-285-13/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFSv4 operations manage the lifetime of nfsd_file items they use by means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be garbage collected. Introduce a mechanism to enable garbage collection for nfsd_file items used only by NFSv2/3 callers. Note that the change in nfsd_file_put() ensures that both CLOSE and DELEGRETURN will actually close out and free an nfsd_file on last reference of a non-garbage-collected file. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394 Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * | NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"Chuck Lever2022-11-283-21/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 5e138c4a750dc140d881dab4a8804b094bbc08d2. That commit attempted to make files available to other users as soon as all NFSv4 clients were done with them, rather than waiting until the filecache LRU had garbage collected them. It gets the reference counting wrong, for one thing. But it also misses that DELEGRETURN should release a file in the same fashion. In fact, any nfsd_file_put() on an file held open by an NFSv4 client needs potentially to release the file immediately... Clear the way for implementing that idea. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de>
| * | NFSD: Pass the target nfsd_file to nfsd_commit()Chuck Lever2022-11-284-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a moment I'm going to introduce separate nfsd_file types, one of which is garbage-collected; the other, not. The garbage-collected variety is to be used by NFSv2 and v3, and the non-garbage-collected variety is to be used by NFSv4. nfsd_commit() is invoked by both NFSv3 and NFSv4 consumers. We want nfsd_commit() to find and use the correct variety of cached nfsd_file object for the NFS version that is in use. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de>
| * | nfsd: don't call nfsd_file_put from client states seqfile displayJeff Layton2022-11-281-18/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had a report of this: BUG: sleeping function called from invalid context at fs/nfsd/filecache.c:440 ...with a stack trace showing nfsd_file_put being called from nfs4_show_open. This code has always tried to call fput while holding a spinlock, but we recently changed this to use the filecache, and that started triggering the might_sleep() in nfsd_file_put. states_start takes and holds the cl_lock while iterating over the client's states, and we can't sleep with that held. Have the various nfs4_show_* functions instead hold the fi_lock instead of taking a nfsd_file reference. Fixes: 78599c42ae3c ("nfsd4: add file to display list of client's opens") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2138357 Reported-by: Zhi Li <yieli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: allow disabling NFSv2 at compile timeJeff Layton2022-11-285-8/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rpc.nfsd stopped supporting NFSv2 a year ago. Take the next logical step toward deprecating it and allow NFSv2 support to be compiled out. Add a new CONFIG_NFSD_V2 option that can be turned off and rework the CONFIG_NFSD_V?_ACL option dependencies. Add a description that discourages enabling it. Also, change the description of CONFIG_NFSD to state that the always-on version is now 3 instead of 2. Finally, add an #ifdef around "case 2:" in __write_versions. When NFSv2 is disabled at compile time, this should make the kernel ignore attempts to disable it at runtime, but still error out when trying to enable it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Tom Talpey <tom@talpey.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: move nfserrno() to vfs.cJeff Layton2022-11-288-63/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfserrno() is common to all nfs versions, but nfsproc.c is specifically for NFSv2. Move it to vfs.c, and the prototype to vfs.h. While we're in here, remove the #ifdef EDQUOT check in this function. It's apparently a holdover from the initial merge of the nfsd code in 1997. No other place in the kernel checks that that symbol is defined before using it, so I think we can dispense with it here. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: ignore requests to disable unsupported versionsJeff Layton2022-11-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel currently errors out if you attempt to enable or disable a version that it doesn't recognize. Change it to ignore attempts to disable an unrecognized version. If we don't support it, then there is no harm in doing so. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Tom Talpey <tom@talpey.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Finish converting the NFSv3 GETACL result encoderChuck Lever2022-11-281-24/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some reason, the NFSv2 GETACL result encoder was fully converted to use the new nfs_stream_encode_acl(), but the NFSv3 equivalent was not similarly converted. Fixes: 20798dfe249a ("NFSD: Update the NFSv3 GETACL result encoder to use struct xdr_stream") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Finish converting the NFSv2 GETACL result encoderChuck Lever2022-11-281-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xdr_stream conversion inadvertently left some code that set the page_len of the send buffer. The XDR stream encoders should handle this automatically now. This oversight adds garbage past the end of the Reply message. Clients typically ignore the garbage, but NFSD does not need to send it, as it leaks stale memory contents onto the wire. Fixes: f8cba47344f7 ("NFSD: Update the NFSv2 GETACL result encoder to use struct xdr_stream") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Remove redundant assignment to variable host_errColin Ian King2022-11-281-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variable host_err is assigned a value that is never read, it is being re-assigned a value in every different execution path in the following switch statement. The assignment is redundant and can be removed. Cleans up clang-scan warning: warning: Value stored to 'host_err' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | NFSD: Simplify READ_PLUSAnna Schumaker2022-11-281-107/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Chuck had suggested reverting READ_PLUS so it returns a single DATA segment covering the requested read range. This prepares the server for a future "sparse read" function so support can easily be added without needing to rip out the old READ_PLUS code at the same time. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | | Merge tag 'fs.acl.rework.v6.2' of ↵Linus Torvalds2022-12-134-12/+12
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull VFS acl updates from Christian Brauner: "This contains the work that builds a dedicated vfs posix acl api. The origins of this work trace back to v5.19 but it took quite a while to understand the various filesystem specific implementations in sufficient detail and also come up with an acceptable solution. As we discussed and seen multiple times the current state of how posix acls are handled isn't nice and comes with a lot of problems: The current way of handling posix acls via the generic xattr api is error prone, hard to maintain, and type unsafe for the vfs until we call into the filesystem's dedicated get and set inode operations. It is already the case that posix acls are special-cased to death all the way through the vfs. There are an uncounted number of hacks that operate on the uapi posix acl struct instead of the dedicated vfs struct posix_acl. And the vfs must be involved in order to interpret and fixup posix acls before storing them to the backing store, caching them, reporting them to userspace, or for permission checking. Currently a range of hacks and duct tape exist to make this work. As with most things this is really no ones fault it's just something that happened over time. But the code is hard to understand and difficult to maintain and one is constantly at risk of introducing bugs and regressions when having to touch it. Instead of continuing to hack posix acls through the xattr handlers this series builds a dedicated posix acl api solely around the get and set inode operations. Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() helpers must be used in order to interact with posix acls. They operate directly on the vfs internal struct posix_acl instead of abusing the uapi posix acl struct as we currently do. In the end this removes all of the hackiness, makes the codepaths easier to maintain, and gets us type safety. This series passes the LTP and xfstests suites without any regressions. For xfstests the following combinations were tested: - xfs - ext4 - btrfs - overlayfs - overlayfs on top of idmapped mounts - orangefs - (limited) cifs There's more simplifications for posix acls that we can make in the future if the basic api has made it. A few implementation details: - The series makes sure to retain exactly the same security and integrity module permission checks. Especially for the integrity modules this api is a win because right now they convert the uapi posix acl struct passed to them via a void pointer into the vfs struct posix_acl format to perform permission checking on the mode. There's a new dedicated security hook for setting posix acls which passes the vfs struct posix_acl not a void pointer. Basing checking on the posix acl stored in the uapi format is really unreliable. The vfs currently hacks around directly in the uapi struct storing values that frankly the security and integrity modules can't correctly interpret as evidenced by bugs we reported and fixed in this area. It's not necessarily even their fault it's just that the format we provide to them is sub optimal. - Some filesystems like 9p and cifs need access to the dentry in order to get and set posix acls which is why they either only partially or not even at all implement get and set inode operations. For example, cifs allows setxattr() and getxattr() operations but doesn't allow permission checking based on posix acls because it can't implement a get acl inode operation. Thus, this patch series updates the set acl inode operation to take a dentry instead of an inode argument. However, for the get acl inode operation we can't do this as the old get acl method is called in e.g., generic_permission() and inode_permission(). These helpers in turn are called in various filesystem's permission inode operation. So passing a dentry argument to the old get acl inode operation would amount to passing a dentry to the permission inode operation which we shouldn't and probably can't do. So instead of extending the existing inode operation Christoph suggested to add a new one. He also requested to ensure that the get and set acl inode operation taking a dentry are consistently named. So for this version the old get acl operation is renamed to ->get_inode_acl() and a new ->get_acl() inode operation taking a dentry is added. With this we can give both 9p and cifs get and set acl inode operations and in turn remove their complex custom posix xattr handlers. In the future I hope to get rid of the inode method duplication but it isn't like we have never had this situation. Readdir is just one example. And frankly, the overall gain in type safety and the more pleasant api wise are simply too big of a benefit to not accept this duplication for a while. - We've done a full audit of every codepaths using variant of the current generic xattr api to get and set posix acls and surprisingly it isn't that many places. There's of course always a chance that we might have missed some and if so I'm sure we'll find them soon enough. The crucial codepaths to be converted are obviously stacking filesystems such as ecryptfs and overlayfs. For a list of all callers currently using generic xattr api helpers see [2] including comments whether they support posix acls or not. - The old vfs generic posix acl infrastructure doesn't obey the create and replace semantics promised on the setxattr(2) manpage. This patch series doesn't address this. It really is something we should revisit later though. The patches are roughly organized as follows: (1) Change existing set acl inode operation to take a dentry argument (Intended to be a non-functional change) (2) Rename existing get acl method (Intended to be a non-functional change) (3) Implement get and set acl inode operations for filesystems that couldn't implement one before because of the missing dentry. That's mostly 9p and cifs (Intended to be a non-functional change) (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() including security and integrity hooks (Intended to be a non-functional change) (5) Implement get and set acl inode operations for stacking filesystems (Intended to be a non-functional change) (6) Switch posix acl handling in stacking filesystems to new posix acl api now that all filesystems it can stack upon support it. (7) Switch vfs to new posix acl api (semantical change) (8) Remove all now unused helpers (9) Additional regression fixes reported after we merged this into linux-next Thanks to Seth for a lot of good discussion around this and encouragement and input from Christoph" * tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits) posix_acl: Fix the type of sentinel in get_acl orangefs: fix mode handling ovl: call posix_acl_release() after error checking evm: remove dead code in evm_inode_set_acl() cifs: check whether acl is valid early acl: make vfs_posix_acl_to_xattr() static acl: remove a slew of now unused helpers 9p: use stub posix acl handlers cifs: use stub posix acl handlers ovl: use stub posix acl handlers ecryptfs: use stub posix acl handlers evm: remove evm_xattr_acl_change() xattr: use posix acl api ovl: use posix acl api ovl: implement set acl method ovl: implement get acl method ecryptfs: implement set acl method ecryptfs: implement get acl method ksmbd: use vfs_remove_acl() acl: add vfs_remove_acl() ...
| * | | fs: rename current get acl methodChristian Brauner2022-10-203-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | fs: pass dentry to set acl methodChristian Brauner2022-10-193-6/+6
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. Since some filesystem rely on the dentry being available to them when setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode operation. But since ->set_acl() is required in order to use the generic posix acl xattr handlers filesystems that do not implement this inode operation cannot use the handler and need to implement their own dedicated posix acl handlers. Update the ->set_acl() inode method to take a dentry argument. This allows all filesystems to rely on ->set_acl(). As far as I can tell all codepaths can be switched to rely on the dentry instead of just the inode. Note that the original motivation for passing the dentry separate from the inode instead of just the dentry in the xattr handlers was because of security modules that call security_d_instantiate(). This hook is called during d_instantiate_new(), d_add(), __d_instantiate_anon(), and d_splice_alias() to initialize the inode's security context and possibly to set security.* xattrs. Since this only affects security.* xattrs this is completely irrelevant for posix acls. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | Merge tag 'pull-iov_iter' of ↵Linus Torvalds2022-12-131-2/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter work; most of that is about getting rid of direction misannotations and (hopefully) preventing more of the same for the future" * tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use less confusing names for iov_iter direction initializers iov_iter: saner checks for attempt to copy to/from iterator [xen] fix "direction" argument of iov_iter_kvec() [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [target] fix iov_iter_bvec() "direction" argument [s390] memcpy_real(): WRITE is "data source", not destination... [s390] zcore: WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [fsi] WRITE is "data source", not destination... [s390] copy_oldmem_kernel() - WRITE is "data source", not destination csum_and_copy_to_iter(): handle ITER_DISCARD get rid of unlikely() on page_copy_sane() calls