summaryrefslogtreecommitdiffstats
path: root/fs/ceph/dir.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds2024-09-281-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "Three CephFS fixes from Xiubo and Luis and a bunch of assorted cleanups" * tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client: ceph: remove the incorrect Fw reference check when dirtying pages ceph: Remove empty definition in header file ceph: Fix typo in the comment ceph: fix a memory leak on cap_auths in MDS client ceph: flush all caps releases when syncing the whole filesystem ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases() libceph: use min() to simplify code in ceph_dns_resolve_name() ceph: Convert to use jiffies macro ceph: Remove unused declarations
| * ceph: Fix typo in the commentYan Zhen2024-09-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Correctly spelled comments make it easier for the reader to understand the code. replace 'tagert' with 'target' in the comment & replace 'vaild' with 'valid' in the comment & replace 'carefull' with 'careful' in the comment & replace 'trsaverse' with 'traverse' in the comment. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | ceph: remove unused f_versionChristian Brauner2024-09-091-1/+0
|/ | | | | | | | | It's not used for ceph so don't bother with it at all. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-3-6d3e4816aa7b@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds2024-07-261-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "A small patchset to address bogus I/O errors and ultimately an assertion failure in the face of watch errors with -o exclusive mappings in RBD marked for stable and some assorted CephFS fixes" * tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client: rbd: don't assume rbd_is_lock_owner() for exclusive mappings rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait ceph: fix incorrect kmalloc size of pagevec mempool ceph: periodically flush the cap releases ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch() ceph: use cap_wait_list only if debugfs is enabled
| * ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()Chen Ni2024-07-231-1/+1
| | | | | | | | | | | | | | | | Replace a comma between expression statements by a semicolon. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | ceph: drop usage of page_indexKairui Song2024-07-041-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | page_index is needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use page->index instead. It can't be a swap cache page here, so just drop it. Link: https://lkml.kernel.org/r/20240521175854.96038-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* ceph: check the cephx mds auth access for async diropXiubo Li2024-05-231-0/+28
| | | | | | | | | Before doing the op locally we need to check the cephx access. Link: https://tracker.ceph.com/issues/61333 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Merge tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds2024-01-191-8/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "Assorted CephFS fixes and cleanups with nothing standing out" * tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client: ceph: get rid of passing callbacks in __dentry_leases_walk() ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing ceph: fix invalid pointer access if get_quota_realm return ERR_PTR ceph: remove duplicated code in ceph_netfs_issue_read() ceph: send oldest_client_tid when renewing caps ceph: rename create_session_open_msg() to create_session_full_msg() ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION ceph: fix deadlock or deadcode of misusing dget() ceph: try to allocate a smaller extent map for sparse read libceph: remove MAX_EXTENTS check for sparse reads ceph: reinitialize mds feature bit even when session in open ceph: skip reconnecting if MDS is not ready
| * ceph: get rid of passing callbacks in __dentry_leases_walk()Al Viro2024-01-151-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __dentry_leases_walk() gets a callback and calls it for a bunch of denties; there are exactly two callers and we already have a flag telling them apart - lwc->dir_lease. Seeing that indirect calls are costly these days, let's get rid of the callback and just call the right function directly. Has a side benefit of saner signatures... [ xiubli: a minor fix in the commit title ] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | dentry: switch the lists of children to hlistAl Viro2023-11-251-1/+1
|/ | | | | | | | | | | | Saves a pointer per struct dentry and actually makes the things less clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use of hlist_for_... iterators. A couple of new helpers - d_first_child() and d_next_sibling(), to make the expressions less awful. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ceph: pass an idmapping to mknod/symlink/mkdirChristian Brauner2023-11-031-0/+4
| | | | | | | | | | Enable mknod/symlink/mkdir iops to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: print cluster fsid and client global_id in all debug logsXiubo Li2023-11-031-86/+132
| | | | | | | | | | | | | | | | | Multiple CephFS mounts on a host is increasingly common so disambiguating messages like this is necessary and will make it easier to debug issues. At the same this will improve the debug logs to make them easier to troubleshooting issues, such as print the ino# instead only printing the memory addresses of the corresponding inodes and print the dentry names instead of the corresponding memory addresses for the dentry,etc. Link: https://tracker.ceph.com/issues/61590 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: rename _to_client() to _to_fs_client()Xiubo Li2023-11-031-11/+11
| | | | | | | | | | | | We need to covert the inode to ceph_client in the following commit, and will add one new helper for that, here we rename the old helper to _fs_client(). Link: https://tracker.ceph.com/issues/61590 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: pass the mdsc to several helpersXiubo Li2023-11-031-1/+1
| | | | | | | | | | We will use the 'mdsc' to get the global_id in the following commits. Link: https://tracker.ceph.com/issues/61590 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: switch ceph_lookup/atomic_open() to use new fscrypt helperLuís Henriques2023-08-241-6/+7
| | | | | | | | | | | | | | | Instead of setting the no-key dentry, use the new fscrypt_prepare_lookup_partial() helper. We still need to mark the directory as incomplete if the directory was just unlocked. In ceph_atomic_open() this fixes a bug where a dentry is incorrectly set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is still available (for example, where there's a drop_caches). Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: prevent snapshot creation in encrypted locked directoriesLuís Henriques2023-08-241-0/+5
| | | | | | | | | | | | With snapshot names encryption we can not allow snapshots to be created in locked directories because the names wouldn't be encrypted. This patch forces the directory to be unlocked to allow a snapshot to be created. Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: size handling in MClientRequest, cap updates and inode tracesJeff Layton2023-08-241-0/+3
| | | | | | | | | | | | | | | | For encrypted inodes, transmit a rounded-up size to the MDS as the normal file size and send the real inode size in fscrypt_file field. Also, fix up creates and truncates to also transmit fscrypt_file. When we get an inode trace from the MDS, grab the fscrypt_file field if the inode is encrypted, and use it to populate the i_size field instead of the regular inode size field. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: mark directory as non-complete after loading keyLuís Henriques2023-08-241-4/+4
| | | | | | | | | | | | | | | | When setting a directory's crypt context, ceph_dir_clear_complete() needs to be called otherwise if it was complete before, any existing (old) dentry will still be valid. This patch adds a wrapper around __fscrypt_prepare_readdir() which will ensure a directory is marked as non-complete if key status changes. [ xiubli: revise commit title per Milind ] Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add some fscrypt guardrailsJeff Layton2023-08-241-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | Add the appropriate calls into fscrypt for various actions, including link, rename, setattr, and the open codepaths. Disable fallocate for encrypted inodes -- hopefully, just for now. If we have an encrypted inode, then the client will need to re-encrypt the contents of the new object. Disable copy offload to or from encrypted inodes. Set i_blkbits to crypto block size for encrypted inodes -- some of the underlying infrastructure for fscrypt relies on i_blkbits being aligned to crypto blocksize. Report STATX_ATTR_ENCRYPTED on encrypted inodes. [ lhenriques: forbid encryption with striped layouts ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: create symlinks with encrypted and base64-encoded targetsJeff Layton2023-08-241-5/+49
| | | | | | | | | | | | | | | | When creating symlinks in encrypted directories, encrypt and base64-encode the target with the new inode's key before sending to the MDS. When filling a symlinked inode, base64-decode it into a buffer that we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer into a new one that will hang off i_link. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add support to readdir for encrypted namesXiubo Li2023-08-241-6/+30
| | | | | | | | | | | | | | | | | | | To make it simpler to decrypt names in a readdir reply (i.e. before we have a dentry), add a new ceph_encode_encrypted_fname()-like helper that takes a qstr pointer instead of a dentry pointer. Once we've decrypted the names in a readdir reply, we no longer need the crypttext, so overwrite them in ceph_mds_reply_dir_entry with the unencrypted names. Then in both ceph_readdir_prepopulate() and ceph_readdir() we will use the dencrypted name directly. [ jlayton: convert some BUG_ONs into error returns ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: make d_revalidate call fscrypt revalidator for encrypted dentriesJeff Layton2023-08-241-2/+7
| | | | | | | | | | | | If we have a dentry which represents a no-key name, then we need to test whether the parent directory's encryption key has since been added. Do that before we test anything else about the dentry. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: set DCACHE_NOKEY_NAME flag in ceph_lookup/atomic_open()Jeff Layton2023-08-241-0/+11
| | | | | | | | | | | | | | | | | This is required so that we know to invalidate these dentries when the directory is unlocked. Atomic open can act as a lookup if handed a dentry that is negative on the MDS. Ensure that we set DCACHE_NOKEY_NAME on the dentry in atomic_open, if we don't have the key for the parent. Otherwise, we can end up validating the dentry inappropriately if someone later adds a key. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: preallocate inode for ops that may create oneJeff Layton2023-08-221-32/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When creating a new inode, we need to determine the crypto context before we can transmit the RPC. The fscrypt API has a routine for getting a crypto context before a create occurs, but it requires an inode. Change the ceph code to preallocate an inode in advance of a create of any sort (open(), mknod(), symlink(), etc). Move the existing code that generates the ACL and SELinux blobs into this routine since that's mostly common across all the different codepaths. In most cases, we just want to allow ceph_fill_trace to use that inode after the reply comes in, so add a new field to the MDS request for it (r_new_inode). The async create codepath is a bit different though. In that case, we want to hash the inode in advance of the RPC so that it can be used before the reply comes in. If the call subsequently fails with -EJUKEBOX, then just put the references and clean up the as_ctx. Note that with this change, we now need to regenerate the as_ctx when this occurs, but it's quite rare for it to happen. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* vfs: get rid of old '->iterate' directory operationLinus Torvalds2023-08-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* ceph: voluntarily drop Xx caps for requests those touch parent mtimeXiubo Li2023-06-301-7/+10
| | | | | | | | | | | | | | | | | | | For write requests the parent's mtime will be updated correspondingly. And if the 'Xx' caps is issued and when releasing other caps together with the write requests the MDS Locker will try to eval the xattr lock, which need to change the locker state excl --> sync and need to do Xx caps revocation. Just voluntarily dropping CEPH_CAP_XATTR_EXCL caps to avoid a cap revoke message, which could cause the mtime will be overwrote by stale one. [ idryomov: break unnecessarily long lines ] Link: https://tracker.ceph.com/issues/61584 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: pass ino# instead of old_dentry if it's disconnectedXiubo Li2023-04-301-2/+11
| | | | | | | | | | | | | When exporting the kceph to NFS it may pass a DCACHE_DISCONNECTED dentry for the link operation. Then it will parse this dentry as a snapdir, and the mds will fail the link request as -EROFS. MDS allow clients to pass a ino# instead of a path. Link: https://tracker.ceph.com/issues/59515 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* fs: port ->rename() to pass mnt_idmapChristian Brauner2023-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* fs: port ->mknod() to pass mnt_idmapChristian Brauner2023-01-191-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* fs: port ->mkdir() to pass mnt_idmapChristian Brauner2023-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* fs: port ->symlink() to pass mnt_idmapChristian Brauner2023-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* fs: port ->create() to pass mnt_idmapChristian Brauner2023-01-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* fs: rename current get acl methodChristian Brauner2022-10-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* ceph: wait for the first reply of inflight async unlinkXiubo Li2022-08-031-9/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | In async unlink case the kclient won't wait for the first reply from MDS and just drop all the links and unhash the dentry and then succeeds immediately. For any new create/link/rename,etc requests followed by using the same file names we must wait for the first reply of the inflight unlink request, or the MDS possibly will fail these following requests with -EEXIST if the inflight async unlink request was delayed for some reasons. And the worst case is that for the none async openc request it will successfully open the file if the CDentry hasn't been unlinked yet, but later the previous delayed async unlink request will remove the CDenty. That means the just created file is possiblly deleted later by accident. We need to wait for the inflight async unlink requests to finish when creating new files/directories by using the same file names. Link: https://tracker.ceph.com/issues/55332 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix memory leak in ceph_readdir when note_last_dentry returns errorXiubo Li2022-03-211-1/+10
| | | | | | | | | Reset the last_readdir at the same time, and add a comment explaining why we don't free last_readdir when dir_emit returns false. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix comments mentioning i_mutexhongnanli2022-03-011-3/+3
| | | | | | | | | inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix comments still mentioning i_mutex. Signed-off-by: hongnanli <hongnan.li@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: take reference to req->r_parent at point of assignmentJeff Layton2021-06-291-0/+9
| | | | | | | | | | | | | Currently, we set the r_parent pointer but then don't take a reference to it until we submit the request. If we end up freeing the req before that point, then we'll do a iput when we shouldn't. Instead, take the inode reference in the callers, so that it's always safe to call ceph_mdsc_put_request on the req, even before submission. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: eliminate session->s_gen_ttl_lockJeff Layton2021-06-291-3/+1
| | | | | | | | | Turn s_cap_gen field into an atomic_t, and just rely on the fact that we hold the s_mutex when changing the s_cap_ttl field. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: allow ceph_put_mds_session to take NULL or ERR_PTRJeff Layton2021-06-291-2/+1
| | | | | | | | ...to simplify some error paths. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix error handling in ceph_atomic_open and ceph_lookupJeff Layton2021-06-221-10/+12
| | | | | | | | | | | | | | Commit aa60cfc3f7ee broke the error handling in these functions such that they don't handle non-ENOENT errors from ceph_mdsc_do_request properly. Move the checking of -ENOENT out of ceph_handle_snapdir and into the callers, and if we get a different error, return it immediately. Fixes: aa60cfc3f7ee ("ceph: don't use d_add in ceph_handle_snapdir") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: don't use d_add in ceph_handle_snapdirJeff Layton2021-04-271-11/+19
| | | | | | | | | | | It's possible ceph_get_snapdir could end up finding a (disconnected) inode that already exists in the cache. Change the prototype for ceph_handle_snapdir to return a dentry pointer and have it use d_splice_alias so we don't end up with an aliased dentry in the cache. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix fall-through warnings for ClangGustavo A. R. Silva2021-04-271-0/+2
| | | | | | | | | | | In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple of warnings by explicitly adding a break and a goto statements instead of just letting the code fall through to the next case. URL: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* fs: make helpers idmap mount awareChristian Brauner2021-01-241-11/+12
| | | | | | | | | | | | | | | | | | | | | Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* Revert "ceph: allow rename operation under different quota realms"Luis Henriques2020-12-141-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit dffdcd71458e699e839f0bf47c3d42d64210b939. When doing a rename across quota realms, there's a corner case that isn't handled correctly. Here's a testcase: mkdir files limit truncate files/file -s 10G setfattr limit -n ceph.quota.max_bytes -v 1000000 mv files limit/ The above will succeed because ftruncate(2) won't immediately notify the MDSs with the new file size, and thus the quota realms stats won't be updated. Since the possible fixes for this issue would have a huge performance impact, the solution for now is to simply revert to returning -EXDEV when doing a cross quota realms rename. URL: https://tracker.ceph.com/issues/48203 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add ceph_sb_to_mdsc helper support to parse the mdscXiubo Li2020-10-121-13/+7
| | | | | | | | | | This will help simplify the code. [ jlayton: fix minor merge conflict in quota.c ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Merge tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-clientLinus Torvalds2020-08-281-18/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fixes from Ilya Dryomov: "We have an inode number handling change, prompted by s390x which is a 64-bit architecture with a 32-bit ino_t, a patch to disallow leases to avoid potential data integrity issues when CephFS is re-exported via NFS or CIFS and a fix for the bulk of W=1 compilation warnings" * tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client: ceph: don't allow setlease on cephfs ceph: fix inode number handling on arches with 32-bit ino_t libceph: add __maybe_unused to DEFINE_CEPH_FEATURE
| * ceph: fix inode number handling on arches with 32-bit ino_tJeff Layton2020-08-241-18/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tuan and Ulrich mentioned that they were hitting a problem on s390x, which has a 32-bit ino_t value, even though it's a 64-bit arch (for historical reasons). I think the current handling of inode numbers in the ceph driver is wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's actually not a problem. 32-bit arches can deal with 64-bit inode numbers just fine when userland code is compiled with LFS support (the common case these days). What we really want to do is just use 64-bit numbers everywhere, unless someone has mounted with the ino32 mount option. In that case, we want to ensure that we hash the inode number down to something that will fit in 32 bits before presenting the value to userland. Add new helper functions that do this, and only do the conversion before presenting these values to userland in getattr and readdir. The inode table hashvalue is changed to just cast the inode number to unsigned long, as low-order bits are the most likely to vary anyway. While it's not strictly required, we do want to put something in inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on the size of the ino_t type. NOTE: This is a user-visible change on 32-bit arches: 1/ inode numbers will be seen to have changed between kernel versions. 32-bit arches will see large inode numbers now instead of the hashed ones they saw before. 2/ any really old software not built with LFS support may start failing stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we can do about these, but hopefully the intersection of people running such code on ceph will be very small. The workaround for both problems is to mount with "-o ino32". [ idryomov: changelog tweak ] URL: https://tracker.ceph.com/issues/46828 Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com> Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva2020-08-241-1/+1
|/ | | | | | | | | | Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* ceph: set sec_context xattr on symlink creationJeff Layton2020-08-041-0/+4
| | | | | | | | | | | | | Symlink inodes should have the security context set in their xattrs on creation. We already set the context on creation, but we don't attach the pagelist. The effect is that symlink inodes don't get an SELinux context set on them at creation, so they end up unlabeled instead of inheriting the proper context. Make it do so. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: allow rename operation under different quota realmsLuis Henriques2020-06-011-4/+5
| | | | | | | | | | | | | | | | | | | | | Returning -EXDEV when trying to 'mv' files/directories from different quota realms results in copy+unlink operations instead of the faster CEPH_MDS_OP_RENAME. This will occur even when there aren't any quotas set in the destination directory, or if there's enough space left for the new file(s). This patch adds a new helper function to be called on rename operations which will allow these operations if they can be executed. This patch mimics userland fuse client commit b8954e5734b3 ("client: optimize rename operation under different quota root"). Since ceph_quota_is_same_realm() is now called only from this new helper, make it static. URL: https://tracker.ceph.com/issues/44791 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>