summaryrefslogtreecommitdiffstats
path: root/fs/ceph/caps.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* ceph: fix null pointer dereference in ceph_flush_snaps()Yan, Zheng2016-08-081-1/+4
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: optimize cap flush waitingYan, Zheng2016-07-281-27/+64
| | | | | | | | | | | | Add a 'wake' flag to ceph_cap_flush struct, which indicates if there is someone waiting for it to finish. When getting flush ack message, we check the 'wake' flag in corresponding ceph_cap_flush struct to decide if we should wake up waiters. One corner case is that the acked cap flush has 'wake' flags is set, but it is not the first one on the flushing list. We do not wake up waiters in this case, set 'wake' flags of preceding ceph_cap_flush struct instead Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: cleanup ceph_flush_snaps()Yan, Zheng2016-07-281-83/+102
| | | | | | | | | | | | | | This patch devide __ceph_flush_snaps() into two stags. In the first stage, __ceph_flush_snaps() assign snapcaps flush TIDs and add them to cap flush lists. __ceph_flush_snaps() keeps holding the i_ceph_lock in this stagge. So inode's auth cap can not change. In the second stage, __ceph_flush_snaps() send flushsnap cap messages. i_ceph_lock is unlocked before sending each cap message. If auth cap changes in the middle, __ceph_flush_snaps() just stops. This is OK because kick_flushing_inode_caps() will re-send flushsnap cap messages to inode's new auth MDS. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: kick cap flushes before sending other cap messageYan, Zheng2016-07-281-9/+34
| | | | | | | If ceph_check_caps() wants to send cap message to a recovering MDS, make sure it kicks cap flushes first. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: introduce an inode flag to indicates if snapflush is neededYan, Zheng2016-07-281-15/+33
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: avoid sending duplicated cap flush messageYan, Zheng2016-07-281-1/+17
| | | | | | | make ceph_kick_flushing_caps() ignore inodes whose cap flushes have already been re-sent by ceph_early_kick_flushing_caps() Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: unify cap flush and snapcap flushYan, Zheng2016-07-281-135/+156
| | | | | | | | | | | | This patch includes following changes - Assign flush tid to snapcap flush - Remove session's s_cap_snaps_flushing list. Add inode to session's s_cap_flushing list instead. Inode is removed from the list when there is no pending snapcap flush or cap flush. - make __kick_flushing_caps() re-send both snapcap flushes and cap flushes. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: use list instead of rbtree to track cap flushesYan, Zheng2016-07-281-89/+31
| | | | | | | | We don't have requirement of searching cap flush by TID. In most cases, we just need to know TID of the oldest cap flush. List is ideal for this usage. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: update types of some local variblesYan, Zheng2016-07-281-11/+12
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: wait unsafe sync writes for evicting inodeYan, Zheng2016-07-281-48/+2
| | | | | | Otherwise ceph_sync_write_unsafe() may access/modify freed inode. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: reduce i_nr_by_mode array sizeYan, Zheng2016-07-281-14/+31
| | | | | | | Track usage count for individual fmode bit. This can reduce the array size by half. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: rados pool namespace supportYan, Zheng2016-07-281-20/+26
| | | | | | | | This patch adds codes that decode pool namespace information in cap message and request reply. Pool namespace is saved in i_layout, it will be passed to libceph when doing read/write. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: define new ceph_file_layout structureYan, Zheng2016-07-281-1/+4
| | | | | | | | Define new ceph_file_layout structure and rename old ceph_file_layout to ceph_file_layout_legacy. This is preparation for adding namespace to ceph_file_layout structure. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: improve fscache revalidationYan, Zheng2016-06-011-10/+6
| | | | | | | | | | | | | | | | | | | | | | | There are several issues in fscache revalidation code. - In ceph_revalidate_work(), fscache_invalidate() is called when fscache_check_consistency() return 0. This is complete wrong because 0 means cache is valid. - Handle_cap_grant() calls ceph_queue_revalidate() if client already has CAP_FILE_CACHE. This code is confusing. Client should revalidate the cache each time it got CAP_FILE_CACHE anew. - In Handle_cap_grant(), fscache_invalidate() is called if MDS revokes CAP_FILE_CACHE. This is inconsistency with the case that inode get evicted. In the later case, the cache is not discarded. Client may use the cache when inode is reloaded. This patch moves the fscache revalidation into ceph_get_caps(). Client revalidates the cache after it gets CAP_FILE_CACHE. i_rdcache_gen should keep constance while CAP_FILE_CACHE is used. If i_fscache_gen is not equal to i_rdcache_gen, client needs to check cache's consistency. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: avoid unnecessary fscache invalidation/revlidationYan, Zheng2016-06-011-6/+3
| | | | | | | ceph_fill_file_size() has already called ceph_fscache_invalidate() if it return true. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't use truncate_pagecache() to invalidate read cacheYan, Zheng2016-05-261-4/+4
| | | | | | | | | truncate_pagecache() drops dirty pages, it's dangerous to use it to invalidate read cache. Besides, we shouldn't start invalidating read cache while there are buffer writers. Because buffer writers may add dirty pages later. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: renew caps for read/write if mds session got killed.Yan, Zheng2016-05-261-10/+33
| | | | | | | | | When mds session gets killed, read/write operation may hang. Client waits for Frw caps, but mds does not know what caps client wants. To recover this, client sends an open request to mds. The request will tell mds what caps client wants. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov2016-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ceph: encode ctime in cap messageYan, Zheng2016-03-251-4/+7
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 supportYan, Zheng2016-03-041-3/+24
| | | | | | | | Add support for the format change of MClientReply/MclientCaps. Also add code that denies access to inodes with pool_ns layouts. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* wrappers for ->i_mutex accessAl Viro2016-01-231-2/+2
| | | | | | | | | | | parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ceph: make fsync() wait unsafe requests that created/modified inodeYan, Zheng2015-11-021-37/+34
| | | | | | | | If we get a unsafe reply for request that created/modified inode, add the unsafe request to a list in the newly created/modified inode. So we can make fsync() wait these unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't invalidate page cache when inode is no longer usedYan, Zheng2015-11-021-3/+2
| | | | | | | | ceph_check_caps() invalidate page cache when inode is not used by any open file. This behaviour is not friendly for workload that repeatly read files. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: EIO all operations after forced umountYan, Zheng2015-09-081-0/+8
| | | | | | | | | This patch makes try_get_cap_refs() and __do_request() check if the file system was forced umount, and return -EIO if it was. This patch also adds a helper function to drops dirty caps and wakes up blocking operation. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: always re-send cap flushes when MDS recoversYan, Zheng2015-07-311-17/+5
| | | | | | | | | | | | | | | commit e548e9b93d3e565e42b938a99804114565be1f81 makes the kclient only re-send cap flush once during MDS failover. If the kclient sends a cap flush after MDS enters reconnect stage but before MDS recovers. The kclient will skip re-sending the same cap flush when MDS recovers. This causes problem for newly created inode. The MDS handles cap flushes before replaying unsafe requests, so it's possible that MDS find corresponding inode is missing when handling cap flush. The fix is reverting to old behaviour: always re-send when MDS recovers Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: rework dcache readdirYan, Zheng2015-06-251-5/+9
| | | | | | | | | | | | | | | | Previously our dcache readdir code relies on that child dentries in directory dentry's d_subdir list are sorted by dentry's offset in descending order. When adding dentries to the dcache, if a dentry already exists, our readdir code moves it to head of directory dentry's d_subdir list. This design relies on dcache internals. Al Viro suggests using ncpfs's approach: keeping array of pointers to dentries in page cache of directory inode. the validity of those pointers are presented by directory inode's complete and ordered flags. When a dentry gets pruned, we clear directory inode's complete flag in the d_prune() callback. Before moving a dentry to other directory, we clear the ordered flag for both old and new directory. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: pre-allocate data structure that tracks caps flushingYan, Zheng2015-06-251-4/+22
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: re-send flushing caps (which are revoked) in reconnect stageYan, Zheng2015-06-251-4/+53
| | | | | | | | if flushing caps were revoked, we should re-send the cap flush in client reconnect stage. This guarantees that MDS processes the cap flush message before issuing the flushing caps to other client. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: send TID of the oldest pending caps flush to MDSYan, Zheng2015-06-251-18/+49
| | | | | | | According to this information, MDS can trim its completed caps flush list (which is used to detect duplicated cap flush). Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: track pending caps flushing globallyYan, Zheng2015-06-251-9/+41
| | | | | | | | | | So we know TID of the oldest pending caps flushing. Later patch will send this information to MDS, so that MDS can trim its completed caps flush list. Tracking pending caps flushing globally also simplifies syncfs code. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: track pending caps flushing accuratelyYan, Zheng2015-06-251-85/+160
| | | | | | | | | | | | | | | Previously we do not trace accurate TID for flushing caps. when MDS failovers, we have no choice but to re-send all flushing caps with a new TID. This can cause problem because MDS can has already flushed some caps and has issued the same caps to other client. The re-sent cap flush has a new TID, which makes MDS unable to detect if it has already processed the cap flush. This patch adds code to track pending caps flushing accurately. When re-sending cap flush is needed, we use its original flush TID. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: fix directory fsyncYan, Zheng2015-06-251-9/+64
| | | | | | | | fsync() on directory should flush dirty caps and wait for any uncommitted directory opertions to commit. But ceph_dir_fsync() only waits for uncommitted directory opertions. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: fix flushing capsYan, Zheng2015-06-251-24/+25
| | | | | | | | | | Current ceph_fsync() only flushes dirty caps and wait for them to be flushed. It doesn't wait for caps that has already been flushing. This patch makes ceph_fsync() wait for pending flushing caps too. Besides, this patch also makes caps_are_flushed() peroperly handle tid wrapping. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't include used caps in cap_wantedYan, Zheng2015-06-251-3/+3
| | | | | | | | | | when copying files to cephfs, file data may stay in page cache after corresponding file is closed. Cached data use Fc capability. If we include Fc capability in cap_wanted, MDS will treat files with cached data as open files, and journal them in an EOpen event when trimming log segment. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't pre-allocate space for cap release messagesYan, Zheng2015-06-251-56/+29
| | | | | | | | | | | Previously we pre-allocate cap release messages for each caps. This wastes lots of memory when there are large amount of caps. This patch make the code not pre-allocate the cap release messages. Instead, we add the corresponding ceph_cap struct to a list when releasing a cap. Later when flush cap releases is needed, we allocate the cap release messages dynamically. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: make sure syncfs flushes all cap snapsYan, Zheng2015-06-251-7/+11
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: take snap_rwsem when accessing snap realm's cached_contextYan, Zheng2015-06-251-1/+3
| | | | | | | | When ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps() accesses snap realm's cached_context. So we need take read lock of snap_rwsem. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: avoid sending unnessesary FLUSHSNAP messageYan, Zheng2015-06-251-31/+38
| | | | | | | | | when a snap notification contains no new snapshot, we can avoid sending FLUSHSNAP message to MDS. But we still need to create cap_snap in some case because it's required by write path and page writeback path Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR referenceYan, Zheng2015-06-251-46/+109
| | | | | | | | | | | | | | | | | | In most cases that snap context is needed, we are holding reference of CEPH_CAP_FILE_WR. So we can set ceph inode's i_head_snapc when getting the CEPH_CAP_FILE_WR reference, and make codes get snap context from i_head_snapc. This makes the code simpler. Another benefit of this change is that we can handle snap notification more elegantly. Especially when snap context is updated while someone else is doing write. The old queue cap_snap code may set cap_snap's context to ether the old context or the new snap context, depending on if i_head_snapc is set. The new queue capp_snap code always set cap_snap's context to the old snap context. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: check OSD caps before read/writeYan, Zheng2015-06-251-0/+4
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-04-271-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
| * VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells2015-04-151-1/+1
| | | | | | | | | | | | | | that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ceph: hold on to exclusive caps on complete directoriesYan, Zheng2015-04-201-0/+12
| | | | | | | | | | | | | | | | If a directory is complete, we want to keep the exclusive cap. So that MDS does not end up revoking the shared cap on every create/unlink operation. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: don't mark dirty caps when there is no auth capYan, Zheng2015-04-201-1/+7
| | | | | | | | | | | | | | No i_auth_cap means reconnecting to MDS was denied. So don't add new dirty caps. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: keep i_snap_realm while there are writersYan, Zheng2015-04-201-9/+22
|/ | | | | | | | | when reconnecting to MDS is denied, we remove session caps forcibly. But it's possible there are ongoing write, the write code needs to reference i_snap_realm. So if there are ongoing write, we keep i_snap_realm. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)Yan, Zheng2015-02-191-44/+42
| | | | | | | | we should not do block operation in wait_event_interruptible()'s condition check function, but reading inline data can block. so move the read inline data code to ceph_get_caps() Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)Yan, Zheng2015-02-191-1/+1
| | | | | | | check_cap_flush() calls mutex_lock(), which may block. So we can't use it as condition check function for wait_event(); Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: improve reference tracking for snaprealmYan, Zheng2015-02-191-8/+16
| | | | | | | | | | | | When snaprealm is created, its initial reference count is zero. But in some rare cases, the newly created snaprealm is not referenced by anyone. This causes snaprealm with zero reference count not freed. The fix is set reference count of newly snaprealm to 1. The reference is return the function who requests to create the snaprealm. When the function finishes its job, it releases the reference. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: handle SESSION_FORCE_RO messageYan, Zheng2015-02-191-0/+15
| | | | | | mark session as readonly and wake up all cap waiters. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: flush inline versionYan, Zheng2014-12-171-4/+20
| | | | | | | | | | After converting inline data to normal data, client need to flush the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes cap messages (sent to MDS) contain inline_version and inline_data. Client always converts inline data to normal data before data write, so the inline data length part is always zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>