path: root/fs
Commit message | Author | Age | Files | Lines
* GFS2: Rename quota qd_lru_lock qd_lock (Steven Whitehouse, 2013-11-04; 1 file, -35/+35)
|   This is a straightforward rename which is in preparation for introducing the generic list_lru infrastructure in the following patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Abhijith Das <adas@redhat.com> Tested-by: Abhijith Das <adas@redhat.com>
* GFS2: Use lockref for quota data cache (Steven Whitehouse, 2013-11-04; 2 files, -15/+29)
|   This patch adds lockref support to the quota data cache. It looks a bit strange because we still don't have a sensible split between the lookup by id and the lru list. That is coming in later patches though. The intent here is just to swap the current ref count for lockrefs in all cases, with as little other change as possible. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Abhijith Das <adas@redhat.com> Tested-by: Abhijith Das <adas@redhat.com>
* GFS2: Use lockref for glocks (Steven Whitehouse, 2013-10-15; 4 files, -49/+45)
|   Currently glocks have an atomic reference count and also a spinlock which covers various internal fields, such as the state. The intent of this patch is to replace the spinlock and the atomic reference count with a lockref structure. This contains a spinlock which we can continue to use as before, and a reference counter which is used in conjunction with the spinlock to replace the previous atomic counter. As a result of this there are some new rules for reference counting on glocks. We need to distinguish between reference count changes under gl_spin (which are now just increment or decrement of the new counter, provided the count cannot hit zero) and those which are outside of gl_spin, but which now take gl_spin internally. The conversion is relatively straightforward. There is probably some further clean up which can be done, but the priority at this stage is to make the change in as simple a manner as possible. A consequence of this change is that the reference count is being decoupled from the lru list processing. This should allow future adoption of the lru_list code with glocks in due course. The reason for using the "dead" state and not just relying on 0 being the "invalid state" is so that in due course 0 ref counts can be allowable. The intent is to eventually be able to remove the ref count changes which are currently hidden away in state_change(). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
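|
|   A minimal sketch of the pattern, assuming the <linux/lockref.h> API (lockref_put_or_lock, lockref_mark_dead); the glock field name and helper below are illustrative, not the exact patch:
|
|       #include <linux/lockref.h>
|
|       struct gfs2_glock {
|               struct lockref gl_lockref;      /* spinlock + count in one structure */
|               /* ... */
|       };
|
|       /* Drop a reference.  lockref_put_or_lock() decrements the count
|        * locklessly unless it would reach zero; in that case it returns 0
|        * with the spinlock held so the caller can do slow-path teardown. */
|       static void glock_put_sketch(struct gfs2_glock *gl)
|       {
|               if (lockref_put_or_lock(&gl->gl_lockref))
|                       return;                         /* fast path: count was > 1 */
|               /* spinlock held here, count == 1 */
|               gl->gl_lockref.count--;
|               lockref_mark_dead(&gl->gl_lockref);     /* further gets must fail */
|               spin_unlock(&gl->gl_lockref.lock);
|               /* ... unlink and free the glock ... */
|       }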
* GFS2: Protect quota sync generation (Steven Whitehouse, 2013-10-04; 3 files, -2/+6)
|   Now that gfs2_quota_sync can be potentially called from multiple threads, we should protect this bit of code, and the sync generation number in particular, in order to ensure that there are no races when syncing quotas. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Abhijith Das <adas@redhat.com>
* GFS2: Inline qd_trylock into gfs2_quota_unlock (Steven Whitehouse, 2013-10-04; 1 file, -25/+20)
|   The function qd_trylock was not a trylock despite its name, and can be inlined into gfs2_quota_unlock in order to make the code a bit clearer. There should be no functional change as a result of this patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Abhijith Das <adas@redhat.com>
* GFS2: Make two similar quota code fragments into a function (Steven Whitehouse, 2013-10-04; 1 file, -34/+26)
|   There should be no functional change, bar the removal of a test of the MS_READONLY flag which would never be reachable. This merges the common code from qd_fish and qd_trylock into a single function and calls it from both those places. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Abhijith Das <adas@redhat.com>
* GFS2: Remove obsolete quota tunable (Steven Whitehouse, 2013-10-04; 4 files, -5/+1)
|   There is no need for a parameter which relates to the internals of quota to be exposed to users. The only possible use would be to turn it up so large that the memory allocation fails. So let's remove it and set it to a sensible value which ensures that we don't ask for multipage allocations. Currently the size of struct gfs2_holder means that the calculated value is identical to the previous default value, so there should be no functional change. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Abhijith Das <adas@redhat.com>
* GFS2: Move gfs2_icbit_munge into quota.c (Steven Whitehouse, 2013-10-02; 3 files, -22/+20)
|   This function is only called twice, and both callers are quota related, so let's move this function into quota.c and make it static. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Speed up starting point selection for block allocation (Steven Whitehouse, 2013-10-02; 1 file, -10/+31)
|   When setting the starting point for block allocation, there were calls to both gfs2_rbm_to_block() and gfs2_rbm_from_block() in the common case of there being an active reservation. The gfs2_rbm_from_block() function can be quite slow, and since the two conversions were effectively a no-op, it makes sense to avoid them entirely in this case. There is no functional change here, but the code should be a bit more efficient after this patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
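|
|   A hedged before/after sketch of the no-op round trip being removed (variable names are illustrative, not the actual diff):
|
|       /* before: rbm -> block -> rbm, a slow round trip */
|       goal = gfs2_rbm_to_block(&ip->i_res->rs_rbm);
|       gfs2_rbm_from_block(&rbm, goal);
|
|       /* after: with an active reservation the rbm is already
|        * positioned, so just copy it */
|       rbm = ip->i_res->rs_rbm;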
* GFS2: Add allocation parameters structure (Steven Whitehouse, 2013-10-02; 9 files, -21/+51)
|   This patch adds a structure to contain allocation parameters with the intention of future expansion of this structure. The idea is that we should be able to add more information about the allocation in the future in order to allow the allocator to do a better job of placing the requests on-disk. There is no functional difference from applying this patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
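|
|   A sketch of the idea; the field names and the reserve call below are plausible but assumed, not quoted from the patch:
|
|       struct gfs2_alloc_parms {
|               u64 target;     /* number of blocks requested */
|               u32 aflags;     /* allocation flags */
|               /* room to grow: alignment, locality hints, ... */
|       };
|
|       /* callers pass the structure instead of a bare block count */
|       struct gfs2_alloc_parms ap = { .target = data_blocks + ind_blocks };
|       error = gfs2_inplace_reserve(ip, &ap);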
* GFS2: Clean up reservation removal (Steven Whitehouse, 2013-09-27; 6 files, -9/+10)
|   The reservation for an inode should be cleared when it is truncated, so that we can start again at a different offset for future allocations. We could try to do better than that, by resetting the search based on where the truncation started from, but this is only a first step. In addition, there are three callers of gfs2_rs_delete() but only one of those should really be testing the value of i_writecount. While we get away with that in the other cases currently, I think it would be better if we made that test specific to the one case which requires it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: fix dentry leaks (Miklos Szeredi, 2013-09-23; 1 file, -10/+16)
|   We need to dput() the result of d_splice_alias(), unless it is passed to finish_no_open(). Edited by Steven Whitehouse in order to make it apply to the current GFS2 git tree, and taking account of a prerequisite patch which hasn't been applied. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: stable@vger.kernel.org
* GFS2: new function gfs2_rbm_incr (Bob Peterson, 2013-09-18; 1 file, -3/+27)
|   Since the previous patch eliminated bi in favor of bii, this follow-on patch needed to be adjusted accordingly. Here is the revised version. This patch adds a new function, gfs2_rbm_incr, which increments an rbm structure. This is more efficient than calling gfs2_rbm_to_block, incrementing, then calling gfs2_rbm_from_block. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
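|
|   Roughly the shape of the new helper, reconstructed from the description above (the exact bounds checks in the committed code may differ):
|
|       /* Advance the rbm by one block; returns true if we ran off the
|        * end of the rgrp.  The common case is a single increment, far
|        * cheaper than a to_block/from_block round trip. */
|       static bool gfs2_rbm_incr(struct gfs2_rbm *rbm)
|       {
|               if (rbm->offset + 1 < rbm_bi(rbm)->bi_blocks) {
|                       rbm->offset++;                  /* same bitmap */
|                       return false;
|               }
|               if (rbm->bii == rbm->rgd->rd_length - 1)
|                       return true;                    /* last bitmap */
|               rbm->offset = 0;                        /* next bitmap */
|               rbm->bii++;
|               return false;
|       }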
* GFS2: Introduce rbm field bii (Bob Peterson, 2013-09-18; 2 files, -54/+65)
|   This is a respin of the original patch. As Steve pointed out, the introduction of field bii makes it easy to eliminate bi itself. This revised patch does just that, replacing bi with bii. This patch adds a new field to the rbm structure, called bii, which is an index into the array of bitmaps for an rgrp. This replaces *bi, which was a pointer to the bitmap. This is being done for further optimizations. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Do not reset flags on active reservations (Bob Peterson, 2013-09-17; 1 file, -1/+0)
|   When we used try locks for rgrps on block allocations, it was important to clear the flags field so that we used a blocking hold on the glock. Now that we're not doing try locks, clearing flags is unnecessary, and a waste of time. In fact, it's probably doing the wrong thing, because it clears the GL_SKIP bit that was set for lvb tracking purposes. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: introduce bi_blocks for optimization (Bob Peterson, 2013-09-17; 2 files, -1/+6)
|   This patch introduces a new field in the bitmap structure called bi_blocks. Its purpose is to save us from constantly multiplying bi_len by the constant GFS2_NBBY. It also paves the way for more optimization in a future patch. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: optimize rbm_from_block wrt bi_start (Bob Peterson, 2013-09-17; 1 file, -1/+1)
|   In function gfs2_rbm_from_block, it starts by checking if the block falls within the first bitmap. It does so by checking if the rbm's offset is less than (rbm->bi->bi_start + rbm->bi->bi_len) * GFS2_NBBY. However, the first bitmap will always have bi_start == 0. Therefore this is an unnecessary calculation in a function that gets called billions of times. This patch removes the reference to bi_start. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
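|
|   The one-line change implied by the text above; the before-expression is quoted in the commit message, while the surrounding context is illustrative:
|
|       /* before: bi_start is always 0 for the first bitmap, so the
|        * addition is dead weight on a very hot path */
|       if (rbm->offset < (rbm->bi->bi_start + rbm->bi->bi_len) * GFS2_NBBY)
|               return 0;
|
|       /* after */
|       if (rbm->offset < rbm->bi->bi_len * GFS2_NBBY)
|               return 0;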
* GFS2: d_splice_alias() can't return error (Miklos Szeredi, 2013-09-17; 1 file, -3/+1)
|   ... unless it was given an IS_ERR(inode), which isn't the case here. So clean up the unnecessary error handling in gfs2_create_inode(). This paves the way for real fixes (hence the stable Cc). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: stable@vger.kernel.org
* Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 (Linus Torvalds, 2013-09-16; 1 file, -5/+10)
|\    Pull CIFS fixes from Steve French: "Two minor cifs fixes and a minor documentation cleanup for cifs.txt"
| |   * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
| |     cifs: update cifs.txt and remove some outdated infos
| |     cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache
| |     cifs: Do not take a reference to the page in cifs_readpage_worker()
| * cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache (Sachin Prabhu, 2013-09-13; 1 file, -3/+7)
| |   When reading a single page with cifs_readpage(), we make a call to fscache_read_or_alloc_page() which, once done, asynchronously calls the completion function cifs_readpage_from_fscache_complete(). This completion function unlocks the page once it has been populated from cache. The module then attempts to unlock the page a second time in cifs_readpage(), which leads to warning messages. In case of a successful call to fscache_read_or_alloc_page() we should skip the second unlock_page(), since it will be called by cifs_readpage_from_fscache_complete() once the page has been populated by fscache. With the modifications to cifs_readpage_worker(), we will need to re-grab the page lock in cifs_write_begin(). The problem was first noticed when testing new fscache patches for cifs. https://bugzilla.redhat.com/show_bug.cgi?id=1005737 Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
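| |
| |   A sketch of the corrected control flow; the function and label names are assumptions based on the commit text, not the exact diff:
| |
| |     rc = cifs_readpage_from_fscache(inode, page);
| |     if (rc == 0) {
| |             /* fscache will fill the page asynchronously and the
| |              * completion callback unlocks it; do not unlock here */
| |             goto read_complete;
| |     }
| |     rc = cifs_readpage_worker(file, page, &offset); /* read from server */
| |     unlock_page(page);
| | read_complete:
| |     return rc;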
| * cifs: Do not take a reference to the page in cifs_readpage_worker() (Sachin Prabhu, 2013-09-13; 1 file, -2/+3)
| |   We do not need to take a reference to the pagecache in cifs_readpage_worker() since the calling function will have already taken one before passing the pointer to the page as an argument to the function. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
* | Merge tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs (Linus Torvalds, 2013-09-16; 1 file, -3/+4)
|\ \    Pull ubifs fix from Artem Bityutskiy: "Just one patch which fixes the power-cut recovery testing mode. I'll start using a single UBI/UBIFS tree instead of 2 trees from now on. So in the future you'll get 1 small pull request instead of 2 tiny ones"
| | |   * tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs:
| | |     UBIFS: remove invalid warn msg with tst_recovery enabled
| * | UBIFS: remove invalid warn msg with tst_recovery enabled (Mats Kärrman, 2013-08-16; 1 file, -3/+4)
| | |   Signed-off-by: Mats Karrman <mats.karrman@tritech.se> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
* | | vfs: fix typo in comment in recent dentry work (Linus Torvalds, 2013-09-15; 1 file, -1/+1)
| | |   Sedat points out that I transposed some letters in "LRU" and wrote "RLU" instead in one of the new comments explaining the flow. Let's just fix it. Reported-by: Sedat Dilek <sedat.dilek@jpberlin.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux (Linus Torvalds, 2013-09-14; 2 files, -4/+6)
|\ \ \    Pull writeback fix from Wu Fengguang: "A trivial writeback fix"
| | | |   * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
| | | |     writeback: Do not sort b_io list only because of block device inode
| * | | writeback: Do not sort b_io list only because of block device inode (Jan Kara, 2013-07-09; 2 files, -4/+6)
| | | |   It is very likely that the block device inode will be part of the BDI dirty list as well. However, it doesn't make sense to sort inodes on the b_io list just because of this inode (as it contains buffers from all over the device anyway). So save some CPU cycles, which is valuable since we hold the relatively contended wb->list_lock. Signed-off-by: Jan Kara <jack@suse.cz>
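| | | |
| | | |   A hedged sketch of the idea in move_expired_inodes(), assuming a helper like sb_is_blkdev_sb(); the guard in the committed patch may differ:
| | | |
| | | |     /* while moving expired inodes from delaying_queue to
| | | |      * dispatch_queue: */
| | | |     if (!sb_is_blkdev_sb(inode->i_sb)) {
| | | |             /* the blockdev inode holds buffers from all over the
| | | |              * device, so it alone should not force a per-sb sort */
| | | |             if (sb && sb != inode->i_sb)
| | | |                     do_sb_sort = 1;
| | | |             sb = inode->i_sb;
| | | |     }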
* | | | vfs: fix dentry LRU list handling and nr_dentry_unused accounting (Linus Torvalds, 2013-09-14; 1 file, -27/+101)
| |_|/
|/| |   The LRU list changes interacted badly with our nr_dentry_unused accounting, and even worse with the new DCACHE_LRU_LIST bit logic. This introduces helper functions to make sure everything follows the proper dcache d_lru list rules: the dentry cache is complicated by the fact that some of the hotpaths don't even want to look at the LRU list at all, and the fact that we use the same list entry in the dentry for both the LRU list and for our temporary shrinking lists when removing things from the LRU. The helper functions temporarily have some extra sanity checking for the flag bits that have to match the current LRU state of the dentry. We'll remove that before the final 3.12 release, but considering how easy it is to get wrong, this first cleanup version has some very particular sanity checking. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
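| | |
| | |   Two of the helpers, sketched from the description above (D_FLAG_VERIFY and the list_lru calls are a best guess at that era's dcache code, not a verified quote):
| | |
| | |       #define D_FLAG_VERIFY(dentry, x) \
| | |               WARN_ON_ONCE(((dentry)->d_flags & \
| | |                       (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))
| | |
| | |       static void d_lru_add(struct dentry *dentry)
| | |       {
| | |               D_FLAG_VERIFY(dentry, 0);               /* must be on no list */
| | |               dentry->d_flags |= DCACHE_LRU_LIST;
| | |               this_cpu_inc(nr_dentry_unused);         /* accounting in one place */
| | |               WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru,
| | |                                          &dentry->d_lru));
| | |       }
| | |
| | |       static void d_lru_del(struct dentry *dentry)
| | |       {
| | |               D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
| | |               dentry->d_flags &= ~DCACHE_LRU_LIST;
| | |               this_cpu_dec(nr_dentry_unused);
| | |               WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru,
| | |                                          &dentry->d_lru));
| | |       }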
* | | Merge git://git.kvack.org/~bcrl/aio-next (Linus Torvalds, 2013-09-13; 7 files, -269/+537)
|\ \ \    Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed"
| | | |   * git://git.kvack.org/~bcrl/aio-next: (22 commits)
| | | |     aio: rcu_read_lock protection for new rcu_dereference calls
| | | |     aio: fix race in ring buffer page lookup introduced by page migration support
| | | |     aio: fix rcu sparse warnings introduced by ioctx table lookup patch
| | | |     aio: remove unnecessary debugging from aio_free_ring()
| | | |     aio: table lookup: verify ctx pointer
| | | |     staging/lustre: kiocb->ki_left is removed
| | | |     aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
| | | |     aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
| | | |     aio: convert the ioctx list to table lookup v3
| | | |     aio: double aio_max_nr in calculations
| | | |     aio: Kill ki_dtor
| | | |     aio: Kill ki_users
| | | |     aio: Kill unneeded kiocb members
| | | |     aio: Kill aio_rw_vect_retry()
| | | |     aio: Don't use ctx->tail unnecessarily
| | | |     aio: io_cancel() no longer returns the io_event
| | | |     aio: percpu ioctx refcount
| | | |     aio: percpu reqs_available
| | | |     aio: reqs_active -> reqs_available
| | | |     aio: fix build when migration is disabled
| | | |     ...
| * | | aio: rcu_read_lock protection for new rcu_dereference calls (Artem Savkov, 2013-09-09; 1 file, -0/+6)
| | | |   Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch" (77d30b14d24e557f89c41980011d72428514d729 in linux-next.git) introduced a couple of new rcu_dereference calls which are not protected by rcu_read_lock and result in the following warnings during syscall fuzzing (trinity):
| | | |
| | | |     [ 471.646379] ===============================
| | | |     [ 471.649727] [ INFO: suspicious RCU usage. ]
| | | |     [ 471.653919] 3.11.0-next-20130906+ #496 Not tainted
| | | |     [ 471.657792] -------------------------------
| | | |     [ 471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage!
| | | |     [ 471.665968] other info that might help us debug this:
| | | |     [ 471.672141] rcu_scheduler_active = 1, debug_locks = 1
| | | |     [ 471.677549] 1 lock held by trinity-child0/3774:
| | | |     [ 471.681675] #0: (&(&mm->ioctx_lock)->rlock){+.+...}, at: [<c119ba1a>] SyS_io_setup+0x63a/0xc70
| | | |     [ 471.688721] stack backtrace:
| | | |     [ 471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496
| | | |     [ 471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
| | | |     [ 471.703151] 00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824
| | | |     [ 471.709544] c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac
| | | |     [ 471.716251] c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4
| | | |     [ 471.722902] Call Trace:
| | | |     [ 471.724859] [<c18a814b>] dump_stack+0x4b/0x66
| | | |     [ 471.728772] [<c10a4ec6>] lockdep_rcu_suspicious+0xc6/0x100
| | | |     [ 471.733716] [<c119bc7c>] SyS_io_setup+0x89c/0xc70
| | | |     [ 471.737806] [<c119b8dd>] ? SyS_io_setup+0x4fd/0xc70
| | | |     [ 471.741689] [<c10cf684>] ? __audit_syscall_entry+0x94/0xe0
| | | |     [ 471.746080] [<c18b1fcc>] syscall_call+0x7/0xb
| | | |     [ 471.749723] [<c1080000>] ? task_fork_fair+0x240/0x260
| | | |
| | | |   Signed-off-by: Artem Savkov <artem.savkov@gmail.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
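| | | |
| | | |   The standard RCU pattern the fix applies, shown as a generic sketch rather than the exact diff:
| | | |
| | | |     struct kioctx_table *table;
| | | |
| | | |     rcu_read_lock();
| | | |     table = rcu_dereference(mm->ioctx_table);
| | | |     /* ... use table only inside the read-side critical section ... */
| | | |     rcu_read_unlock();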
| * | | aio: fix race in ring buffer page lookup introduced by page migration support (Benjamin LaHaise, 2013-09-09; 1 file, -3/+12)
| | | |   Prior to the introduction of page migration support in "fs/aio: Add support to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3, mapping of the ring buffer pages was done via get_user_pages() while retaining mmap_sem held for write. This avoided possible races with userland racing an munmap() or mremap(). The page migration patch, however, switched to using mm_populate() to prime the page mapping. mm_populate() cannot be called with mmap_sem held. Instead of dropping the mmap_sem, revert to the old behaviour and simply drop the use of mm_populate() since get_user_pages() will cause the pages to get mapped anyways. Thanks to Al Viro for spotting this issue. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: fix rcu sparse warnings introduced by ioctx table lookup patch (Benjamin LaHaise, 2013-08-30; 1 file, -3/+3)
| | | |   Several sparse warnings were caused by missing rcu_dereference() annotations for dereferencing mm->ioctx_table. Thankfully, none of those were actual bugs as the deref was protected by a spin lock in all instances. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
| * | | aio: remove unnecessary debugging from aio_free_ring() (Benjamin LaHaise, 2013-08-30; 1 file, -5/+0)
| | | |   The commit 36bc08cc0170 ("fs/aio: Add support to aio ring pages migration") added some debugging code that is not required and resulted in a build error when 98474236f72e ("vfs: make the dentry cache use the lockref infrastructure") was added to the tree. The code is not required, so just delete it. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: table lookup: verify ctx pointer (Benjamin LaHaise, 2013-08-08; 1 file, -1/+1)
| | | |   Another shortcoming of the table lookup patch was revealed where the pointer was not being tested before being dereferenced. Verify this to avoid the NULL pointer dereference. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" (Benjamin LaHaise, 2013-08-05; 1 file, -8/+9)
| | | |   In the patch "aio: convert the ioctx list to table lookup v3", incorrect handling in the ioctx_alloc() error path was introduced that led to an ioctx being added via ioctx_add_table() while freed when the ioctx_alloc() call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by only calling ioctx_add_table() as the last step in ioctx_alloc(). Also, several unnecessary rcu_dereference() calls were added that lead to RCU warnings where the system was already protected by a spin lock for accessing mm->ioctx_table. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: be defensive to ensure request batching is non-zero instead of BUG_ON() (Benjamin LaHaise, 2013-07-31; 1 file, -1/+2)
| | | |   In the event that an overflow/underflow occurs while calculating req_batch, clamp the minimum at 1 request instead of doing a BUG_ON(). Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
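| | | |
| | | |   The clamp, sketched; the batch-size formula shown is an assumption included for context, only the min-clamp is what this commit describes:
| | | |
| | | |     ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
| | | |     if (ctx->req_batch < 1)
| | | |             ctx->req_batch = 1;     /* was: BUG_ON(!ctx->req_batch) */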
| * | | aio: convert the ioctx list to table lookup v3 (Benjamin LaHaise, 2013-07-30; 1 file, -22/+114)
| | | |   On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
| | | |   > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
| | | |   > > When using a large number of threads performing AIO operations the
| | | |   > > IOCTX list may get a significant number of entries which will cause
| | | |   > > significant overhead. For example, when running this fio script:
| | | |   > >
| | | |   > >   rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
| | | |   > >   blocksize=1024; numjobs=512; thread; loops=100
| | | |   > >
| | | |   > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
| | | |   > > 30% CPU time spent by lookup_ioctx:
| | | |   > >
| | | |   > >   32.51%  [guest.kernel]  [g] lookup_ioctx
| | | |   > >    9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
| | | |   > >    4.40%  [guest.kernel]  [g] lock_release
| | | |   > >    4.19%  [guest.kernel]  [g] sched_clock_local
| | | |   > >    3.86%  [guest.kernel]  [g] local_clock
| | | |   > >    3.68%  [guest.kernel]  [g] native_sched_clock
| | | |   > >    3.08%  [guest.kernel]  [g] sched_clock_cpu
| | | |   > >    2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
| | | |   > >    2.60%  [guest.kernel]  [g] memcpy
| | | |   > >    2.33%  [guest.kernel]  [g] lock_acquired
| | | |   > >    2.25%  [guest.kernel]  [g] lock_acquire
| | | |   > >    1.84%  [guest.kernel]  [g] do_io_submit
| | | |   > >
| | | |   > > This patch converts the ioctx list to a radix tree. For a performance
| | | |   > > comparison the above FIO script was run on a 2 sockets 8 core
| | | |   > > machine. These are the results (average and %rsd of 10 runs) for the
| | | |   > > original list based implementation and for the radix tree based
| | | |   > > implementation:
| | | |   > >
| | | |   > >   cores      1          2         4         8         16        32
| | | |   > >   list       109376 ms  69119 ms  35682 ms  22671 ms  19724 ms  16408 ms
| | | |   > >   %rsd       0.69%      1.15%     1.17%     1.21%     1.71%     1.43%
| | | |   > >   radix      73651 ms   41748 ms  23028 ms  16766 ms  15232 ms  13787 ms
| | | |   > >   %rsd       1.19%      0.98%     0.69%     1.13%     0.72%     0.75%
| | | |   > >   % of radix relative to list
| | | |   > >              66.12%     65.59%    66.63%    72.31%    77.26%    83.66%
| | | |   > >
| | | |   > > To consider the impact of the patch on the typical case of having
| | | |   > > only one ctx per process the following FIO script was run:
| | | |   > >
| | | |   > >   rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
| | | |   > >   blocksize=1024; numjobs=1; thread; loops=100
| | | |   > >
| | | |   > > on the same system and the results are the following:
| | | |   > >
| | | |   > >   list       58892 ms
| | | |   > >   %rsd       0.91%
| | | |   > >   radix      59404 ms
| | | |   > >   %rsd       0.81%
| | | |   > >   % of radix relative to list
| | | |   > >              100.87%
| | | |   >
| | | |   > So, I was just doing some benchmarking/profiling to get ready to send
| | | |   > out the aio patches I've got for 3.11 - and it looks like your patch is
| | | |   > causing a ~1.5% throughput regression in my testing :/
| | | |   ... <snip>
| | | |
| | | |   I've got an alternate approach for fixing this wart in lookup_ioctx()... Instead of using an rbtree, just use the reserved id in the ring buffer header to index an array pointing to the ioctx. It's not finished yet, and it needs to be tidied up, but is most of the way there.
| | | |   -ben
| | | |   --
| | | |   "Thought is the essence of where you are now."
| | | |   --
| | | |   kmo> And, a rework of Ben's code, but this was entirely his idea
| | | |   kmo> -Kent
| | | |   bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually free memory.
| | | |   Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
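| | | |
| | | |   A sketch of the resulting O(1) lookup, reconstructed from the description above; field names such as ioctx_table and user_id are close-to-real but unverified details:
| | | |
| | | |     static struct kioctx *lookup_ioctx(unsigned long ctx_id)
| | | |     {
| | | |             struct aio_ring __user *ring = (void __user *)ctx_id;
| | | |             struct mm_struct *mm = current->mm;
| | | |             struct kioctx *ctx, *ret = NULL;
| | | |             struct kioctx_table *table;
| | | |             unsigned id;
| | | |
| | | |             /* the reserved id field in the user-visible ring header
| | | |              * indexes a small per-mm array: O(1) instead of a list walk */
| | | |             if (get_user(id, &ring->id))
| | | |                     return NULL;
| | | |
| | | |             rcu_read_lock();
| | | |             table = rcu_dereference(mm->ioctx_table);
| | | |             if (!table || id >= table->nr)
| | | |                     goto out;
| | | |
| | | |             ctx = table->table[id];
| | | |             if (ctx && ctx->user_id == ctx_id) {
| | | |                     percpu_ref_get(&ctx->users);    /* still need a ref */
| | | |                     ret = ctx;
| | | |             }
| | | |     out:
| | | |             rcu_read_unlock();
| | | |             return ret;
| | | |     }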
| * | | aio: double aio_max_nr in calculations (Benjamin LaHaise, 2013-07-30; 1 file, -2/+2)
| | | |   With the changes to use percpu counters for aio event ring size calculation, existing increases to aio_max_nr are now insufficient to allow for the allocation of enough events. Double the value used for aio_max_nr to account for the doubling introduced by the percpu slack. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: Kill ki_dtor (Kent Overstreet, 2013-07-30; 1 file, -2/+0)
| | | |   sock_aio_dtor() is dead code - and stuff that does need to do cleanup can simply do it before calling aio_complete(). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: Kill ki_users (Kent Overstreet, 2013-07-30; 1 file, -35/+12)
| | | |   The kiocb refcount is only needed for cancellation - to ensure a kiocb isn't freed while a ki_cancel callback is running. But if we restrict ki_cancel callbacks to not block (which they currently don't), we can simply drop the refcount. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: Kill unneeded kiocb members (Kent Overstreet, 2013-07-30; 1 file, -29/+40)
| | | |   The old aio retry infrastructure needed to save the various arguments to aio operations. But with the retry infrastructure gone, we can trim struct kiocb quite a bit. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: Kill aio_rw_vect_retry() (Kent Overstreet, 2013-07-30; 6 files, -82/+23)
| | | |   This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: Don't use ctx->tail unnecessarily (Kent Overstreet, 2013-07-30; 1 file, -19/+23)
| | | |   aio_complete() (arguably) needs to keep its own trusted copy of the tail pointer, but io_getevents() doesn't have to use it - it's already using the head pointer from the ring buffer. So convert it to use the tail from the ring buffer so it touches fewer cachelines and doesn't contend with the cacheline aio_complete() needs. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: io_cancel() no longer returns the io_event (Kent Overstreet, 2013-07-30; 1 file, -30/+10)
| | | |   Originally, io_cancel() was documented to return the io_event if cancellation succeeded - the io_event wouldn't be delivered via the ring buffer like it normally would. But this isn't what the implementation was actually doing; the only driver implementing cancellation, the usb gadget code, never returned an io_event in its cancel function. And aio_complete() was recently changed to no longer suppress event delivery if the kiocb had been cancelled. This gets rid of the unused io_event argument to kiocb_cancel() and kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if kiocb->ki_cancel() returned success. Also tweak the refcounting in kiocb_cancel() to make more sense. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: percpu ioctx refcount (Kent Overstreet, 2013-07-30; 1 file, -39/+27)
| | | |   This just converts the ioctx refcount to the new generic dynamic percpu refcount code. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
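| | | |
| | | |   The percpu-refcount API being adopted, as a minimal sketch (ioctx_release is a hypothetical name; the real code defers the free differently):
| | | |
| | | |     #include <linux/percpu-refcount.h>
| | | |
| | | |     static void ioctx_release(struct percpu_ref *ref)
| | | |     {
| | | |             struct kioctx *ctx = container_of(ref, struct kioctx, users);
| | | |             kfree(ctx);     /* sketch; real code defers freeing */
| | | |     }
| | | |
| | | |     /* setup: gets/puts become cheap percpu increments */
| | | |     if (percpu_ref_init(&ctx->users, ioctx_release))
| | | |             goto err;
| | | |
| | | |     percpu_ref_get(&ctx->users);        /* e.g. in lookup_ioctx() */
| | | |     percpu_ref_put(&ctx->users);
| | | |
| | | |     /* teardown: collapse to atomic mode and drop the initial ref;
| | | |      * ioctx_release() runs once the count hits zero */
| | | |     percpu_ref_kill(&ctx->users);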
| * | | aio: percpu reqs_available (Kent Overstreet, 2013-07-30; 1 file, -7/+99)
| | | |   See the previous patch ("aio: reqs_active -> reqs_available") for why we want to do this - this basically implements a per cpu allocator for reqs_available that doesn't actually allocate anything. Note that we need to increase the size of the ringbuffer we allocate, since a single thread won't necessarily be able to use all the reqs_available slots - some (up to about half) might be on other per cpu lists, unavailable for the current thread. We size the ringbuffer based on the nr_events userspace passed to io_setup(), so this is a slight behaviour change - but nr_events wasn't being used as a hard limit before, it was being rounded up to the next page before, so this doesn't change the actual semantics. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
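| | | |
| | | |   Roughly how such a percpu "allocator that allocates nothing" works, per the description above (a sketch, not the verbatim patch):
| | | |
| | | |     struct kioctx_cpu {
| | | |             unsigned reqs_available;    /* slots cached by this cpu */
| | | |     };
| | | |
| | | |     static bool get_reqs_available(struct kioctx *ctx)
| | | |     {
| | | |             struct kioctx_cpu *kcpu;
| | | |             bool ret = false;
| | | |
| | | |             preempt_disable();
| | | |             kcpu = this_cpu_ptr(ctx->cpu);
| | | |
| | | |             if (!kcpu->reqs_available) {
| | | |                     /* local cache empty: grab a batch from the
| | | |                      * shared atomic counter */
| | | |                     int old, avail = atomic_read(&ctx->reqs_available);
| | | |                     do {
| | | |                             if (avail < ctx->req_batch)
| | | |                                     goto out;   /* globally exhausted */
| | | |                             old = avail;
| | | |                             avail = atomic_cmpxchg(&ctx->reqs_available,
| | | |                                             avail, avail - ctx->req_batch);
| | | |                     } while (avail != old);
| | | |                     kcpu->reqs_available += ctx->req_batch;
| | | |             }
| | | |             ret = true;
| | | |             kcpu->reqs_available--;     /* common case: purely cpu-local */
| | | |     out:
| | | |             preempt_enable();
| | | |             return ret;
| | | |     }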
| * | | aio: reqs_active -> reqs_available (Kent Overstreet, 2013-07-30; 1 file, -16/+22)
| | | |   The number of outstanding kiocbs is one of the few shared things left that has to be touched for every kiocb - it'd be nice to make it percpu. We can make it per cpu by treating it like an allocation problem: we have a maximum number of kiocbs that can be outstanding (i.e. slots) - then we just allocate and free slots, and we know how to write per cpu allocators. So as prep work for that, we convert reqs_active to reqs_available. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: fix build when migration is disabled (Benjamin LaHaise, 2013-07-17; 1 file, -0/+4)
| | | |   When "fs/aio: Add support to aio ring pages migration" was applied, it broke the build when CONFIG_MIGRATION was disabled. Wrap the migration code with a test for CONFIG_MIGRATION to fix this and save a few bytes when migration is disabled. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | fs/aio: Add support to aio ring pages migration (Gu Zheng, 2013-07-16; 1 file, -11/+108)
| | | |   Because aio jobs pin the ring pages, memory migration of those pages used to fail. To fix this, use an anon inode to manage the aio ring pages and set up the migratepage callback in the anon inode's address space, so that during memory migration the aio ring pages are moved to another memory node safely. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | fs/anon_inode: Introduce a new lib function anon_inode_getfile_private() (Gu Zheng, 2013-07-16; 1 file, -0/+66)
| | | |   Introduce a new lib function anon_inode_getfile_private(). It creates a new file instance by hooking it up to an anonymous inode and a dentry that describes the "class" of the file, similar to anon_inode_getfile(), but each file holds a single inode. Furthermore, anyone who wants to create a private anon file will benefit from this change. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
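| | | |
| | | |   Typical usage per the description above (a sketch; the aio ring hookup shown follows the previous patch, with aio_ring_fops as an assumed name):
| | | |
| | | |     /* each call returns a file backed by its OWN inode, so the
| | | |      * caller can install per-file address_space operations
| | | |      * (e.g. a migratepage callback for the aio ring) */
| | | |     struct file *file;
| | | |
| | | |     file = anon_inode_getfile_private("[aio]", &aio_ring_fops,
| | | |                                       ctx, O_RDWR);
| | | |     if (IS_ERR(file))
| | | |             return PTR_ERR(file);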
* | | | Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs (Linus Torvalds, 2013-09-13; 25 files, -166/+461)
|\ \ \ \    Pull xfs update #2 from Ben Myers: "Here we have defrag support for v5 superblock, a number of bugfixes and a cleanup or two.
| | | | |     - defrag support for CRC filesystems
| | | | |     - fix endian warning in xlog_recover_get_buf_lsn
| | | | |     - fixes for sparse warnings
| | | | |     - fix for assert in xfs_dir3_leaf_hdr_from_disk
| | | | |     - fix for log recovery of remote symlinks
| | | | |     - fix for log recovery of btree root splits
| | | | |     - fixes for memory allocation failures with ACLs
| | | | |     - fix for assert in xfs_buf_item_relse
| | | | |     - fix for assert in xfs_inode_buf_verify
| | | | |     - fix an assignment in an assert that should be a test in xfs_bmbt_change_owner
| | | | |     - remove dead code in xlog_recover_inode_pass2"
| | | | |   * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
| | | | |     xfs: remove dead code from xlog_recover_inode_pass2
| | | | |     xfs: = vs == typo in ASSERT()
| | | | |     xfs: don't assert fail on bad inode numbers
| | | | |     xfs: aborted buf items can be in the AIL.
| | | | |     xfs: factor all the kmalloc-or-vmalloc fallback allocations
| | | | |     xfs: fix memory allocation failures with ACLs
| | | | |     xfs: ensure we copy buffer type in da btree root splits
| | | | |     xfs: set remote symlink buffer type for recovery
| | | | |     xfs: recovery of swap extents operations for CRC filesystems
| | | | |     xfs: swap extents operations for CRC filesystems
| | | | |     xfs: check magic numbers in dir3 leaf verifier first
| | | | |     xfs: fix some minor sparse warnings
| | | | |     xfs: fix endian warning in xlog_recover_get_buf_lsn()