linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	inode: move to per-sb LRU locks	Dave Chinner	2011-07-20	3	-15/+16
\| \| \| \| \| \| \| \| \| \|	With the inode LRUs moving to per-sb structures, there is no longer a need for a global inode_lru_lock. The locking can be made more fine-grained by moving to a per-sb LRU lock, isolating the LRU operations of different filesystems completely from each other. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	inode: Make unused inode LRU per superblock	Dave Chinner	2011-07-20	3	-11/+85
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inode unused list is currently a global LRU. This does not match the other global filesystem cache - the dentry cache - which uses per-superblock LRU lists. Hence we have related filesystem object types using different LRU reclamation schemes. To enable a per-superblock filesystem cache shrinker, both of these caches need to have per-sb unused object LRU lists. Hence this patch converts the global inode LRU to per-sb LRUs. The patch only does rudimentary per-sb proportioning in the shrinker infrastructure, as this gets removed when the per-sb shrinker callouts are introduced later on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	inode: convert inode_stat.nr_unused to per-cpu counters	Dave Chinner	2011-07-20	1	-5/+11
\| \| \| \| \| \| \| \| \|	Before we split up the inode_lru_lock, the unused inode counter needs to be made independent of the global inode_lru_lock. Convert it to per-cpu counters to do this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
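The idea of a counter that no longer depends on a global lock can be sketched in userspace C as below. This is only an illustration under stated assumptions: `NR_SLOTS`, `unused_inc()`, `unused_dec()` and `unused_total()` are hypothetical names, a plain array stands in for real per-cpu storage, and the clamp of a negative sum to zero is an assumption about how a reader might treat a momentarily skewed total.

```c
/* Hypothetical number of CPUs in this model. */
#define NR_SLOTS 4

/* One counter per CPU slot; each slot is only written by "its" CPU. */
static long nr_unused[NR_SLOTS];

/* An inode became unused on the given CPU. */
static void unused_inc(int cpu)
{
    nr_unused[cpu]++;
}

/* An inode left the unused state; this may happen on a different CPU
 * than the increment, so an individual slot can go negative. */
static void unused_dec(int cpu)
{
    nr_unused[cpu]--;
}

/* Readers sum all slots. The sum is approximate while updates are in
 * flight, so a negative result is reported as zero (assumption). */
static long unused_total(void)
{
    long sum = 0;

    for (int i = 0; i < NR_SLOTS; i++)
        sum += nr_unused[i];
    return sum < 0 ? 0 : sum;
}
```

Only the sum across slots is meaningful; a single slot may hold a negative value without anything being wrong.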
*	vmscan: add customisable shrinker batch size	Dave Chinner	2011-07-20	2	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	For shrinkers that have their own cond_resched* calls, having shrink_slab break the work down into small batches is not particularly efficient. Add a custom batchsize field to the struct shrinker so that shrinkers can use a larger batch size if they desire. A value of zero (uninitialised) means "use the default", so behaviour is unchanged by this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
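The "zero means use the default" rule can be sketched as follows. This is a userspace model, not the patch itself: `struct shrinker_model`, `effective_batch()` and `full_batches()` are hypothetical names, and the default of 128 is an assumption made for the example.

```c
/* Assumed default batch size for this sketch. */
#define SHRINK_BATCH 128

/* Hypothetical stand-in for the relevant part of struct shrinker. */
struct shrinker_model {
    long batch; /* 0 (uninitialised) means "use the default" */
};

/* Pick the batch size a shrink loop would use for this shrinker. */
static long effective_batch(const struct shrinker_model *s)
{
    return s->batch ? s->batch : SHRINK_BATCH;
}

/* Number of full-batch callouts needed to cover total_scan objects;
 * any remainder smaller than one batch is left for a later call. */
static long full_batches(const struct shrinker_model *s, long total_scan)
{
    return total_scan / effective_batch(s);
}
```

With a scan count of 4096, the default batch gives 32 callouts while a shrinker that sets its batch to 1024 gets only 4, which is the efficiency gain the message describes.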
*	vmscan: reduce wind up shrinker->nr when shrinker can't do work	Dave Chinner	2011-07-20	1	-0/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a shrinker returns -1 to shrink_slab() to indicate it cannot do any work given the current memory reclaim requirements, it adds the entire total_scan count to shrinker->nr. The idea behind this is that when the shrinker is next called and can do work, it will do the work of the previously aborted shrinker call as well. However, if a filesystem is doing lots of allocation with GFP_NOFS set, then we get many, many more aborts from the shrinkers than we do successful calls. The result is that shrinker->nr winds up to its maximum permissible value (twice the current cache size) and then when the next shrinker call that can do work is issued, it has enough scan count built up to free the entire cache twice over. This manifests itself in the cache going from full to empty in a matter of seconds, even when only a small part of the cache needs to be emptied to free sufficient memory. Under metadata intensive workloads on ext4 and XFS, I'm seeing the VFS caches increase memory consumption up to 75% of memory (no page cache pressure) over a period of 30-60s, and then the shrinker empties them down to zero in the space of 2-3s. This cycle repeats over and over again, with the shrinker completely trashing the inode and dentry caches every minute or so the workload continues. This behaviour was made obvious by the shrink_slab tracepoints added earlier in the series, and made worse by the patch that corrected the concurrent accounting of shrinker->nr. To avoid this problem, stop repeated small increments of the total scan value from winding shrinker->nr up to a value that can cause the entire cache to be freed.
We still need to allow it to wind up, so use the delta as the "large scan" threshold check - if the delta is more than a quarter of the entire cache size, then it is a large scan and allowed to cause lots of windup because we are clearly needing to free lots of memory. If it isn't a large scan then limit the total scan to half the size of the cache so that windup never increases to consume the whole cache. Reducing the total scan limit further does not allow enough wind-up to maintain the current levels of performance, whilst a higher threshold does not prevent the windup from freeing the entire cache under sustained workloads. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
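The thresholds in this message (a quarter of the cache for a "large scan", half the cache as the cap otherwise, twice the cache as the overall ceiling) can be written out as a small userspace sketch. The function name `clamp_total_scan()` and its parameter names are assumptions for illustration; the actual patch may express the comparisons differently, in particular at the exact boundary values.

```c
#include <stdint.h>

/*
 * Hypothetical model of the windup limit described above.
 *   max_pass   - current number of objects in the cache
 *   delta      - scan count requested by this call alone
 *   total_scan - delta plus work deferred from earlier aborted calls
 */
static uint64_t clamp_total_scan(uint64_t total_scan, uint64_t delta,
                                 uint64_t max_pass)
{
    /* Not a large scan (delta is at most a quarter of the cache):
     * limit the accumulated work to half the cache. */
    if (delta <= max_pass / 4 && total_scan > max_pass / 2)
        total_scan = max_pass / 2;

    /* Overall ceiling from the message: twice the cache size. */
    if (total_scan > max_pass * 2)
        total_scan = max_pass * 2;

    return total_scan;
}
```

For a cache of 1000 objects, a small request of 10 with 1000 objects of accumulated windup is cut back to 500, while a request of 600 is treated as a large scan and only hits the ceiling of 2000.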
*	vmscan: shrinker->nr updates race and go wrong	Dave Chinner	2011-07-20	1	-13/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	shrink_slab() allows shrinkers to be called in parallel so the struct shrinker can be updated concurrently. It does not provide any exclusion for such updates, so we can get the shrinker->nr value increasing or decreasing incorrectly. As a result, when a shrinker repeatedly returns a value of -1 (e.g. a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire, sometimes updating with the scan count that wasn't used, sometimes losing it altogether. Worse is when a shrinker does work and that update is lost due to racy updates, which means the shrinker will do the work again! Fix this by making the total_scan calculations independent of shrinker->nr, and making the shrinker->nr updates atomic w.r.t. other updates via cmpxchg loops. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
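A compare-and-exchange loop of the kind mentioned here can be sketched with C11 atomics in userspace. This is an analogy, not the kernel code: the kernel uses its own cmpxchg() primitive, and `deferred_nr`, `take_deferred()` and `return_unused()` are hypothetical names standing in for shrinker->nr and the two update sites.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the deferred scan count (shrinker->nr). */
static atomic_long deferred_nr;

/* Atomically claim all deferred work, leaving zero behind, so that two
 * concurrent callers can never both pick up the same count. */
static long take_deferred(void)
{
    long old = atomic_load(&deferred_nr);

    /* On failure, 'old' is refreshed with the current value and the
     * exchange is retried. */
    while (!atomic_compare_exchange_weak(&deferred_nr, &old, 0))
        ;
    return old;
}

/* Atomically add back work that could not be done. The new value is
 * recomputed from the freshly observed 'old' on every retry, so a
 * concurrent update is never overwritten. */
static long return_unused(long unused)
{
    long old = atomic_load(&deferred_nr);

    while (!atomic_compare_exchange_weak(&deferred_nr, &old, old + unused))
        ;
    return old + unused;
}
```

Because each caller works on a private copy taken by `take_deferred()`, the scan calculation itself no longer reads the shared counter, which mirrors the "independent of shrinker->nr" part of the fix.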
*	vmscan: add shrink_slab tracepoints	Dave Chinner	2011-07-20	2	-1/+84
\| \| \| \| \| \| \| \| \|	It is impossible to understand what the shrinkers are actually doing without instrumenting the code, so add some tracepoints to allow insight to be gained. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)	Al Viro	2011-07-20	15	-94/+39
\| \| \| \| \| \|	... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
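Why this simplifies callers can be seen in a small userspace model of error pointers. Everything here is a sketch under assumptions: the `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` helpers are re-implemented locally in the style of the kernel's, and `struct inode_model`, `struct dentry_model` and `splice_alias_model()` are hypothetical stand-ins, not the real VFS types or the real d_splice_alias().

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Local re-implementations of the error-pointer helpers (assumption:
 * errors are encoded as the top MAX_ERRNO addresses). */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct inode_model { int ino; };
struct dentry_model { struct inode_model *inode; };

/* Hypothetical model: an error pointer passed in as the inode comes
 * straight back out as the result, so the caller needs no IS_ERR()
 * check of its own before calling. */
static struct dentry_model *splice_alias_model(struct inode_model *inode,
                                               struct dentry_model *dentry)
{
    if (IS_ERR(inode))
        return ERR_PTR(PTR_ERR(inode));

    dentry->inode = inode; /* a NULL inode gives a negative dentry */
    return NULL;
}
```

A lookup routine in this model can then end with a single `return splice_alias_model(inode_or_error, dentry);` whether the inode lookup succeeded, found nothing, or failed.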
*	deuglify squashfs_lookup()	Al Viro	2011-07-20	1	-4/+1
\| \| \| \| \| \| \| \|	d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL so no need for that if (inode) ... in there (or ERR_PTR(0), for that matter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	nfsd4_list_rec_dir(): don't bother with reopening rec_file	Al Viro	2011-07-20	1	-31/+21
\| \| \| \| \| \| \|	just rewind it to the beginning before vfs_readdir() and be done with that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	kill useless checks for sb->s_op == NULL	Al Viro	2011-07-20	3	-3/+2
\| \| \| \| \| \|	never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	btrfs: kill magical embedded struct superblock	Al Viro	2011-07-20	5	-22/+31
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	get rid of pointless checks for dentry->sb == NULL	Al Viro	2011-07-20	2	-2/+1
\| \| \| \| \| \|	it never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Make ->d_sb assign-once and always non-NULL	Al Viro	2011-07-20	3	-39/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name). Allocates dentry, sets its ->d_sb to given superblock and sets ->d_op accordingly. Old d_alloc(NULL, name) callers are converted to that (all of them know what superblock they want). d_alloc() itself is left only for parent != NULL case; uses __d_alloc(), inserts result into the list of parent's children. Note that now ->d_sb is assign-once and never NULL and ->d_parent is never NULL either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	unexport kern_path_parent()	Al Viro	2011-07-20	1	-1/+0
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch vfs_path_lookup() to struct path	Al Viro	2011-07-20	5	-28/+26
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	kill lookup_create()	Al Viro	2011-07-20	2	-37/+18
\| \| \| \| \| \|	folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	devtmpfs: get rid of bogus mkdir in create_path()	Al Viro	2011-07-20	1	-24/+18
\| \| \| \| \| \| \| \| \|	We do _NOT_ want to mkdir the path itself - we are preparing to mknod it, after all. Normally it'll fail with -ENOENT and just do nothing, but if somebody has created the parent in the meanwhile, we'll get buggered... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch devtmpfs to kern_path_create()	Al Viro	2011-07-20	1	-47/+36
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch devtmpfs object creation/removal to separate kernel thread	Al Viro	2011-07-20	1	-73/+149
\| \| \| \| \| \| \| \|	... and give it a namespace where devtmpfs would be mounted on root, thus avoiding abuses of vfs_path_lookup() (it was never intended to be used with LOOKUP_PARENT). Games with credentials are also gone. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	make sure that nsproxy_cache is initialized early enough	Al Viro	2011-07-20	3	-3/+3
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch do_spufs_create() to user_path_create(), fix double-unlock	Al Viro	2011-07-20	3	-32/+21
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	new helpers: kern_path_create/user_path_create	Al Viro	2011-07-20	4	-137/+106
\| \| \| \| \| \| \|	combination of kern_path_parent() and lookup_create(). Does not expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	kill LOOKUP_CONTINUE	Al Viro	2011-07-20	2	-9/+3
\| \| \| \| \| \|	LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	nfs: LOOKUP_{OPEN,CREATE,EXCL} is set only on the last step	Al Viro	2011-07-20	1	-4/+2
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	cifs_lookup(): LOOKUP_OPEN is set only on the last component	Al Viro	2011-07-20	1	-1/+1
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	ceph: LOOKUP_OPEN is set only when it's the last component	Al Viro	2011-07-20	1	-1/+0
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	jfs_ci_revalidate() is safe from RCU mode	Al Viro	2011-07-20	1	-2/+0
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	LOOKUP_CREATE and LOOKUP_RENAME_TARGET can be set only on the last step	Al Viro	2011-07-20	3	-12/+6
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	no need to check for LOOKUP_OPEN in ->create() instances	Al Viro	2011-07-20	5	-10/+10
\| \| \| \| \| \| \|	... it will be set in nd->flags for all cases with non-NULL nd (i.e. when called from do_last()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	don't pass nameidata to vfs_create() from ecryptfs_create()	Al Viro	2011-07-20	1	-28/+5
\| \| \| \| \| \| \| \| \|	Instead of playing with removal of LOOKUP_OPEN, mangling (and restoring) nd->path, just pass NULL to vfs_create(). The whole point of what's being done there is to suppress any attempts to open file by underlying fs, which is what nd == NULL indicates. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	don't transliterate lower bits of ->intent.open.flags to FMODE_...	Al Viro	2011-07-20	7	-31/+24
\| \| \| \| \| \|	->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Don't pass nameidata when calling vfs_create() from mknod()	Al Viro	2011-07-20	1	-1/+1
\| \| \| \| \| \| \|	All instances can cope with that now (and ceph one actually starts working properly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fix mknod() on nfs4 (hopefully)	Al Viro	2011-07-20	1	-12/+12
\| \| \| \| \| \| \| \| \|	a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE) b) default (!LOOKUP_OPEN) open_flags is O_CREAT\|O_EXCL\|FMODE_READ, not 0 c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN; otherwise we need to issue CLOSE, lest we leak stateid on server. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	nameidata_to_nfs_open_context() doesn't need nameidata, actually...	Al Viro	2011-07-20	1	-6/+7
\| \| \| \| \| \| \|	just open flags; switched to passing just those and renamed to create_nfs_open_context() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	nfs_open_context doesn't need struct path either	Al Viro	2011-07-20	8	-44/+42
\| \| \| \| \| \|	just dentry, please... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	nfs4_opendata doesn't need struct path either	Al Viro	2011-07-20	1	-23/+22
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	nfs4_closedata doesn't need to mess with struct path	Al Viro	2011-07-20	3	-22/+21
\| \| \| \| \| \| \|	instead of path_get()/path_put(), we can just use nfs_sb_{,de}active() to pin the superblock down. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	cifs: fix the type of cifs_demultiplex_thread()	Al Viro	2011-07-20	1	-2/+3
\| \| \| \| \| \| \| \| \|	... and get rid of a bogus typecast, while we are at it; it's not just that we want a function returning int and not void, but cast to pointer to function taking void * and returning void would be (void (*)(void *)) and not (void *)(void *), TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
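The difference between the two spellings in this message is plain C declarator syntax and can be shown without any kernel code. In the sketch below, `demultiplex_model()`, `thread_fn_t` and `run_thread_fn()` are hypothetical names; the point is only that a correctly typed function needs no cast at all.

```c
/*
 * Type names involved:
 *   void (*)(void *)   pointer to function taking void *, returning void
 *   int  (*)(void *)   pointer to function taking void *, returning int
 *   (void *)(void *)   not a function pointer type: it reads as a cast
 *                      to void * followed by a parenthesised "void *"
 */

/* A thread-style entry point with the wanted type: takes void *,
 * returns int. */
static int demultiplex_model(void *arg)
{
    return *(int *)arg + 1;
}

/* Hypothetical consumer that expects exactly that function type. */
typedef int (*thread_fn_t)(void *);

static int run_thread_fn(thread_fn_t fn, void *arg)
{
    return fn(arg);
}
```

Calling `run_thread_fn(demultiplex_model, &value)` compiles cleanly with no typecast, which is the state the commit moves towards.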
*	ecryptfs_inode_permission() doesn't need to bail out on RCU	Al Viro	2011-07-20	1	-2/+0
\| \| \| \| \| \| \|	... now that inode_permission() can take MAY_NOT_BLOCK and handle it properly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	kill IPERM_FLAG_RCU	Al Viro	2011-07-20	1	-2/+0
\| \| \| \| \| \|	not used anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	->permission() sanitizing: document API changes	Al Viro	2011-07-20	1	-3/+7
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	merge do_revalidate() into its only caller	Al Viro	2011-07-20	1	-24/+18
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	no reason to keep exec_permission() separate now	Al Viro	2011-07-20	1	-41/+4
\| \| \| \| \| \|	cache footprint alone makes it a bad idea... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	massage generic_permission() to treat directories on a separate path	Al Viro	2011-07-20	1	-4/+13
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	->permission() sanitizing: don't pass flags to exec_permission()	Al Viro	2011-07-20	3	-27/+7
\| \| \| \| \| \| \|	pass mask instead; kill security_inode_exec_permission() since we can use security_inode_permission() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	selinux: don't transliterate MAY_NOT_BLOCK to IPERM_FLAG_RCU	Al Viro	2011-07-20	2	-3/+3
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	->permission() sanitizing: don't pass flags to ->inode_permission()	Al Viro	2011-07-20	5	-8/+13
\| \| \| \| \| \|	pass that via mask instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	->permission() sanitizing: don't pass flags to ->permission()	Al Viro	2011-07-20	31	-55/+55
\| \| \| \| \| \|	not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	->permission() sanitizing: don't pass flags to generic_permission()	Al Viro	2011-07-20	16	-19/+18
\| \| \| \| \| \| \|	redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
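The end state of the ->permission() series, where "may not block" travels inside the mask instead of a separate flags argument, can be sketched as a userspace model. The bit values, `permission_model()` and its `needs_blocking` parameter are assumptions for illustration; returning -ECHILD to ask the caller to retry in blocking mode is likewise assumed here rather than taken from these commit messages.

```c
#include <errno.h>

/* Assumed bit values for this sketch. */
#define MAY_EXEC      0x01
#define MAY_WRITE     0x02
#define MAY_READ      0x04
#define MAY_NOT_BLOCK 0x80

/*
 * Hypothetical permission check with a single mask argument.
 *   mode_bits      - access bits the object grants to the caller
 *   mask           - requested access, optionally with MAY_NOT_BLOCK
 *   needs_blocking - nonzero if answering would require sleeping
 */
static int permission_model(int mode_bits, int mask, int needs_blocking)
{
    int want = mask & (MAY_READ | MAY_WRITE | MAY_EXEC);

    /* Caller cannot sleep but we would have to: ask for a retry. */
    if ((mask & MAY_NOT_BLOCK) && needs_blocking)
        return -ECHILD;

    return (mode_bits & want) == want ? 0 : -EACCES;
}
```

Since the non-blocking request is just another bit in `mask`, every layer that forwards the mask forwards it too, which is why a separate flags parameter became redundant.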