summaryrefslogtreecommitdiffstats
path: root/fs/dcache.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* mm: remove include/linux/bootmem.hMike Rapoport2018-10-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dcache: allocate external names from reclaimable kmalloc cachesVlastimil Babka2018-10-271-29/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use the newly introduced kmalloc-reclaimable-X caches, to allocate external names in dcache, which will take care of the proper accounting automatically, and also improve anti-fragmentation page grouping. This effectively reverts commit f1782c9bc547 ("dcache: account external names as indirectly reclaimable memory") and instead passes __GFP_RECLAIMABLE to kmalloc(). The accounting thus moves from NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also considered in MemAvailable calculation and overcommit decisions. Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot()Tetsuo Handa2018-08-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes are initialized at __d_alloc(), we can't copy the whole size unconditionally. WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50) 636f6e66696766732e746d70000000000010000000000000020000000188ffff i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u ^ RIP: 0010:take_dentry_name_snapshot+0x28/0x50 RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246 RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002 RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60 RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001 R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00 R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0 take_dentry_name_snapshot+0x28/0x50 vfs_rename+0x128/0x870 SyS_rename+0x3b2/0x3d0 entry_SYSCALL_64_fastpath+0x1a/0xa4 0xffffffffffffffff Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*-. Merge branches 'work.misc' and 'work.dcache' of ↵Linus Torvalds2018-08-141-33/+6
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Misc cleanups from various folks all over the place I expected more fs/dcache.c cleanups this cycle, so that went into a separate branch. Said cleanups have missed the window, so in the hindsight it could've gone into work.misc instead. Decided not to cherry-pick, thus the 'work.dcache' branch" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: dcache: Use true and false for boolean values fold generic_readlink() into its only caller fs: shave 8 bytes off of struct inode fs: Add more kernel-doc to the produced documentation fs: Fix attr.c kernel-doc removed extra extern file_fdatawait_range * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill dentry_update_name_case()
| | * kill dentry_update_name_case()Al Viro2018-06-231-27/+0
| | | | | | | | | | | | | | | | | | | | | the last user is gone Spotted-by: Richard Weinberger <richard@nod.at> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fs: dcache: Use true and false for boolean valuesGustavo A. R. Silva2018-08-051-6/+6
| |/ | | | | | | | | | | | | | | | | | | Return statements in functions returning bool should use true or false instead of an integer value. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'work.mkdir' of ↵Linus Torvalds2018-08-141-28/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs icache updates from Al Viro: - NFS mkdir/open_by_handle race fix - analogous solution for FUSE, replacing the one currently in mainline - new primitive to be used when discarding halfway set up inodes on failed object creation; gives sane warranties re icache lookups not returning such doomed by still not freed inodes. A bunch of filesystems switched to that animal. - Miklos' fix for last cycle regression in iget5_locked(); -stable will need a slightly different variant, unfortunately. - misc bits and pieces around things icache-related (in adfs and jfs). * 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: jfs: don't bother with make_bad_inode() in ialloc() adfs: don't put inodes into icache new helper: inode_fake_hash() vfs: don't evict uninitialized inode jfs: switch to discard_new_inode() ext2: make sure that partially set up inodes won't be returned by ext2_iget() udf: switch to discard_new_inode() ufs: switch to discard_new_inode() btrfs: switch to discard_new_inode() new primitive: discard_new_inode() kill d_instantiate_no_diralias() nfs_instantiate(): prevent multiple aliases for directory inode
| * | new primitive: discard_new_inode()Al Viro2018-08-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't want open-by-handle picking half-set-up in-core struct inode from e.g. mkdir() having failed halfway through. In other words, we don't want such inodes returned by iget_locked() on their way to extinction. However, we can't just have them unhashed - otherwise open-by-handle immediately *after* that would've ended up creating a new in-core inode over the on-disk one that is in process of being freed right under us. Solution: new flag (I_CREATING) set by insert_inode_locked() and removed by unlock_new_inode() and a new primitive (discard_new_inode()) to be used by such halfway-through-setup failure exits instead of unlock_new_inode() / iput() combinations. That primitive unlocks new inode, but leaves I_CREATING in place. iget_locked() treats finding an I_CREATING inode as failure (-ESTALE, once we sort out the error propagation). insert_inode_locked() treats the same as instant -EBUSY. ilookup() treats those as icache miss. [Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | kill d_instantiate_no_diralias()Al Viro2018-08-021-27/+0
| |/ | | | | | | | | | | | | | | | | | | | | | | | | The only user is fuse_create_new_entry(), and there it's used to mitigate the same mkdir/open-by-handle race as in nfs_mkdir(). The same solution applies - unhash the mkdir argument, then call d_splice_alias() and if that returns a reference to preexisting alias, dput() and report success. ->mkdir() argument left unhashed negative with the preexisting alias moved in the right place is just fine from the ->mkdir() callers point of view. Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | make sure that __dentry_kill() always invalidates d_seq, unhashed or notAl Viro2018-08-101-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RCU pathwalk relies upon the assumption that anything that changes ->d_inode of a dentry will invalidate its ->d_seq. That's almost true - the one exception is that the final dput() of already unhashed dentry does *not* touch ->d_seq at all. Unhashing does, though, so for anything we'd found by RCU dcache lookup we are fine. Unfortunately, we can *start* with an unhashed dentry or jump into it. We could try and be careful in the (few) places where that could happen. Or we could just make the final dput() invalidate the damn thing, unhashed or not. The latter is much simpler and easier to backport, so let's do it that way. Reported-by: "Dae R. Jeong" <threeearcat@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | root dentries need RCU-delayed freeingAl Viro2018-08-061-2/+4
|/ | | | | | | | | | | | | | | | | | Since mountpoint crossing can happen without leaving lazy mode, root dentries do need the same protection against having their memory freed without RCU delay as everything else in the tree. It's partially hidden by RCU delay between detaching from the mount tree and dropping the vfsmount reference, but the starting point of pathwalk can be on an already detached mount, in which case umount-caused RCU delay has already passed by the time the lazy pathwalk grabs rcu_read_lock(). If the starting point happens to be at the root of that vfsmount *and* that vfsmount covers the entire filesystem, we get trouble. Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'work.misc' of ↵Linus Torvalds2018-06-041-49/+34
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Misc bits and pieces not fitting into anything more specific" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: delete unnecessary assignment in vfs_listxattr Documentation: filesystems: update filesystem locking documentation vfs: namei: use path_equal() in follow_dotdot() fs.h: fix outdated comment about file flags __inode_security_revalidate() never gets NULL opt_dentry make xattr_getsecurity() static vfat: simplify checks in vfat_lookup() get rid of dead code in d_find_alias() it's SB_BORN, not MS_BORN... msdos_rmdir(): kill BS comment remove rpc_rmdir() fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
| * get rid of dead code in d_find_alias()Al Viro2018-05-131-49/+34
| | | | | | | | | | | | | | | | | | | | All "try disconnected alias if nothing else fits" logics in d_find_alias() got accidentally disabled by Neil a while ago; for most of the callers it was the right thing to do, so fixes belong in few callers that *do* want disconnected aliases. This just takes the now-dead code in d_find_alias() out. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'work.dcache' of ↵Linus Torvalds2018-06-041-79/+43
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: "This is the first part of dealing with livelocks etc around shrink_dcache_parent()." * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: restore cond_resched() in shrink_dcache_parent() dput(): turn into explicit while() loop dcache: move cond_resched() into the end of __dentry_kill() d_walk(): kill 'finish' callback d_invalidate(): unhash immediately
| * | restore cond_resched() in shrink_dcache_parent()Al Viro2018-04-201-2/+7
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | dput(): turn into explicit while() loopAl Viro2018-04-161-18/+13
| | | | | | | | | | | | | | | | | | | | | No need to mess with gotos when the code yielded by straight while() isn't any worse... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | dcache: move cond_resched() into the end of __dentry_kill()Al Viro2018-04-161-3/+1
| | | | | | | | | | | | | | | | | | cond_resched() in shrink_dentry_list() is too early Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | d_walk(): kill 'finish' callbackAl Viro2018-04-161-11/+7
| | | | | | | | | | | | | | | | | | no users left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | d_invalidate(): unhash immediatelyAl Viro2018-04-161-46/+16
| |/ | | | | | | | | | | | | | | Once that is done, we can just hunt mountpoints down one by one; no new mountpoints can be added from now on, so we don't need anything tricky in finish() callback, etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* / do d_instantiate/unlock_new_inode combinations safelyAl Viro2018-05-111-0/+22
|/ | | | | | | | | | | | | | | | | | | | | | | | | For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/dcache.c: add cond_resched() in shrink_dentry_list()Nikolay Borisov2018-04-111-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As previously reported (https://patchwork.kernel.org/patch/8642031/) it's possible to call shrink_dentry_list with a large number of dentries (> 10000). This, in turn, could trigger the softlockup detector and possibly trigger a panic. In addition to the unmount path being vulnerable to this scenario, at SuSE we've observed similar situation happening during process exit on processes that touch a lot of dentries. Here is an excerpt from a crash dump. The number after the colon are the number of dentries on the list passed to shrink_dentry_list: PID 99760: 10722 PID 107530: 215 PID 108809: 24134 PID 108877: 21331 PID 141708: 16487 So we want to kill between 15k-25k dentries without yielding. And one possible call stack looks like: 4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8 5 [ffff8839ece41db0] evict at ffffffff811c3026 6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258 7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593 8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830 9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61 10 [ffff8839ece41ec0] release_task at ffffffff81059ebd 11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce 12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53 13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909 While some of the callers of shrink_dentry_list do use cond_resched, this is not sufficient to prevent softlockups. So just move cond_resched into shrink_dentry_list from its callers. David said: I've found hundreds of occurrences of warnings that we emit when need_resched stays set for a prolonged period of time with the stack trace that is included in the change log. Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dcache: account external names as indirectly reclaimable memoryRoman Gushchin2018-04-111-9/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I received a report about suspicious growth of unreclaimable slabs on some machines. I've found that it happens on machines with low memory pressure, and these unreclaimable slabs are external names attached to dentries. External names are allocated using generic kmalloc() function, so they are accounted as unreclaimable. But they are held by dentries, which are reclaimable, and they will be reclaimed under the memory pressure. In particular, this breaks MemAvailable calculation, as it doesn't take unreclaimable slabs into account. This leads to a silly situation, when a machine is almost idle, has no memory pressure and therefore has a big dentry cache. And the resulting MemAvailable is too low to start a new workload. To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is used to track the amount of memory, consumed by external names. The counter is increased in the dentry allocation path, if an external name structure is allocated; and it's decreased in the dentry freeing path. To reproduce the problem I've used the following Python script: import os for iter in range (0, 10000000): try: name = ("/some_long_name_%d" % iter) + "_" * 220 os.stat(name) except Exception: pass Without this patch: $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7811688 kB $ python indirect.py $ cat /proc/meminfo | grep MemAvailable MemAvailable: 2753052 kB With the patch: $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7809516 kB $ python indirect.py $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7749144 kB [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB] Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com [guro@fb.com: fix indirectly reclaimable memory accounting] Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* d_genocide: move export to definitionAl Viro2018-03-291-2/+2
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fold dentry_lock_for_move() into its sole caller and clean it upAl Viro2018-03-291-26/+23
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make non-exchanging __d_move() copy ->d_parent rather than swap themAl Viro2018-03-291-63/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently d_move(from, to) does the following: * name/parent of from <- old name/parent of to, from hashed there * to is unhashed * name of to is preserved * if from used to be detached, to gets detached * if from used to be attached, parent of to <- old parent of from. That's both user-visibly bogus and complicates reasoning a lot. Much saner semantics would be * name/parent of from <- name/parent of to, from hashed there. * to is unhashed * name/parent of to is unchanged. The price, of course, is that old parent of from might lose a reference. However, * all potentially cross-directory callers of d_move() have both parents pinned directly; typically, dentries themselves are grabbed only after we have grabbed and locked both parents. IOW, the decrement of old parent's refcount in case of d_move() won't reach zero. * __d_move() from d_splice_alias() is done to detached alias. No refcount decrements in that case * __d_move() from __d_unalias() *can* get the refcount to zero. So let's grab a reference to alias' old parent before calling __d_unalias() and dput() it after we'd dropped rename_lock. That does make d_splice_alias() potentially blocking. However, it has no callers in non-sleepable contexts (and the case where we'd grown that dget/dput pair is _very_ rare, so performance is not an issue). Another thing that needs adjustment is unlocking in the end of __d_move(); folded it in. And cleaned the remnants of bogus ordering from the "lock them in the beginning" counterpart - it's never been right and now (well, for 7 years now) we have that thing always serialized on rename_lock anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* split d_path() and friends into a separate fileAl Viro2018-03-291-466/+1
| | | | | | Those parts of fs/dcache.c are pretty much self-contained. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* dcache.c: trim includesAl Viro2018-03-291-3/+0
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/dcache: Avoid a try_lock loop in shrink_dentry_list()John Ogness2018-03-291-31/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | shrink_dentry_list() holds dentry->d_lock and needs to acquire dentry->d_inode->i_lock. This cannot be done with a spin_lock() operation because it's the reverse of the regular lock order. To avoid ABBA deadlocks it is done with a trylock loop. Trylock loops are problematic in two scenarios: 1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are preemptible. As a consequence the i_lock holder can be preempted by a higher priority task. If that task executes the trylock loop it will do so forever and live lock. 2) In virtual machines trylock loops are problematic as well. The VCPU on which the i_lock holder runs can be scheduled out and a task on a different VCPU can loop for a whole time slice. In the worst case this can lead to starvation. Commits 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9 ("shrink_dentry_list(): take parent's d_lock earlier") are addressing exactly those symptoms. Avoid the trylock loop by using dentry_kill(). When pruning ancestors, the same code applies that is used to kill a dentry in dput(). This also has the benefit that the locking order is now the same. First the inode is locked, then the parent. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of trylock loop around dentry_kill()Al Viro2018-03-291-7/+27
| | | | | | | | | | | In case when trylock in there fails, deal with it directly in dentry_kill(). Note that in cases when we drop and retake ->d_lock, we need to recheck whether to retain the dentry. Another thing is that dropping/retaking ->d_lock might have ended up with negative dentry turning into positive; that, of course, can happen only once... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* handle move to LRU in retain_dentry()Al Viro2018-03-291-13/+6
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* dput(): consolidate the "do we need to retain it?" into an inlined helperAl Viro2018-03-291-19/+23
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* split the slow part of lock_parent() offAl Viro2018-03-291-6/+12
| | | | | | Turn the "trylock failed" part into uninlined __lock_parent(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* now lock_parent() can't run into killed dentryAl Viro2018-03-291-10/+3
| | | | | | all remaining callers hold either a reference or ->i_lock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of trylock loop in locking dentries on shrink listAl Viro2018-03-291-37/+67
| | | | | | | | | | | | | | | | In case of trylock failure don't re-add to the list - drop the locks and carefully get them in the right order. For shrink_dentry_list(), somebody having grabbed a reference to dentry means that we can kick it off-list, so if we find dentry being modified under us we don't need to play silly buggers with retries anyway - off the list it is. The locking logics taken out into a helper of its own; lock_parent() is no longer used for dentries that can be killed under us. [fix from Eric Biggers folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* d_delete(): get rid of trylock loopAl Viro2018-03-121-19/+9
| | | | | | | just grab ->i_lock first; we have a positive dentry, nothing's going to happen to inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/dcache: Move dentry_kill() below lock_parent()John Ogness2018-03-121-31/+31
| | | | | | | | | | A subsequent patch will modify dentry_kill() to call lock_parent(). Move the dentry_kill() implementation "as is" below lock_parent() first. This will help simplify the review of the subsequent patch with dentry_kill() changes. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/dcache: Remove stale comment from dentry_kill()John Ogness2018-03-121-1/+0
| | | | | | | | | Commit 0d98439ea3c6 ("vfs: use lockred "dead" flag to mark unrecoverably dead dentries") removed the `ref' parameter in dentry_kill() but its documentation remained. Remove it. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* take write_seqcount_invalidate() into __d_drop()Al Viro2018-03-121-22/+22
| | | | | | ... and reorder it with making d_unhashed() true. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: dcache: Use READ_ONCE when accessing i_dir_seqWill Deacon2018-02-251-1/+1
| | | | | | | | | | i_dir_seq is subject to concurrent modification by a cmpxchg or store-release operation, so ensure that the relaxed access in d_alloc_parallel uses READ_ONCE. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: dcache: Avoid livelock between d_alloc_parallel and __d_addWill Deacon2018-02-251-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If d_alloc_parallel runs concurrently with __d_add, it is possible for d_alloc_parallel to continuously retry whilst i_dir_seq has been incremented to an odd value by __d_add: CPU0: __d_add n = start_dir_add(dir); cmpxchg(&dir->i_dir_seq, n, n + 1) == n CPU1: d_alloc_parallel retry: seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1; hlist_bl_lock(b); bit_spin_lock(0, (unsigned long *)b); // Always succeeds CPU0: __d_lookup_done(dentry) hlist_bl_lock bit_spin_lock(0, (unsigned long *)b); // Never succeeds CPU1: if (unlikely(parent->d_inode->i_dir_seq != seq)) { hlist_bl_unlock(b); goto retry; } Since the simple bit_spin_lock used to implement hlist_bl_lock does not provide any fairness guarantees, then CPU1 can starve CPU0 of the lock and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot exit its retry loop because the sequence number always has the bottom bit set. This patch resolves the livelock by not taking hlist_bl_lock in d_alloc_parallel if the sequence counter is odd, since any subsequent masked comparison with i_dir_seq will fail anyway. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Naresh Madhusudana <naresh.madhusudana@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* lock_parent() needs to recheck if dentry got __dentry_kill'ed under itAl Viro2018-02-241-3/+8
| | | | | | | | | | | | | | | In case when dentry passed to lock_parent() is protected from freeing only by the fact that it's on a shrink list and trylock of parent fails, we could get hit by __dentry_kill() (and subsequent dentry_kill(parent)) between unlocking dentry and locking presumed parent. We need to recheck that dentry is alive once we lock both it and parent *and* postpone rcu_read_unlock() until after that point. Otherwise we could return a pointer to struct dentry that already is rcu-scheduled for freeing, with ->d_lock held on it; caller's subsequent attempt to unlock it can end up with memory corruption. Cc: stable@vger.kernel.org # 3.12+, counting backports Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'overlayfs-linus' of ↵Linus Torvalds2018-02-051-31/+57
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This work from Amir adds NFS export capability to overlayfs. NFS exporting an overlay filesystem is a challange because we want to keep track of any copy-up of a file or directory between encoding the file handle and decoding it. This is achieved by indexing copied up objects by lower layer file handle. The index is already used for hard links, this patchset extends the use to NFS file handle decoding" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (51 commits) ovl: check ERR_PTR() return value from ovl_encode_fh() ovl: fix regression in fsnotify of overlay merge dir ovl: wire up NFS export operations ovl: lookup indexed ancestor of lower dir ovl: lookup connected ancestor of dir in inode cache ovl: hash non-indexed dir by upper inode for NFS export ovl: decode pure lower dir file handles ovl: decode indexed dir file handles ovl: decode lower file handles of unlinked but open files ovl: decode indexed non-dir file handles ovl: decode lower non-dir file handles ovl: encode lower file handles ovl: copy up before encoding non-connectable dir file handle ovl: encode non-indexed upper file handles ovl: decode connected upper dir file handles ovl: decode pure upper file handles ovl: encode pure upper file handles ovl: document NFS export vfs: factor out helpers d_instantiate_anon() and d_alloc_anon() ovl: store 'has_upper' and 'opaque' as bit flags ...
| * vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()Miklos Szeredi2018-01-241-31/+56
| | | | | | | | | | | | | | | | Those helpers are going to be used by overlayfs to implement NFS export decode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ovl: verify directory index entries on mountAmir Goldstein2018-01-241-0/+1
| | | | | | | | | | | | | | | | | | | | Directory index entries should have 'upper' xattr pointing to the real upper dir. Verifying that the upper dir file handle is not stale is expensive, so only verify stale directory index entries on mount if NFS export feature is enabled. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | Merge tag 'usercopy-v4.16-rc1' of ↵Linus Torvalds2018-02-041-4/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardened usercopy whitelisting from Kees Cook: "Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass all hardened usercopy checks since these sizes cannot change at runtime.) This new check is WARN-by-default, so any mistakes can be found over the next several releases without breaking anyone's system. The series has roughly the following sections: - remove %p and improve reporting with offset - prepare infrastructure and whitelist kmalloc - update VFS subsystem with whitelists - update SCSI subsystem with whitelists - update network subsystem with whitelists - update process memory with whitelists - update per-architecture thread_struct with whitelists - update KVM with whitelists and fix ioctl bug - mark all other allocations as not whitelisted - update lkdtm for more sensible test overage" * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits) lkdtm: Update usercopy tests for whitelisting usercopy: Restrict non-usercopy caches to size 0 kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl kvm: whitelist struct kvm_vcpu_arch arm: Implement thread_struct whitelist for hardened usercopy arm64: Implement thread_struct whitelist for hardened usercopy x86: Implement thread_struct whitelist for hardened usercopy fork: Provide usercopy whitelisting for task_struct fork: Define usercopy region in thread_stack slab caches fork: Define usercopy region in mm_struct slab caches net: Restrict unwhitelisted proto caches to size 0 sctp: Copy struct sctp_sock.autoclose to userspace using put_user() sctp: Define usercopy region in SCTP proto slab cache caif: Define usercopy region in caif proto slab cache ip: Define usercopy region in IP proto slab cache net: Define usercopy region in struct proto slab cache scsi: Define usercopy region in scsi_sense_cache slab cache cifs: Define usercopy region in cifs_request slab cache vxfs: Define usercopy region in vxfs_inode slab cache ufs: Define usercopy region in ufs_inode_cache slab cache ...
| * | vfs: Define usercopy region in names_cache slab cachesDavid Windsor2018-01-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VFS pathnames are stored in the names_cache slab cache, either inline or across an entire allocation entry (when approaching PATH_MAX). These are copied to/from userspace, so they must be entirely whitelisted. cache object allocation: include/linux/fs.h: #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL) example usage trace: strncpy_from_user+0x4d/0x170 getname_flags+0x6f/0x1f0 user_path_at_empty+0x23/0x40 do_mount+0x69/0xda0 SyS_mount+0x83/0xd0 fs/namei.c: getname_flags(...): ... result = __getname(); ... kname = (char *)result->iname; result->name = kname; len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX); ... if (unlikely(len == EMBEDDED_NAME_MAX)) { const size_t size = offsetof(struct filename, iname[1]); kname = (char *)result; result = kzalloc(size, GFP_KERNEL); ... result->name = kname; len = strncpy_from_user(kname, filename, PATH_MAX); In support of usercopy hardening, this patch defines the entire cache object in the names_cache slab cache as whitelisted, since it may entirely hold name strings to be copied to/from userspace. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, add usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
| * | dcache: Define usercopy region in dentry_cache slab cacheDavid Windsor2018-01-151-2/+3
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a dentry name is short enough, it can be stored directly in the dentry itself (instead in a separate kmalloc allocation). These dentry short names, stored in struct dentry.d_iname and therefore contained in the dentry_cache slab cache, need to be coped to userspace. cache object allocation: fs/dcache.c: __d_alloc(...): ... dentry = kmem_cache_alloc(dentry_cache, ...); ... dentry->d_name.name = dentry->d_iname; example usage trace: filldir+0xb0/0x140 dcache_readdir+0x82/0x170 iterate_dir+0x142/0x1b0 SyS_getdents+0xb5/0x160 fs/readdir.c: (called via ctx.actor by dir_emit) filldir(..., const char *name, ...): ... copy_to_user(..., name, namlen) fs/libfs.c: dcache_readdir(...): ... next = next_positive(dentry, p, 1) ... dir_emit(..., next->d_name.name, ...) In support of usercopy hardening, this patch defines a region in the dentry_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamic copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust hunks for kmalloc-specific things moved later] [kees: adjust commit log, provide usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
* | Merge branch 'KASAN-read_word_at_a_time'Linus Torvalds2018-02-011-6/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge KASAN word-at-a-time fixups from Andrey Ryabinin. The word-at-a-time optimizations have caused headaches for KASAN, since the whole point is that we access byte streams in bigger chunks, and KASAN can be unhappy about the potential extra access at the end of the string. We used to have a horrible hack in dcache, and then people got complaints from the strscpy() case. This fixes it all up properly, by adding an explicit helper for the "access byte stream one word at a time" case. * emailed patches from Andrey Ryabinin <aryabinin@virtuozzo.com>: fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports" fs/dcache: Use read_word_at_a_time() in dentry_string_cmp() lib/strscpy: Shut up KASAN false-positives in strscpy() compiler.h: Add read_word_at_a_time() function. compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
| * | fs: dcache: Revert "manually unpoison dname after allocation to shut up ↵Andrey Ryabinin2018-02-011-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kasan's reports" This reverts commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd. It's no longer needed since dentry_string_cmp() now uses read_word_at_a_time() to avoid kasan's reports. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()Andrey Ryabinin2018-02-011-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dentry_string_cmp() performs the word-at-a-time reads from 'cs' and may read slightly more than it was requested in kmallac(). Normally this would make KASAN to report out-of-bounds access, but this was workarounded by commit df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports"). This workaround is not perfect, since it allows out-of-bounds access to dentry's name for all the code, not just in dentry_string_cmp(). So it would be better to use read_word_at_a_time() instead and revert commit df4c0e36f1b1. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>